TUTORIAL CODE: GENERIC WORKFLOW FOR Q-LEARNING
This workflow (and associated code) has been extracted from a tutorial about Q-Learning (see reference below) and re-adapted by me for the specific purpose of Policy Optimization in Oil Production. It represents a simple and didactic example of Q-Learning. The idea is to assign negative rewards to state transitions that are not admitted. In the case of Oil Production optimization, this happens, for instance, when an increase in oil production implies a high risk of water intrusion in the production well(s). A value of -1 in the R matrix therefore indicates a state transition that implies a significant increase in the risk of water breakthrough, whereas high scores in the R matrix indicate transitions with a large increase in oil production and no significant increase in risk. In the example reported here, you can set the input R values, assuming that each one corresponds to a precise state/action transition in terms of production and of the observed variation of the water distance from the well. You can then run the Q-Learning approach through all the steps below, in order to verify how the Q matrix is updated during training. Finally, you can verify the optimal sequence of steps (optimal policy) for reaching the target state (which you can fix as well).
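Before running the code, it may help to see how such an R matrix can be assembled programmatically. The sketch below is a hypothetical addition of mine (not part of the original tutorial): it rebuilds the same R matrix used in the code from a dictionary of admissible state/action transitions, where every missing pair keeps the default reward of -1 (transition not admitted because of water-breakthrough risk).

# Hypothetical sketch: build the R matrix from a dictionary of admissible transitions.
# The entries reproduce the R matrix defined in the code below.
import numpy as np

n_states = 6
R_example = np.full((n_states, n_states), -1)      # default: transition not admitted
admissible = {                                     # (from_state, to_state): reward
    (0, 4): 25,
    (1, 3): 0,  (1, 5): 25,
    (2, 3): 0,
    (3, 1): 50, (3, 2): 0,  (3, 4): 50,
    (4, 1): 0,  (4, 2): 0,  (4, 5): 100,
    (5, 1): 0,  (5, 4): 0,  (5, 5): 100,
}
for (s, a), reward in admissible.items():
    R_example[s, a] = reward
print(R_example)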
import numpy as np
# R matrix
R = np.matrix([ [-1,-1,-1,-1,25,-1],
[-1,-1,-1,0,-1,25],
[-1,-1,-1,0,-1,-1],
[-1,50,0,-1,50,-1],
[-1,0,0,-1,-1,100],
[-1,0,-1,-1,0,100] ])
# Q matrix
Q = np.matrix(np.zeros([6,6]))
# Gamma (learning parameter).
gamma = 0.8
# Initial state. (Usually to be chosen at random)
initial_state = 1
# This function returns all available actions in the state given as an argument
def available_actions(state):
    current_state_row = R[state,]
    av_act = np.where(current_state_row >= 0)[1]
    return av_act
# Get available actions in the current state
available_act = available_actions(initial_state)
# This function chooses at random which action to be performed within the range
# of all the available actions.
def sample_next_action(available_actions_range):
    next_action = int(np.random.choice(available_actions_range, 1))
    return next_action
# Sample next action to be performed
action = sample_next_action(available_act)
# This function updates the Q matrix according to the path selected and the Q
# learning algorithm
def update(current_state, action, gamma):
    # Find the action(s) with the maximum Q value in the next state (= action)
    max_index = np.where(Q[action,] == np.max(Q[action,]))[1]
    if max_index.shape[0] > 1:
        # Break ties at random
        max_index = int(np.random.choice(max_index, size=1))
    else:
        max_index = int(max_index)
    max_value = Q[action, max_index]
    # Q learning formula: Q(state, action) = R(state, action) + gamma * max(Q(next state, :))
    Q[current_state, action] = R[current_state, action] + gamma * max_value
# Update Q matrix
update(initial_state,action,gamma)
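As a quick sanity check (an addition of mine, not in the original tutorial): because Q is still all zeros before training, this very first update reduces to the immediate reward. For example, if the sampled action happened to be 5, the formula gives Q[1, 5] = R[1, 5] + gamma * max(Q[5, :]) = 25 + 0.8 * 0 = 25.

# Optional check (illustrative only): whatever action was sampled, the first
# updated entry equals the immediate reward, since Q was initialized to zeros.
print("Q after the first update:")
print(Q)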
#-------------------------------------------------------------------------------
# Training
# Train over 10 000 iterations. (Re-iterate the process above.)
for i in range(10000):
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_act = available_actions(current_state)
    action = sample_next_action(available_act)
    update(current_state, action, gamma)
# Normalize the "trained" Q matrix
print("Trained Q matrix:")
print(Q/np.max(Q)*100)
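If you want to check that 10,000 iterations are enough, one simple option (a hypothetical addition of mine, not part of the original tutorial) is to re-run the training while recording how much each update changes Q; the changes should become negligible well before the end of training.

# Hypothetical convergence check: track the change of Q at each iteration.
scores = []
Q_check = np.matrix(np.zeros([6, 6]))
for i in range(10000):
    s = np.random.randint(0, int(Q_check.shape[0]))
    a = int(np.random.choice(np.where(R[s,] >= 0)[1], 1))
    old_value = Q_check[s, a]
    Q_check[s, a] = R[s, a] + gamma * np.max(Q_check[a,])
    scores.append(abs(Q_check[s, a] - old_value))
print("Average Q change over the last 100 iterations:", np.mean(scores[-100:]))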
#-------------------------------------------------------------------------------
# Testing
# Goal state = 5
# Best sequence path starting from 2 -> 2, 3, 4, 5 (see the output below)
current_state = 2
steps = [current_state]
while current_state != 5:
    next_step_index = np.where(Q[current_state,] == np.max(Q[current_state,]))[1]
    if next_step_index.shape[0] > 1:
        next_step_index = int(np.random.choice(next_step_index, size=1))
    else:
        next_step_index = int(next_step_index)
    steps.append(next_step_index)
    current_state = next_step_index
# Print selected sequence of steps
print("Selected path:")
print(steps)
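The same greedy look-up can be wrapped in a small helper function (a hypothetical addition of mine, not in the original tutorial), so that the optimal policy can be inspected from any starting production state, not only from state 2.

# Hypothetical helper: extract the greedy path from any start state to the goal
# state, using the trained Q matrix defined above.
def greedy_path(start_state, goal_state=5):
    path = [start_state]
    state = start_state
    while state != goal_state:
        candidates = np.where(Q[state,] == np.max(Q[state,]))[1]
        state = int(np.random.choice(candidates, 1))
        path.append(state)
    return path

for s in range(6):
    print("From state", s, "->", greedy_path(s))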
Trained Q matrix:
[[ 0. 0. 0. 0. 85. 0. ]
[ 0. 0. 0. 72. 0. 85. ]
[ 0. 0. 0. 72. 0. 0. ]
[ 0. 78. 57.6 0. 90. 0. ]
[ 0. 68. 57.6 0. 0. 100. ]
[ 0. 68. 0. 0. 80. 100. ]]
Selected path:
[2, 3, 4, 5]
REF: http://firsttimeprogrammer.blogspot.com/2016/09/getting-ai-smarter-with-q-learning.html