
Q Learning generic

Authors:
  • Eni SpA - San Donato Milanese (MI)

Abstract

Q-LEARNING tutorial
TUTORIAL CODE: GENERIC WORKFLOW FOR Q-LEARNING
This workow (and associated code) has been extracted from a tutorial (see reference below)
about Q-Learning and re-adaped by me for the specic purpose of Policy Optimization in Oil
Production. It represents a simple and didactical example of Q-Learning. The idea is to assign
negative rewards to state transitions that are not admitted. In the case of Oil Production
optimization, this happens for instance when an increase in oil production implies high risk of water
intrusion in the production well(s). For instance, when we see the value of -1 in the R Matrix, it
indicates a production state that implies a signicant increase in the risk of water breakthrough.
Instead, high scores in the R Matrix indicate transitions in which we have high oil production
increase, without signicant risk increase. In the example here reported, you can set the input R
values, assuming that each one of them corresponds to a precise state/action transition, in terms
of production and observed variations of water distance from the well. Then you can test the Q-
Learning approach through all the steps below, in order to verify how the Q-Marix is updated through
the training. Finally, you can verify what is the optimal sequence of steps (optimal policy) for
achieving the target step (that you can x as well).
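As a purely illustrative sketch (not part of the original notebook), the same R matrix used in the code below could be assembled programmatically from a list of admitted state/action transitions and their rewards, with -1 marking transitions that are not admitted (e.g. those implying a high risk of water breakthrough). The state indices and reward values here simply reproduce the matrix defined in the code.

import numpy as np
# Illustrative sketch: build the 6x6 reward matrix from the admitted transitions.
# Forbidden transitions keep the default value of -1.
n_states = 6
admitted = {  # (state, action): reward -- same values as the R matrix below
    (0, 4): 25,
    (1, 3): 0,  (1, 5): 25,
    (2, 3): 0,
    (3, 1): 50, (3, 2): 0, (3, 4): 50,
    (4, 1): 0,  (4, 2): 0, (4, 5): 100,
    (5, 1): 0,  (5, 4): 0, (5, 5): 100,
}
R_sketch = -1 * np.ones((n_states, n_states))
for (s, a), reward in admitted.items():
    R_sketch[s, a] = reward
print(R_sketch)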
import numpy as np
# R matrix
R = np.matrix([[-1, -1, -1, -1,  25,  -1],
               [-1, -1, -1,  0,  -1,  25],
               [-1, -1, -1,  0,  -1,  -1],
               [-1, 50,  0, -1,  50,  -1],
               [-1,  0,  0, -1,  -1, 100],
               [-1,  0, -1, -1,   0, 100]])
# Q matrix
Q = np.matrix(np.zeros([6,6]))
# Gamma (discount factor).
gamma = 0.8
# Initial state. (Usually to be chosen at random)
initial_state = 1
# This function returns all available actions in the state given as an argument
def available_actions(state):
    current_state_row = R[state,]
    av_act = np.where(current_state_row >= 0)[1]
    return av_act
# Get available actions in the current state
available_act = available_actions(initial_state)
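# For example, with initial_state = 1 the row R[1,] = [-1, -1, -1, 0, -1, 25] has
# non-negative entries at indices 3 and 5, so available_act is [3, 5].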
# This function chooses at random which action to be performed within the range
# of all the available actions.
def sample_next_action(available_actions_range):
    next_action = int(np.random.choice(available_actions_range))
    return next_action
# Sample next action to be performed
action = sample_next_action(available_act)
# This function updates the Q matrix according to the path selected and the Q
# learning algorithm
def update(current_state, action, gamma):
    max_index = np.where(Q[action,] == np.max(Q[action,]))[1]
    if max_index.shape[0] > 1:
        max_index = int(np.random.choice(max_index, size=1))
    else:
        max_index = int(max_index)
    max_value = Q[action, max_index]
    # Q learning formula
    Q[current_state, action] = R[current_state, action] + gamma * max_value
# Update Q matrix
update(initial_state,action,gamma)
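# Illustrative note on the update rule used above:
#   Q[state, action] = R[state, action] + gamma * max(Q[action, :])
# On the very first update Q is still all zeros, so if the sampled action from
# state 1 is, for example, 5, then Q[1, 5] = R[1, 5] + 0.8 * 0 = 25.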
#-------------------------------------------------------------------------------
# Training
# Train over 10 000 iterations. (Re-iterate the process above).
for i in range(10000):
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_act = available_actions(current_state)
    action = sample_next_action(available_act)
    update(current_state, action, gamma)
# Normalize the "trained" Q matrix
print("Trained Q matrix:")
print(Q/np.max(Q)*100)
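# Optional inspection step (not in the original notebook): derive the greedy
# policy, i.e. the action with the highest Q value in each row of the trained Q.
greedy_policy = np.asarray(Q).argmax(axis=1)
for s, a in enumerate(greedy_policy):
    print("state", s, "-> greedy action", a)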
#-------------------------------------------------------------------------------
# Testing
# Goal state = 5
# Greedy sequence of states starting from 2: with the trained Q matrix shown
# below this gives 2, 3, 4, 5
current_state = 2
steps = [current_state]
while current_state != 5:
    next_step_index = np.where(Q[current_state,] == np.max(Q[current_state,]))[1]
    if next_step_index.shape[0] > 1:
        next_step_index = int(np.random.choice(next_step_index, size=1))
    else:
        next_step_index = int(next_step_index)
    steps.append(next_step_index)
    current_state = next_step_index
# Print selected sequence of steps
print("Selected path:")
print(steps)
Trained Q matrix:
[[ 0. 0. 0. 0. 85. 0. ]
[ 0. 0. 0. 72. 0. 85. ]
[ 0. 0. 0. 72. 0. 0. ]
[ 0. 78. 57.6 0. 90. 0. ]
[ 0. 68. 57.6 0. 0. 100. ]
[ 0. 68. 0. 0. 80. 100. ]]
Selected path:
[2, 3, 4, 5]
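As a further illustrative check (not part of the original notebook), the cumulative reward collected along the selected path can be read off the R matrix; for [2, 3, 4, 5] this is R[2,3] + R[3,4] + R[4,5] = 0 + 50 + 100 = 150.

# Illustrative check: sum the R rewards along the selected path
path_reward = sum(R[s, a] for s, a in zip(steps[:-1], steps[1:]))
print("Cumulative reward along the selected path:", path_reward)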
REF: http://rsttimeprogrammer.blogspot.com/2016/09/getting-ai-smarter-with-q-learning.html
