Concept and the implementation of a tool to convert industry
4.0 environments modeled as FSM to an OpenAI Gym wrapper
Kallil M. C. Zielinski
kallil@alunos.utfpr.edu.br
UTFPR - Federal University of Technology Paraná
Pato Branco, PR, Brazil
Marcelo Teixeira
marceloteixeira@utfpr.edu.br
UTFPR - Federal University of Technology Paraná
Pato Branco, PR, Brazil
Richardson Ribeiro
richardsonr@utfpr.edu.br
UTFPR - Federal University of Technology Paraná
Pato Branco, PR, Brazil
Dalcimar Casanova
dalcimar@utfpr.edu.br
UTFPR - Federal University of Technology Paraná
Pato Branco, PR, Brazil
ABSTRACT
Industry 4.0 systems have a high demand for optimization in their tasks, whether to minimize cost, maximize production, or even synchronize their actuators to finish or speed up the manufacture of a product. Those challenges make industrial environments a suitable scenario for applying all modern reinforcement learning (RL) concepts. The main difficulty, however, is the lack of such industrial environments. This work therefore presents the concept and the implementation of a tool that allows us to convert any dynamic system modeled as an FSM to the open-source Gym wrapper. After that, it is possible to employ any RL method to optimize any desired task. In the first tests of the proposed tool, we show traditional Q-learning and Deep Q-learning methods running over two simple environments.
KEYWORDS
Industry 4.0 Environment Simulation, Reinforcement Learning, Q-learning, Deep Q Network, Discrete Event Systems, Gym Wrapper, Finite State Machines
1 INTRODUCTION
In the era of the fourth industrial revolution, frequently referred to as Industry 4.0, one key requirement for Cyber-Physical Production Systems is the ability to react adaptively to dynamic circumstances of production processes [11][20][29]. This ability leads to the necessity of developing acting components (actuators), such as robots, with some capacity for self-adaptation and learning, such that they can modify their behavior according to the experience acquired through interactions with each other and with the environment. Actuators with such characteristics are called intelligent agents [37].
This raises the question of how Industry 4.0 components can be programmed in such a way that they recognize variations in the environment and autonomously adapt themselves to act accordingly, in a concurrent, safe, flexible, customized, and maximally permissive way. Together, those features make the task of programming such industrial controller agents hard, as the usual paradigms for software development become inappropriate.
From the automation perspective, the behavior of a system can be described by its evolution over time. When this evolution comes from signals observed in physical equipment and devices, it usually has an asynchronous nature over time, which ends up defining how this system can be represented by a model. The class of systems that share this asynchronous characteristic is called Discrete Event Systems (DESs) [5], whose modeling is in general based on Finite State Machines (FSMs).
However, FSMs face significant limitations when modeling large and complex DES dynamics. Advanced features, such as context recognition and switching, and multiphysics phenomena, are difficult to express with ordinary FSMs. They are usually associated with large and intricate models that not rarely have to be built by hand, which challenges both the modeling and the processing steps. With the increasing demand for flexibility, ordinary FSM-based modeling methods have become insufficient to express emerging phenomena of industrial processes, such as dynamic context handling [27].
This paper is based on the idea that an actuator agent can sense the environment and be controlled safely, as usual, by using conventional control synthesis methods. However, it can additionally complement its actions with an operating agent that observes the plant under control and gathers experiences about its interaction with the environment. In other words, the observer acts as a recommender for the events that are eligible to occur. Based on the experiences calculated by the agent, it seeks to adapt its behavior and control the components in such a way that desired tasks are performed as intended, but in an optimized, adaptive way.
This concept of adaptability, in conjunction with large and complex industrial environments, whose predictability is practically nonexistent and whose main objective is the optimization of results, makes industrial environments ideal for applying model-free methods and all modern RL concepts. These environments have a large number of states, can be viewed as a Markov process, and are oriented towards a final objective.
However, the first difficulty in the direct application of reinforcement learning (RL) methods to Industry 4.0 is the lack of environments in which to simulate the industrial process. Although an industrial process can be modeled as a DES, this model is not suitable for the direct application of RL methods. For RL methods, we need a simulated environment where the agent can interact and learn, exploring new states and transitions in order to reach a good solution for the desired objective.
The existence of several environments is a crucial point for the success of RL methods, especially the deep-learning-based ones, over different knowledge areas such as games [19], classic robotics [13], natural language processing [17], computer vision [3], among others. So, our main objective here is to provide the same diversity of environments to Industry 4.0, initially focused on the coordination of industry components, but not limited to that.
In this way, this paper presents the concept and the implementation of a tool that allows converting any dynamic system modeled as an FSM to the open-source Gym environment [4]. The result is a complex computational structure, obtained through a simple, condensed, and modular design, that is apt to receive modern RL treatment.
Gym is a toolkit for developing and comparing reinforcement learning algorithms. It makes no assumptions about the structure of the agent and is compatible with any numerical computation library, such as PyTorch [22] or TensorFlow [1]. Once modeled as a Gym environment, the DES structure can be viewed as a Markov Decision Process (MDP) and processed by means of RL [32], Q-learning [35], or any other deep RL approach [19] without a lot of problem-specific engineering.
Structurally, the manuscript is organized as follows: a brief literature overview is presented in Section 2; Section 3 discusses the background; Section 4 compares details of RL and DES; Section 5 introduces the main results; a case study is presented in Section 6; and some conclusions and perspectives are presented in Section 8.
2 STATE OF THE ART
Reinforcement learning applications in industry are nothing new. However, in [8] the authors argue that implementing such methods in real industrial environments is often a frustrating and tedious process, and that academic research groups have only limited access to real industrial data and applications. Despite the difficulties, several works have been carried out. The work of [15] reviews applications both in robotics and in Industry 4.0.
From the perspective of industrial environments, [8] is the work that presents the idea closest to the one proposed here in this article. The authors designed a benchmark for the RL community to attempt to bridge the gap between academic research and real industrial problems. Its open-source implementation, based on OpenAI Gym, is available at https://github.com/siemens/industrialbenchmark. On the official Gym page [21] there are a few third-party Gym environments that resemble an industrial plant or robotics setting.
However, the general use and application of RL in industry are
still limited by the static, one-at-a-time, way in which the environ-
ment is modeled, with rewards and states being manually recon-
figured for each variation of the physical process. This approach
can be replaced, to some extent, by a DES-based strategy in order
to map those variations more efficiently.
In fact, while part of the behavior of industrial systems is classically continuous in time, the events happen in a discrete setting and variations are essentially stochastic. Therefore, the ability to map industrial processes as DESs can allow handling events and variations more efficiently via RL. DES-based approaches have been classically used in industry for both modeling and control [25]. However, cyber-physical features of processes are still difficult to capture with ordinary DES theories and tools [27].
In this paper, we claim that any industrial environment that can be modeled as a DES can be automatically transformed into an OpenAI Gym wrapper, which bridges the gap between real-world applications and the RL area. We remark that, in most cases, components of a DES can be described by simple, compact, and modular DES models that can be combined automatically to represent the entire system. Therefore, the task of obtaining the system model is never actually a burden to be carried entirely by the designer, which can be decisive in RL applications.
3 BACKGROUND
3.1 Emerging industrial systems
Intensive data processing, customized production, flexible control,
autonomous decisions. Much has been discussed about these asser-
tions in recent years, in the context of Industry 4.0 (I4.0).
I4.0 emerges from the evolution of centralized production systems, based on embedded microprocessors, towards distributed Cyber-Physical Systems (CPSs) [6,7,16]. A CPS is responsible for the fusion of the real and virtual worlds, and is therefore an indispensable link to modern systems. Technically, it integrates components or equipment with embedded software, which are connected to form a single networked system. This integration model leads to the acquisition and traffic of a large volume of data, processed locally and made available to other components via the Internet [2].
A CPS links production processes to technologies such as Big Data, Internet of Things, Web Services, and Computational Intelligence, which together allow the setup of highly flexible processes, promisingly more efficient, cost-effective, and on demand for each user profile [10]. In practice, the I4.0 principles disrupt the usual production model consolidated so far, based on centralization and large factories, creating decentralized, autonomous, interconnected, and intelligent production chains that promise to support the industry of the future.
It is conceivable, however, to imagine how many possible obstacles separate I4.0 from its consolidation as a de facto industrial revolution. Safety, interoperability, performance, risks, etc., impose severe restrictions on its practice. Also, the technical infrastructure of an I4.0-based system is not necessarily compatible with the current methods of control and automation, requiring an additional level of integration.
The conversion method proposed in this paper can be seen as
a tool to improve interoperability among I4.0, process, and expert
methods. We focus on the event level of the process, which is dis-
cussed in the following.
3.2 The discrete nature of industrial systems
The modeling of a system is justified by several reasons, including the fact that it is not always possible, or safe, to act experimentally over its real structure. Moreover, a model allows engineers to abstract irrelevant parts of the system, facilitating its understanding.
If a system has a discrete nature and its transitions are driven by sporadic events, it is called a Discrete Event System (DES). A DES is a dynamic system that evolves according to physical signals, named events, that occur at irregular and unknown time intervals [5]. This contrasts with dynamic systems that evolve continuously in time.
3.3 Formal background on DESs
Differently from continuous-time systems, which can be naturally modeled by differential equations, DESs are more naturally represented by Finite State Machines (FSMs). An FSM can be formally introduced as a tuple $G = \langle \Sigma, Q, q^\circ, Q^\omega, \rightarrow \rangle$ where: $\Sigma$ is a finite set of events called the alphabet; $Q$ is a finite set of states; $Q^\omega \subseteq Q$ is a subset of marked states (in general associated with the idea of complete tasks); $q^\circ \in Q$ is the initial state; and $\rightarrow \subseteq Q \times \Sigma \times Q$ is the transition relation.
Sometimes, it is convenient to expose $G$ in the usual graphical convention, although this view may not be illustrative for large models. Figure 1 shows the graphical view of an FSM modeling a simple machine with only two states.
Figure 1: Graphical layout of an FSM (states $\mathit{Off}$ and $\mathit{On}$, events $\alpha$ and $\beta$).
In this case, a transition between any two states $q, q' \in Q$, with event $\sigma \in \Sigma$, is denoted by $q \xrightarrow{\sigma} q'$. In this example, $Q = \{\mathit{on}, \mathit{off}\}$, $\Sigma = \{\alpha, \beta\}$, $q^\circ = \mathit{off}$, and $Q^\omega = \{\mathit{off}\}$, meaning that tasks are completed only when the machine is turned off, which in this case coincides with the initial state. From now on, FSMs are exposed graphically or, when too large to lay out, we mention only their number of states and transitions, which are acceptable measures of their dimension.
When a DES is formed by a set $J = \{1, \cdots, m\}$ of components, which is quite often the case, each component can be modeled by a different FSM $G_j$, $j \in J$, and combined afterward by synchronous composition. This allows the entire system to be designed modularly, which can be decisive for large-scale systems. The result is a global, combined behavior (also called the plant), where all components work simultaneously without any external restriction. For this reason, the plant composition is also known as the open-loop plant.
Consider two FSMs, $G_1 = \langle \Sigma_1, Q_1, q^\circ_1, Q^\omega_1, \rightarrow_1 \rangle$ and $G_2 = \langle \Sigma_2, Q_2, q^\circ_2, Q^\omega_2, \rightarrow_2 \rangle$. The synchronous composition of $G_1$ and $G_2$ is defined as $G_1 \| G_2 = \langle \Sigma_1 \cup \Sigma_2, Q_1 \times Q_2, (q^\circ_1, q^\circ_2), Q^\omega_1 \times Q^\omega_2, \rightarrow \rangle$, where elements in the set $\rightarrow$ satisfy the following conditions:
• $(q_1, q_2) \xrightarrow{\sigma} (q'_1, q'_2)$ if $\sigma \in \Sigma_1 \cap \Sigma_2$, $q_1 \xrightarrow{\sigma} q'_1$, and $q_2 \xrightarrow{\sigma} q'_2$;
• $(q_1, q_2) \xrightarrow{\sigma} (q'_1, q_2)$ if $\sigma \in \Sigma_1 \setminus \Sigma_2$ and $q_1 \xrightarrow{\sigma} q'_1$;
• $(q_1, q_2) \xrightarrow{\sigma} (q_1, q'_2)$ if $\sigma \in \Sigma_2 \setminus \Sigma_1$ and $q_2 \xrightarrow{\sigma} q'_2$.
The synchronous composition merges (synchronizes) the events shared between $G_1$ and $G_2$, and it interleaves the others. A transition that does not follow any of these rules is said to be undefined, which in practice means it is disabled.
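To make these rules concrete, the sketch below shows one possible, simplified Python implementation of the synchronous composition of two FSMs represented as plain dictionaries. The data structure, function name, and field names are illustrative assumptions, not the interface of the tool released with this paper.

```python
from itertools import product

def compose(g1, g2):
    """Synchronous composition of two FSMs given as dicts with keys:
    'events' (set), 'states' (set), 'init' (state), 'marked' (set),
    'trans' (dict mapping (state, event) -> next state)."""
    shared = g1["events"] & g2["events"]
    trans = {}
    for (q1, q2), e in product(product(g1["states"], g2["states"]),
                               g1["events"] | g2["events"]):
        t1 = g1["trans"].get((q1, e))
        t2 = g2["trans"].get((q2, e))
        if e in shared:
            # shared events must be enabled in both components
            if t1 is not None and t2 is not None:
                trans[((q1, q2), e)] = (t1, t2)
        elif e in g1["events"] and t1 is not None:
            trans[((q1, q2), e)] = (t1, q2)   # interleave G1's private event
        elif e in g2["events"] and t2 is not None:
            trans[((q1, q2), e)] = (q1, t2)   # interleave G2's private event
    return {
        "events": g1["events"] | g2["events"],
        "states": set(product(g1["states"], g2["states"])),
        "init": (g1["init"], g2["init"]),
        "marked": set(product(g1["marked"], g2["marked"])),
        "trans": trans,
    }
```

Transitions not produced by any of the three rules are simply absent from the resulting transition dictionary, i.e., they are disabled.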
3.4 Modeling the system
The first step to map a process as a DES is to identify which components (or subsystems) are expected to be modeled. Subsystems are then individually modeled by an FSM and composed afterward. At this step, the designer observes only the constraint-free behavior of the components, disregarding details about their posterior coordination.
Consider a DES formed by a set $J = \{1, \cdots, m\}$ of components, so that each component is modeled by an FSM denoted $G_j$, $j \in J$. Then
$G = \|_{j \in J}\, G_j$
is said to be the plant model.
3.4.1 Example of plant modeling. In order to illustrate the modular way a DES can be modeled, consider a simple example of two transmitters, $T_1$ and $T_2$, sharing a communication channel $C$, as in Fig. 2(a). $T_1$ and $T_2$ can be respectively modeled by the FSMs $G_{T_1}$ and $G_{T_2}$ in Fig. 2.
Figure 2: Example of a concurrent transmission system. (a) Layout of the transmission process; (b) plant models $G_{T_1}$ and $G_{T_2}$; (c) composed plant model $G = G_{T_1} \| G_{T_2}$.
When seen as a DES, the following events are observable throughout the process:
• $req_1$ and $req_2$: request messages arriving for transmission in $T_1$ and $T_2$, respectively;
• $tran_1$ and $tran_2$: model the start of transmission in $T_1$ and $T_2$, respectively; and
• $ack_1$ and $ack_2$: reset the channel $C$ for new transmissions.
From the observation of this event set, one can construct a plant model that represents the dynamic behavior of the channel based on its evolution over the state space. A proposal for modeling each
transmitter is shown in Fig. 2(b), where each model is composed of three states, respectively meaning that the process is idle, waiting for transmission, and transmitting. The plant $G = G_{T_1} \| G_{T_2}$ has 9 states and 18 transitions, and it is displayed in Fig. 2(c).
Remark that, even for this simple example of two transmitters, the plant model $G$ requires a substantially large state space to expose and unfold all possible sequences for the system. Yet, the designer never actually faces this complexity, as the most complex FSM is modeled with only 3 states and the model $G$ emerges from an automatic composition.
As the behavior of each transmitter (and consequently of $G$) is unrestricted, i.e., it considers neither channel limitations nor the sharing of $C$ with the other transmitter, it has to be restricted to some extent. This is approached next.
3.5 Restricting the system
When the components of a DES are modeled by FSMs, their composition leads to a plant $G$ that expresses the unrestricted system behavior. In practice, the plant components need to follow a certain level of coordination for them to operate concurrently and behave as expected.
For this purpose, an additional structure called a restriction, here denoted by $R$, has to be composed with the plant. A restriction can be seen as a prohibitive action that is expected to be observed in the system behavior. In summary, as the plant has been modeled by a composition of constraint-free subsystems, we now disable some of its eligible events, adjusting them to cope with the restrictions.
In practice, a restriction
$R = \|_{i \in I}\, R_i$,
for $i \in I = \{1, \cdots, n\}$, can also be expressed by automata and composed automatically with the plant $G$. This leads to the so-called closed-loop behavior, in this paper denoted by $K = G \| R$.
3.5.1 Example of modeling restrictions. For the transmission system example in Fig. 2, consider that $C$ has the capacity to transmit only one message at a time. In this case, the behavior of the transmitters has to be restricted with respect to the channel capacity. An FSM that models such a limitation is presented in Fig. 3.
Figure 3: Mutual exclusion restriction $R$ for the channel $C$.
That is, $R$ imposes mutual exclusion on transmissions in the channel. It allows both transmitters to start a transmission (events $tran_1$ and $tran_2$ from the initial state), but, as soon as one of them occupies the channel, the other is prohibited from transmitting until an acknowledgment ($ack_1$ or $ack_2$) is received (both $tran_1$ and $tran_2$ are disabled in the non-initial state).
3.6 Closed-loop modeling
Remark, therefore, that controlling a DES plant $G$ relies, first of all, on obtaining a model $R$ that correctly reflects the expected requirements. This is a design task that in this paper is assumed to be well defined.
From the composition $K = G \| R$, one obtains an FSM $K$ that models the closed-loop system behavior, i.e., the system behavior under the control of $R$. For the previous example of the transmission system in Fig. 2, for instance, $K = G \| R$ has 8 states and 14 transitions, and it is displayed in Fig. 4.
Figure 4: Graphical view of $K = G \| R$.
The model $K$ can be converted into an implementable hardware language for practical use [23].
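To illustrate how the closed-loop model of this example could be assembled with the compose sketch given in Section 3.3, the snippet below builds the transmitters and the restriction in the same assumed dictionary-based encoding and adds a small reachability helper (the markings of $G_{T_1}$, $G_{T_2}$, and $R$ are assumptions, since the figures do not state them explicitly).

```python
def fsm(states, events, init, marked, trans):
    # helper to build the dictionary-based FSM format used by compose()
    return {"states": set(states), "events": set(events), "init": init,
            "marked": set(marked), "trans": dict(trans)}

# Transmitters T1 and T2 (Fig. 2(b)): idle -> waiting -> transmitting -> idle
GT1 = fsm(["i", "w", "t"], ["req1", "tran1", "ack1"], "i", ["i"],
          {("i", "req1"): "w", ("w", "tran1"): "t", ("t", "ack1"): "i"})
GT2 = fsm(["i", "w", "t"], ["req2", "tran2", "ack2"], "i", ["i"],
          {("i", "req2"): "w", ("w", "tran2"): "t", ("t", "ack2"): "i"})

# Mutual exclusion restriction R for the channel C (Fig. 3)
R = fsm(["free", "busy"], ["tran1", "tran2", "ack1", "ack2"], "free", ["free"],
        {("free", "tran1"): "busy", ("free", "tran2"): "busy",
         ("busy", "ack1"): "free", ("busy", "ack2"): "free"})

def reachable(g):
    # trim the composition to the part reachable from the initial state
    frontier, seen, trans = [g["init"]], {g["init"]}, {}
    while frontier:
        q = frontier.pop()
        for e in g["events"]:
            q2 = g["trans"].get((q, e))
            if q2 is not None:
                trans[(q, e)] = q2
                if q2 not in seen:
                    seen.add(q2)
                    frontier.append(q2)
    return {**g, "states": seen, "marked": g["marked"] & seen, "trans": trans}

G = compose(GT1, GT2)                      # open-loop plant: 9 states, 18 transitions
K = reachable(compose(G, R))               # closed-loop behavior K = G || R
print(len(K["states"]), len(K["trans"]))   # 8 states and 14 transitions, as in Fig. 4
```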
3.7 Controllability of events
Remark that $K$ can be seen as a preliminary version of a control logic for the plant $G$. However, from an industrial point of view, $K$ is expected to have some additional properties before it can be implemented. It is expected, for example, to differentiate controllable and uncontrollable events.
So far, we have assumed that all events in $G$ can be handled under control. In practice, however, some events may occur in an involuntary fashion, so that they cannot be directly handled. These events are called uncontrollable, and not considering them may violate control consistency, a situation in which the controller commands a certain action that cannot be reproduced physically.
A communication breakdown or a signal dropout, for example, are samples of uncontrollable events. In Fig. 2, $req_i$ and $ack_i$, $i = 1, 2$, are also uncontrollable, as one cannot decide whether or not a message arrives or a transmission is confirmed. They are observable and expected to occur, eventually, but they cannot be handled in advance by the controller, and therefore must be kept free to occur.
Figure 5: Addition of a dropout signal to the plant models.
Formally, the idea of controllability of events can be presented by partitioning the event set of an FSM, such that $\Sigma = \Sigma_c \cup \Sigma_u$ is the alphabet resulting from the union of controllable ($\Sigma_c$) and uncontrollable ($\Sigma_u$) events. Then, mathematical operations can be defined to extract from a model $K$ the sub-model that respects the impossibility of disabling events in $\Sigma_u$. This is the kernel of, for example, control synthesis methods such as the Supervisory Control Theory [24] and its several extensions.
In this paper, controllability will be considered and it plays an essential role in the results to be derived later in Section 5 involving RL. However, we do not go through the synthesis processing step, leaving this to the engineer's discretion. In fact, our goal here is to associate a controllability-aware model $K$ with an adaptation-aware RL method, so that the controllability itself can be exploited either before or after the proposed conversion of $K$ to Gym.
If, on the one hand, processing synthesis over $K$ (before the RL treatment) assigns robustness to the control system, on the other hand it reduces its chances for flexibility. Our approach works for both strategies, but we opted for abstracting the synthesis step in order to illustrate the potential of our approach for flexible, customizable control.
3.8 Reinforcement Learning
Reinforcement learning is a computational paradigm in which an agent seeks to increase its performance based on the reinforcements it receives when interacting with an environment, learning a policy, i.e., a sequence of actions [9][31]. To do that, the agent acting in an environment perceives a discrete set of states $S$ and performs a set $A$ of actions. At each time step $t$, the agent can detect its current state $s$ and, according to this state, choose an action to be executed, which will take it to another state $s'$. For each state-action pair $(s, a)$ there is a reinforcement signal $R(s, a) \rightarrow \mathbb{R}$, given by the environment to the agent when executing an action $a$ in state $s$.
The most traditional way to formalize reinforcement learning consists in using the concept of a Markov Decision Process (MDP). An MDP is formally defined by a quadruple $M = \langle S, A, T, R \rangle$, where:
• $S$ is a finite set of states in the environment;
• $A$ is a finite set of actions that the agent can perform;
• $T: S \times A \rightarrow \Pi(S)$ is a state transition function, where $\Pi(S)$ is a probability distribution over the set of states $S$ and $T(s_{t+1}, s_t \mid a_t)$ defines the probability of the transition from state $s_t$ to state $s_{t+1}$ when executing an action $a_t$; and
• $R: S \times A \rightarrow \mathbb{R}$ is a reward function, which specifies the agent's task, defining the reward received by the agent for selecting action $a$ while in state $s$.
The successful application of modern RL was mainly demonstrated in games (e.g., backgammon [34], Go [28], and Atari games [19]). In all these cases, the environments are MDPs or partially observable MDPs (POMDPs). Figure 6 shows a simple example of an MDP with 5 states and 6 actions, in which state 1 is the initial state and 5 is the terminal (or objective) state.
For the MDP in Figure 6, the quadruple $M = \langle S, A, T, R \rangle$ consists of the following elements:
• $S = \{1, 2, 3, 4, 5\}$;
• $A$ = {Read a book, Do a project, Publish a paper, Get a raise, Play video game, Quit};
Figure 6: Example of an MDP of a real-life academic.
• The state transition function consists of all probabilities being 100%, except for the action Play video game in states 2 and 3, which has a probability of 10% of returning to state 1 and 90% of going into the terminal state 5. This transition is denoted by the ⊗ symbol.
• The reward table is shown in Table 1, which lists the reward received by the agent for taking the action in the corresponding column while in the state in the corresponding row. Note that all cells with a NaN (not a number) indicate that the transition is not possible in the MDP.
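To make the reward and transition structure of this example explicit, it could be encoded with plain Python dictionaries as sketched below. This is only an assumed, illustrative encoding (in particular, the successor states of Quit and Get a raise are not stated in the text and are assumed here to end the episode); it is not part of the proposed tool.

```python
import random

# Rewards from Table 1: reward[state][action]
reward = {
    1: {"Read a book": -4, "Quit": 0},
    2: {"Do a project": -2, "Play video game": +1, "Quit": 0},
    3: {"Publish a paper": -1, "Play video game": +1},
    4: {"Get a raise": +12},
    5: {},  # terminal state: no actions available
}

def step(state, action):
    """Return (next_state, reward) for the MDP of Figure 6."""
    r = reward[state][action]
    if action == "Play video game":
        # stochastic transition: 90% to terminal state 5, 10% back to state 1
        return (5 if random.random() < 0.9 else 1), r
    successor = {"Read a book": 2, "Do a project": 3, "Publish a paper": 4,
                 "Get a raise": 5,   # assumed: the raise ends the episode
                 "Quit": 5}          # assumed: quitting also ends the episode
    return successor[action], r
```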
For an agent to maximize its rewards, it needs to learn that the cumulative reward over time can only be maximized when temporary punishments, that is, negative rewards, are accepted. In the example above, for the agent to get a raise, it first needs to read a book, do a project, and publish a paper, and all of these previous actions give the agent a negative reward; but in the end, a positive reward of 12 is added, so all of the hard work was worth it. Therefore, the agent needs to take into account not only immediate rewards, but also possible future rewards. A single episode $e_{MDP}$ can be described as a sequence of states, actions, and rewards:
$e_{MDP} = s_0, a_0, r_0, s_1, a_1, r_1, \cdots, s_{n-1}, a_{n-1}, r_{n-1}, s_n$ (1)
where $s_i$ represents the $i$-th state, $a_i$ the $i$-th action, and $r_i$ the $i$-th reward. The total future reward at any time point $t$ is given by:
$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-t} r_n$ (2)
where $\gamma \in (0, 1)$ represents the discount factor and models how strongly the agent takes future rewards into account. Values close to 0 represent a short-sighted strategy, as higher-order terms for rewards in the distant future become negligible. If the environment is deterministic, $\gamma$ can be set to 1, as the same actions always result in the same rewards [14].
With that, it is possible to define the aptitude of an agent that learns by reinforcement as the ability to learn a policy $\pi^*: S \rightarrow A$ that maps the current state $s_t$ to a desired action, maximizing the accumulated reward over time and thereby describing the agent's behavior [9].
A good strategy for maximizing the future reward can be learned through the state-action value function, also known as the Q function.
           Read a Book  Do a Project  Publish a Paper  Get a Raise  Play video game  Quit
State 1    -4           NaN           NaN              NaN          NaN              0
State 2    NaN          -2            NaN              NaN          +1               0
State 3    NaN          NaN           -1               NaN          +1               NaN
State 4    NaN          NaN           NaN              +12          NaN              NaN
State 5    NaN          NaN           NaN              NaN          NaN              NaN
Table 1: Reward table for the MDP of Figure 6.
It specifies how good it is for an agent to perform a particular action in a state under a policy $\pi$. We can define the Q-values as follows:
$Q^\pi(s, a) = \mathbb{E}_{s'}\left[ r + \gamma\, \mathbb{E}_{a' \sim \pi(s')}\left[ Q^\pi(s', a') \right] \right]$ (3)
where $Q^\pi(s, a)$ is the Q-value (quality value) of a policy $\pi$ when taking an action $a$ in a state $s$, and $\mathbb{E}$ is the expected value. This equation is also called the Bellman equation.
There is a wide variety of RL algorithms, such as Q-learning [36], H-learning [33], Dyna [30], Sarsa [31], and Deep Q [19], among others.
In this paper we make use of two well-known RL algorithms: Q-learning and the Deep Q Network.
3.9 Q-learning
The Q-learning algorithm [35] has attracted a lot of attention for its simplicity and effectiveness. This algorithm allows one to establish a policy of actions in an autonomous and iterative way. It can be shown that the Q-learning algorithm converges to an optimal control procedure when the learning hypothesis of state-action pairs Q is represented by a complete table holding the value of each pair. The convergence occurs in both deterministic and non-deterministic Markov Decision Processes.
The basic idea of Q-learning is that the learning algorithm learns an evaluation function over all the state-action pairs $S \times A$. The Q function provides a mapping of the form $Q: S \times A \rightarrow V$, where $V$ is the value of the expected utility of executing an action $a$ in a state $s$. Since the agent's partitions of both the state space and the action space do not omit relevant information, once the optimal function is learned, the agent will know which action results in the best future reward in every state.
Considering the Bellman equation (Equation 3), if the policy $\pi$ tends to an optimal policy, the term $\pi(s)$ can be taken as $\operatorname{argmax}_a Q^\pi(s, a)$. Then we can rewrite Equation 3 as:
$Q^\pi(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^\pi(s', a') \right]$ (4)
From the last equation, in a discrete state space Q-learning uses an online off-policy update, so the equation of the Q-values can be formulated as:
$Q^\pi(s, a) = Q^\pi(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$ (5)
where $Q(s, a)$ is the Q-value of executing an action $a$ in a state $s$, $\alpha$ is the learning rate of the training, and $\max_{a'} Q(s', a')$ is the highest value in the Q-table in the row related to the state $s'$ that follows state $s$, that is, the action of the next state that has the best return according to the Q-table. This Q function represents the discounted expected reward of taking an action $a$ when visiting state $s$ and following an optimal policy from then on. The procedural form of the algorithm is shown below.
Algorithm 1: Q-learning procedure
1: Receive $S$, $A$, $Q$ as input;
2: For each $(s, a)$, set $Q(s, a) = 0$;
3: while Stop Condition == False do
4:   Select action $a$ under policy $\pi$;
5:   Execute $a$;
6:   Receive immediate reward $r(s, a)$;
7:   Observe new state $s'$;
8:   Update $Q(s, a)$ according to Equation 5;
9:   $s \leftarrow s'$;
10: end
11: Return $Q$;
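As an illustration, a minimal tabular Q-learning loop following Algorithm 1 could look like the Python sketch below. It assumes a generic environment object exposing Gym-style reset() and step() methods, integer-encoded states and actions, and an externally supplied policy function; these names are assumptions for illustration, not the interface of the tool presented later.

```python
import numpy as np

def q_learning(env, policy, n_states, n_actions,
               episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Q-table initialized to zero for every state-action pair (line 2 of Algorithm 1)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:                      # stop condition: end of the episode
            a = policy(s, Q, epsilon)        # e.g. epsilon-greedy (Algorithm 2)
            s_next, r, done, _ = env.step(a)
            # Equation 5: temporal-difference update of Q(s, a)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```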
Note that the stop condition can be executing a specific number $n$ of steps in a single episode, reaching a terminal state, among other options. Once all state-action pairs are visited a finite number of times, it is guaranteed that the method generates estimates $Q$ that converge to a value $Q^*$ [36]. In practice, the action policy converges to the optimal policy in finite time, although slowly. Moreover, it is possible to learn the ideal control directly, without modeling the transition probabilities or the expected rewards present in Equation 4, which makes Q-learning attractive for use in Discrete Event Systems.
However, Equation 5 always chooses the action with the highest Q-value for a specific state $s_i$. Let us say that, at first, the agent is in the MDP initial state $s_0$ and takes an action $a_1$ that gives it a good reward, so upon updating the Q-table there is a value higher than 0 in $Q(s_0, a_1)$. The agent does not know that taking an action $a_0$ in the initial state $s_0$ would give it a better reward than taking action $a_1$, but it will always choose action $a_1$ in that state because it has the maximum value in the Q-table, and therefore it will never learn that there are better actions to take. To improve this, so that the agent explores all state-action pairs in the table $Q$, the use of epsilon-greedy as the policy $\pi$ of Algorithm 1 is required.
The epsilon-greedy policy, detailed in Algorithm 2, switches between exploration (choosing a random action) and exploitation (choosing the best action) according to a value $\epsilon$ between 0 and 1 that defines the probability of choosing a random action, so as to explore all the actions in the environment. Upon receiving the current state $s_i$, the Q-table $Q$, and $\epsilon$ as input, the algorithm generates a random number between 0 and 1. If this number is higher than $\epsilon$, the policy chooses the action with the highest Q-value in that state $s_i$ according to $Q$. If the number is lower than or equal to $\epsilon$, it chooses a random action to take in the environment.
Algorithm 2: Epsilon-greedy policy
1: Receive $s_i$, $Q$, $\epsilon$ as input;
2: Generate random number $x$ between 0 and 1;
3: if $x > \epsilon$ then
4:   Choose action with highest Q-value in $Q(s_i)$;
5: else
6:   Choose random action;
7: end
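A direct Python rendering of Algorithm 2, compatible with the Q-learning sketch above, could be as follows (the function name and the NumPy-based Q-table are assumptions of the illustration):

```python
import random
import numpy as np

def epsilon_greedy(state, Q, epsilon):
    # Exploit: pick the action with the highest Q-value for this state
    if random.random() > epsilon:
        return int(np.argmax(Q[state]))
    # Explore: pick a uniformly random action
    return random.randrange(Q.shape[1])
```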
In the end, Q-learning is a tabular method that is very useful in many systems; however, its effectiveness is limited to environments with a reduced number of states and actions, since it performs an exhaustive search over all possible state-action pairs contained in the environment. Consider an environment with an extremely high number of states, where in each state there is a high number of possible actions to be taken: it would be a waste of time and memory to explore all of the possible state-action pairs. A better approach is to use an approximation function with parameters $\theta$ such that $Q(s, a; \theta) \approx Q^*(s, a)$.
To do that, we can use a neural network with parameters $\theta$ to approximate the Q-values for all possible state-action pairs [14]. This approach was created in [19] with the objective of making an AI learn how to play Atari games, and it was called the Deep Q Network (DQN).
3.10 Deep Q Network
The goal of the Deep Q Network is to make the exhaustive search over all state-action pairs of the environment unnecessary, because in some cases there can be a huge number of states. An example is a servo motor's position, which can assume values between 0 and 180 degrees; this is not a discrete value, but a continuous variable that can assume many values in this interval.
Another case is an Atari game, which has an image of 210 x 160 pixels and an RGB color system in which each channel can vary from 0 to 255. If we consider that each state is a single possible frame, the number of possible states in the environment is approximately $210 \times 160 \times 255^3 \approx 5.57 \times 10^{11}$. In the Atari case, a single pixel does not make much difference, so two such images can be treated as a single state, but it is still necessary to distinguish some states. In [18], the DeepMind researchers tested a convolutional neural network on seven different Atari games, and in [19] they made some optimizations to the same neural network and to the environments, which made the training of 49 different games possible. In both cases, the authors used a convolutional neural network to extract the game state, and then dense layers to obtain an approximation of the function in Equation 5.
In DQN, the loss function at iteration $i$ that needs to be optimized is the following:
$L_i(\theta_i) = \mathbb{E}_{s, a, r, s'}\left[ \left( \hat{y}_i - Q(s, a; \theta_i) \right)^2 \right]$, (6)
where $\hat{y}_i = r + \gamma \max_{a'} Q(s', a'; \theta')$, and $\theta'$ denotes the parameters of the target network. This target network has the same architecture as the original network, but while the original network is updated at every step, the target network is updated every N steps, and each of these updates corresponds to the target network copying the parameters of the original network.
These two separate networks are created because in some environments, like the Atari games, there are many consecutive states that are very similar to each other, since two consecutive steps may differ by a single pixel, so there is a lot of correlation between them. In this case, the values of $Q(s, a; \theta)$ and $Q(s', a'; \theta)$ are very similar, which means that the neural network cannot distinguish well between both states [14], making the training very unstable.
Another important ingredient for reducing the correlation among steps is Experience Replay, where the agent accumulates a buffer $\mathcal{D} = t_1, t_2, \ldots, t_i$ with experiences $t_i = (s_i, a_i, r_i, s_{i+1})$ from many episodes. The network is then trained by sampling from $\mathcal{D}$ uniformly at random, instead of directly using the current samples. The loss function can then be expressed as:
$L_i(\theta_i) = \mathbb{E}_{s, a, r, s' \sim u(\mathcal{D})}\left[ \left( \hat{y}_i - Q(s, a; \theta_i) \right)^2 \right]$ (7)
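The target-network loss of Equations 6 and 7 can be written compactly with PyTorch, one of the numerical libraries cited earlier. The sketch below is a generic illustration (the network objects, the batch format, and the update period N are assumptions), not the training code used in the experiments.

```python
import torch
import torch.nn.functional as F

def dqn_loss(policy_net, target_net, batch, gamma=0.99):
    """batch: float/long tensors (states, actions, rewards, next_states, dones)
    sampled uniformly from the experience-replay buffer D (Equation 7)."""
    states, actions, rewards, next_states, dones = batch
    # Q(s, a; theta) for the actions actually taken
    q_sa = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # y_hat = r + gamma * max_a' Q(s', a'; theta') from the frozen target network
        target = rewards + gamma * target_net(next_states).max(dim=1).values * (1 - dones)
    return F.mse_loss(q_sa, target)

# Every N steps, the target network copies the parameters of the original network:
# target_net.load_state_dict(policy_net.state_dict())
```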
Figure 7 [18] shows the neural network structure used in [19], from the extraction of the game state to the Q-value of each action in the related state. In this case, the actions correspond to each button (or combination of buttons) of an Atari controller.
Figure 7: Neural network used for training the games.
4 RELATIONSHIP BETWEEN MDP AND FSM
AND THE IMPLICATION OVER RL
BEHAVIOR
By comparing the structure of an MDP with that of an FSM, the following can be highlighted to help sustain the propositions in Section 5:
• A state of an MDP is equivalent to a state of an FSM; therefore, the initial state of an MDP is also equivalent to the initial state of an FSM, and a terminal state of an MDP coincides with a marked (objective) state of an FSM.
• An action in an MDP coincides with an event that triggers a transition in an FSM;
• In an MDP, an action is executed by an agent, while an event is expected to be handled by a controller in an FSM modeling a DES. Both events and actions have controllability issues, i.e., some events are expected to be uncontrollable, and some actions cannot be entirely decided by the agent.
Take the MDP in Figure 6 as an example. In states 2 and 3 the agent can take the action "Play video game", but the subsequent state is undetermined, since there is a probability of 90% of reaching the terminal state 5 and 10% of reaching the initial state 1. To adapt this to a DES, we may treat the action "Play video game" as a controllable event that evolves the DES to the intermediary state in the MDP, the one with the two transition probabilities. These two transitions (0.9 and 0.1) would then be labeled with uncontrollable events that evolve the DES to the initial state represented by MDP state 1, or to the terminal state 5. In summary, all actions in an MDP are events in a DES, but only the actions corresponding to controllable events can be taken by the agent; the others are unknown. In this work, we consider splitting the set of actions $A$ into the sets $A_c$ and $A_u$, defining respectively controllable and uncontrollable actions, such that $A_c \subseteq A$, $A_u \subseteq A$, and $A_u \cup A_c = A$ follow by construction.
To make the understanding clearer, Table 2 shows a summary of the relationship between an MDP and a DES.
Despite the apparent similarities, a DES model does not natively include a reward processing system to evaluate the agent's actions. This prevents it from being directly exploited for control optimization purposes, which is an important feature in I4.0 environments. In contrast, RL is reward-aware by construction, which makes it closer to the I4.0 needs, but it lacks immediate resources for safety-aware DES modeling and control.
In this paper, we claim that both approaches are useful and can be combined to some extent, but they are not combined in practice because of the lack of integration tools. Based on this claim, we address basic control requirements (those related to safety) directly at the FSM level (see Section 3.6), and optimization requirements by converting the semi-controlled DES into an RL model. The RL reward processing step is described in the following.
4.1 Reward processing analysis
In RL, two notions of rewards can be considered: immediate and delayed. In the I4.0 context, a delayed reward would be, for example, the conclusion of a manufacturing step, leading to a profit. In terms of modeling, this profit has to be set manually for all events (or actions) that lead to a terminal state.
Many real-life applications of RL use delayed rewards, so that RL naturally solves the difficult problem of correlating immediate actions with the delayed returns they produce. Like humans, RL algorithms sometimes have to wait a while to collect the return from their decisions. They operate in a delayed-return environment, where it can be difficult to understand which action leads to a specific outcome over many time steps.
On the other hand, immediate rewards are also possible to implement, and they can be positive or negative. A positive immediate reward, in I4.0, can be, for example, part of a final product being manufactured (e.g., the lid of a kettle). Although the part is not entirely related to a possible profit, when finished it can bring indirect benefits to the production plant (e.g., it can release specific machines to steps that lead to the full product manufacturing).
Differently, a negative immediate reward is any action (i.e., an event in a DES) that consumes resources (e.g., energy, time, raw material) and does not result in a complete product, or part of it. The more common case is the association of small negative values with events that cause a waste of time or energy (e.g., a robot move). In other cases, an event such as a machine breaking down (i.e., an uncontrollable event) can have a very high negative reward, but the probability for this event to occur is very low. In this case, the RL method needs to learn, through simulation, the chances for this event to occur and whether it still results in positive rewards by the end of a manufacturing process (delayed reward). In the broken machine example, the immediate negative reward can be associated with the cost of repair, while the delayed reward can be associated with the profit of producing a number $n$ of assets.
It is common, for an I4.0 process modeled as a DES, to have positive rewards only on the events that reach final states, while the intermediate events are all negative and can be associated with time costs, raw material, power cost, labor, etc.
5 DES MODEL CONVERSION TO A GYM
WRAPPER
In this section, we detail the proposed conversion of DES models into a trainable Gym wrapper. The resulting procedures and codes can be accessed at [12].
For a better understanding of the methods and functionalities of a Gym wrapper, we first detail the main characteristics of this type of environment. Then, we describe the features of the FSM that serves as input to the learning environment. Finally, we present a detailed methodology and the conversion steps, which are exposed and released in the form of a computational tool.
5.1 Gym wrappers
Gym is a toolkit for developing and comparing reinforcement learning algorithms. It makes no assumptions about the structure of the agent and is compatible with any numerical computation library, such as TensorFlow or PyTorch [4].
The goal of a Gym wrapper is to interact with a modeled environment by exploring its state space. By observing states, the agent learns how good it would be to perform specific actions in each state. This allows it to choose actions that maximize rewards by the end of a task, thus obtaining an optimal action policy.
The step method in our Gym class returns four values that represent all the information needed for training. They are detailed in the following:
• Observation (object): a specific object of the environment that represents the observation of the current state in that environment;
• Reward (float): the amount of reward received for the previously chosen action;
The relationship between a Discrete Event System (DES) and a Markov Decision Process (MDP) is summarized below; each entry gives the DES element, its meaning as an MDP element, and the corresponding symbols.
• Set of states (DES: $Q$; MDP: $S$; $Q = S$): the set of states of the DES is the set of states of the MDP.
• Initial state (DES: $q^\circ$; MDP: $s_0$; $q^\circ = s_0$): the initial state of the DES is the initial state of the MDP.
• Set of marked states (DES: $Q^\omega$; MDP: $S_m$; $Q^\omega = S_m$): the set of marked states of the DES is the set of terminal states of the MDP.
• Controllable events (DES: $\Sigma_c$; MDP: $A_c$; $\Sigma_c = A_c \subseteq A$): all of the actions over which the agent has control, that is, it can decide whether or not to take that action.
• Uncontrollable events (DES: $\Sigma_u$; MDP: $A_u$; $\Sigma_u = A_u \subseteq A$): in the MDP, the uncontrollable events represent the set of uncontrollable actions, over which the RL agent has no control, e.g., the transition probabilities in Figure 6. The transition probabilities in a DES are uncontrollable events. In this paper, we consider that the user specifies probabilities for all uncontrollable events and, if not specified, there are equal probabilities for them to trigger. For example, if in a state $x$ there are 2 enabled uncontrollable events, there is a 50% chance for each to trigger.
• Transitions (DES: $f$; MDP: $T$; $f = T$): a transition in a DES is also a transition in an MDP, which evolves the environment from one state to another. Although in an MDP some transitions are followed by probabilities of going to different states, in this paper we consider that all of the transitions of the generated MDP have a 100% probability of going to the next state, i.e., once a transition is activated, we know what the next state in the environment will be.
Table 2: Relationship between DES and RL environments
• Done (boolean): identifies whether the task for the environment is complete. This can happen when the MDP is in a terminal state or simply after performing a specific number $N$ of steps. After this variable becomes True, a method called reset is invoked to return the environment to its initial state and reset all rewards to 0;
• Info (dictionary): diagnostic information used for debugging. It can also eventually be useful for learning.
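For orientation, the skeleton below sketches what an FSM-backed Gym environment with this step/reset/render contract might look like. The class name, constructor arguments, and internals are illustrative assumptions and do not reproduce the released tool's code.

```python
import gym

class FSMEnv(gym.Env):
    """Minimal sketch: an environment driven by an FSM's transition table."""

    def __init__(self, states, events, transitions, initial_state, terminal_states):
        self.states = states                     # FSM states
        self.events = events                     # events (the action space)
        self.transitions = transitions           # dict: (state, event) -> next state
        self.initial_state = initial_state
        self.terminal_states = set(terminal_states)
        self.rewards = {e: -1 for e in events}   # default: every action costs 1

    def reset(self, rewards=None):
        if rewards is not None:
            self.rewards = dict(rewards)
        self.current_state = self.initial_state
        return self.current_state

    def step(self, event):
        # evolve the FSM only along transitions enabled in the current state
        self.current_state = self.transitions[(self.current_state, event)]
        reward = self.rewards[event]
        done = self.current_state in self.terminal_states
        info = {"enabled": [e for e in self.events
                            if (self.current_state, e) in self.transitions]}
        return self.current_state, reward, done, info

    def render(self, mode="human"):
        print(f"current state: {self.current_state}")
```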
From the observation object, we can choose the actions to be taken by the agent, which traverses the states of the environment, capturing the required information. Note that all actions are assumed to be available for the agent in all possible states of the environment. In most Gym environments available at the original Gym website, there is also a render function that allows the user to see a visual representation of the environment. This can be a screenshot of an Atari game, a drawing of a chess board configuration, etc.
In this paper, as we are working specifically with FSMs, all environments are represented as automata. The render function returns the automaton of the environment, with the initial, current, and last states specified in the observations.
5.2 DES system to a gym wrapper
The first step to convert a DES model into a Gym wrapper is to design the plant $G$ and the specifications $R$ in a suitable modeling software. Here we use Supremica [39], a design-friendly tool that includes resources for both modeling and control tasks, besides allowing one to compose, simulate, and check the correctness of FSM models.
Upon composition, one obtains the FSM $K = G \| R$, i.e., the behavior expected for the system under control, exactly as projected by the engineer. The model $K$ could be further exploited for control purposes, for example in terms of the controllability of its events, nonblockingness, etc., which is quite straightforward in control engineering practice. For the purposes of this paper, however, we consider keeping $K$ as it is, i.e., including the immediate control actions projected via engineering, for it to be further refined using RL. This differs, to a certain extent, from the usual control practice, which does not usually associate the robustness of control with the smoothness and sensitivity of RL techniques. We claim this as a novelty of our proposal.
With the pre-controlled FSM $K$ in hand, useful information can be extracted and converted to an MDP: for example, the set of states, the initial state, the set of marked states, the events (controllable and uncontrollable), and the transitions (see Table 2). In practice, this information is exported as an XML structure and parsed into a Gym class in order to create an MDP environment.
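As a hedged illustration of this parsing step, the snippet below reads a hypothetical XML export with Python's standard ElementTree module. The tag and attribute names (State, Event, Transition, initial, marked, controllable) are assumptions made for the example and do not necessarily match Supremica's actual export format or the released tool.

```python
import xml.etree.ElementTree as ET

def parse_fsm_xml(path):
    """Parse a hypothetical FSM export into the pieces needed by the Gym class."""
    root = ET.parse(path).getroot()
    states, marked, initial = [], [], None
    for st in root.iter("State"):
        name = st.get("name")
        states.append(name)
        if st.get("initial") == "true":
            initial = name
        if st.get("marked") == "true":
            marked.append(name)
    events, uncontrollable = [], []
    for ev in root.iter("Event"):
        events.append(ev.get("name"))
        if ev.get("controllable") == "false":
            uncontrollable.append(ev.get("name"))
    transitions = {(tr.get("source"), tr.get("event")): tr.get("target")
                   for tr in root.iter("Transition")}
    return states, events, uncontrollable, transitions, initial, marked
```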
This MDP environment is then a class containing all the information specified by the DES, following the relationship in Table 2. The attributes of this new class are the states, events, and transitions of the input DES. There are also attributes that specify the initial state, the set of terminal states, and the controllability of events. In a standard Gym wrapper, it is assumed that all actions are possible in all states. In this work, however, we modify this assumption to consider only the actions possible in a given state. This coincides with FSM models whose state-transition relation is formalized as a partial function. For a given state $s_i$, we call $A^i$ the subset of possible actions in that state, with $A^i \subseteq A$.
In addition to the information described in Table 2, an RL environment also requires a reward structure for taking a specific action in a given state. In this work, we consider that every action taken by the agent returns an immediate reward representing a profit (e.g., the production of a workpiece) or a loss (e.g., spent time, consumed energy, raw material) in our system. Thus, for the environment to work, it is necessary to provide a set of rewards $R$, in which an element $r_i \in R$ is specified for each action in the environment. $R$ is provided as a parameter to the reset method in the form of a list. By default, all actions receive a loss of $-1$, stating that every action has a cost. The loss type does not need to be specified in the system, as it is a generalization of any desired optimization objective. All default rewards can be replaced by a positive reward (a profit) or a negative one (a loss). These rewards depend on the system to be modeled, and there are no predefined rules. In most cases, however, positive rewards are defined for actions related to production, as they represent profit in industrial environments.
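The hypothetical snippet below illustrates this convention with the FSMEnv sketch shown earlier: a per-event reward structure where everything defaults to a loss of -1 and only a production-related event is given a profit. The event name and the reset signature are assumptions, not the released tool's exact API.

```python
# Hypothetical example: every event costs 1, except b2 (a finished workpiece
# leaving the system), which yields a profit of +10.
rewards = {event: -1 for event in env.events}
rewards["b2"] = +10
observation = env.reset(rewards=rewards)
```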
On the other hand, there are some features of DESs that are not directly converted into an MDP, such as the controllability of events. Although both types of events make an FSM evolve from one state to another, we have to consider that uncontrollable events cannot be activated or disabled by control. These events are triggered by a probability function (see the example of the "Play video game" action in Figure 6). Therefore, we consider that the set of uncontrollable events ($\Sigma_u$) of a DES turns into a set of uncontrollable actions $A_u$ in the MDP. These actions cannot be chosen by the RL agent, and their activation is determined by an associated probability. The set of probabilities $P$, in which each $p_i \in P$ is the probability of the system triggering an action $a_i$, can also be passed as a parameter to the reset function.
By default, if $P$ is not specified, then the policy $\pi$, in exploration mode, chooses randomly among all the possible actions, both controllable and uncontrollable, to be triggered in a given state. That is, if in a given state $s_i$ we have two possible uncontrollable actions ($a_1$ and $a_2$) and one controllable action ($a_3$), then there is a 33% chance for each of $a_1$, $a_2$, and $a_3$ to be chosen. In Figure 10, the action $ack_1$ is an example of an uncontrollable action that always triggers. This default configuration reflects the situation where the DES designer is unaware of the activation frequency of the uncontrollable transitions.
If $P$ is specified only for some uncontrollable actions, but not for all, then we first verify whether these actions trigger. If they do not, we apply epsilon-greedy to the remaining actions in order to trigger one of them. Take as an example the model in Figure 5, where we can specify a probability of 1% for the dropout action. In that case, this action is evaluated first. After that, the other possible actions are evaluated by the exploration or exploitation strategies of the policy $\pi$.
Remark 1. When choosing exploitation, we need to pick, in that state, the action with the highest Q-value, as long as the action is controllable, since the agent cannot choose uncontrollable actions.
Remark 2. Triggering or not uncontrollable actions, based on probabilities or on exploration, matters only during the training phase. After training, the agent chooses only controllable actions with the highest Q-value, while uncontrollable ones are triggered physically.
Remark 3. Controllable events are not triggered by probabilities, since the control agent decides whether or not to take them. In this way, the RL agent learns an optimal policy during training, and it chooses the action to take in a given state based on the best possible reward. The selected action cannot be uncontrollable, though, because the agent cannot choose whether or not to activate it.
From these remarks, there are some considerations to introduce. For example, the exploration-exploitation policies need to be extended. Epsilon-greedy, for example, turns into a new algorithm called the Controllable Epsilon-greedy policy, presented in Algorithm 3. In the new implementation, the policy function receives the possible transitions $A^i$ enabled from a state, their controllable nature, and the list of transition probabilities $P$.
First, the algorithm stores all possible uncontrollable actions of $A^i$ in a list of possible uncontrollable transitions called $A^i_u$, i.e., $A^i_u = A^i \cap A_u$. A similar idea applies to controllable transitions, i.e., $A^i_c = A^i \cap A_c$. If $A^i_u \neq \emptyset$, then we iterate over $A^i_u$, in which $A^i_u[j] = a^i_{u_j}$, and verify whether $P[a^i_{u_j}] > 0$. If so, then we store the tuple
Algorithm 3: Controllable Epsilon-greedy policy
1: Receive $s_i$, $Q$, $\epsilon$, $A^i$, $P$, $A_u$ as input;
2: $A^i_u = A^i \cap A_u$;
3: $A^i_c = A^i \cap A_c$;
4: if $A^i_u \neq \emptyset$ then
5:   Create list $\Delta$;
6:   while Iterate over $A^i_u$ such that $A^i_u[j] = a^i_{u_j}$ do
7:     if $P(a^i_{u_j})$ then
8:       Generate random number $\zeta_i$ between 0 and 1;
9:       $\Delta \leftarrow (a^i_{u_j}, P(a^i_{u_j}), \zeta_i)$;
10:      Remove $a^i_{u_j}$ from $A^i$;
11:    end
12:  end
13:  if $\Delta \neq \emptyset$ then
14:    Shuffle $\Delta$;
15:    while Iterate over $\Delta$ do
16:      if $\zeta_i > P(a^i_{u_j})$ then
17:        Choose action $a^i_{u_j}$;
18:      end
19:    end
20:  end
21: end
22: if $A^i_c \neq \emptyset$ then
23:   Generate random number $x$ between 0 and 1;
24:   if $x > \epsilon$ then
25:     Choose the $A^i_c$ action with highest Q-value in $Q(s_i)$;
26:   else
27:     Choose random action in $A^i$;
28:   end
29: else
30:   Choose random action in $A^i$;
31: end
$(a^i_{u_j}, P(a^i_{u_j}), \zeta_i)$ in a list called $\Delta$, where: $a^i_{u_j}$ is the possible uncontrollable action of index $j$ in $A^i_u$; $P(a^i_{u_j})$ is the probability of occurrence of action $a^i_{u_j}$; and $\zeta_i$ is a randomly generated number for each action $a^i_{u_j}$.
If the list $\Delta$ is not empty, i.e., there is a probability higher than 0 for events in the set $A^i_u$, then we shuffle the list $\Delta$ and iterate over it. If, for any of the elements, $\zeta_i$ is higher than $P(a^i_{u_j})$, we choose the action $a^i_{u_j}$. The shuffle of the list $\Delta$ is necessary to simulate situations where a number $n > 1$ of uncontrollable events may occur, and it is obviously unknown which one occurs first (e.g., a broken machine or a power shutdown).
In case none of the uncontrollable events is triggered, the events of $A^i_u$ are removed from the set $A^i$. If $A^i_c$ is not empty, a random number between 0 and 1 is generated. If this number is greater than $\epsilon$, the agent chooses the action with the highest Q-value in $A^i_c$; otherwise it chooses a random possible action.
The diagram in Figure 8 summarizes the steps for making the Gym wrapper adaptations from a DES, starting from the initial modeling of the system until the effective application of RL algorithms.
Figure 8: Conversion methodology. The diagram comprises the following steps: model the DES in Supremica; obtain the FSM $K$; export the FSM $K$ as an XML structure; parse all of the information in the structure to a Gym environment; create attributes for states, controllable and uncontrollable events, transitions, possible transitions, and initial and terminal states; create a reward list for the whole action space of the environment; create a list of transition probabilities for the actions whose frequency of occurrence is known; and apply an RL algorithm using the controllable epsilon-greedy policy.
Table 3 summarizes the extra sets that, together with the DES model parameters in Table 2, are used in the Gym wrapper.
• $R$: list of rewards for the DES's events.
• $P$: list of probabilities for the DES's events (optional for the Gym environment).
• $A^i$: set of possible actions in state $s_i$.
• $A^i_c$: set of possible controllable actions in state $s_i$.
• $A^i_u$: set of possible uncontrollable actions in state $s_i$.
Table 3: List of extra sets for the MDP conversion
6 EXAMPLES OF USE OF THE ENVIRONMENT
Two examples are presented in this section to illustrate the applicability of the developed tool. The first is a DES with two machines and an intermediary buffer to stock workpieces. The second example is the system with the two concurrent transmitters anticipated in Section 3.4. Both examples are divided into 3 steps:
(i) Modeling the system, to obtain the FSM $K$;
(ii) Converting the DES model to an MDP environment;
(iii) Applying the reinforcement learning algorithm, and discussing and reviewing the results.
Also, according to [26], we may state that both of our environments are discrete, completely observable, and static. Both examples and the conversion tool are available at the GitHub repository [38].
6.1 Two machines with intermediate buffering
As a case study, consider the manufacturing system shown in Figure 9, composed of two machines, M1 and M2, and an intermediary buffer B. Machine M1 picks up a workpiece (event a1), processes it, and delivers it to the buffer B (event b1). Machine M2 picks up the workpiece from B (event a2), processes it, and removes it from the system (event b2). The buffer is assumed to hold only one workpiece at a time, and the machines may break down.
Figure 9: Example of manufacturing system (machines M1 and M2 connected through the buffer B; events a1, b1, a2, b2).
6.1.1 Modelling of the system. In this example, we consider the events a_i as controllable, since we can prevent a machine from starting, and the events b_i as uncontrollable, since we cannot force a machine to conclude a job. We also consider that events c_i model breakdowns of the machines M_i and are uncontrollable, while events r_i model the respective repairs and are controllable. The plant models for M1 and M2 are presented in Figure 10.

An event a_i indicates that machine M_i starts operating, while b_i leads the automaton back to its initial state, also marking a completed task. While operating, the machine can crash, which is signalled by the event c_i, and be repaired by an event r_i, which leads the machine back to the initial state. Uncontrollable transitions are drawn in red to facilitate their identification, while the others are kept in black.

Assume that the control objective is to avoid underflow and overflow in the buffer by disabling events when necessary. For this purpose, we compose the plant model with the restriction modeled by the FSM R, shown in Figure 11.
Figure 10: Plant models G_i for machines M_i, i = 1, 2 (states S0, S1, S2; events a_i, b_i, c_i, r_i).
Figure 11: Underflow and overflow restriction R (two states, S0 and S1, with transitions b1 and a2).
Model R controls both underflow and overflow of workpieces in the buffer by, respectively, disabling a2 in the initial state (when the buffer is empty, state S0) and disabling b1 otherwise (when the buffer is already full, state S1).
By composing the plant and restriction models, one obtains the FSM K = G1 ‖ G2 ‖ R, which models the expected behavior for the system under control. K has 18 states and 42 transitions, and it is shown in Figure 12.
6.1.2 Converting DES to MDP. After parsing the model from the XML structure, we call the render function implemented in the Gym environment to reveal the structure of K as seen by the RL environment. The result is presented in Figure 12, which shows the initial state of the environment right after calling the reset function (note the state painted in green).

Figure 13 then shows the environment after taking the action a1 (note the transition painted in purple), which makes the automaton evolve to state S3. In state 0, none of the machines is operating. After an action a1, the FSM moves to state 3, where machine M1 is operating until an uncontrollable action b1 or c1 triggers, meaning that machine M1 either delivers a workpiece to the buffer or crashes, respectively. The remaining evolution follows a similar reasoning for every triggered action.
Before an RL algorithm can be effectively applied, a reward list for all actions in the environment and a probability list for the uncontrollable actions still have to be constructed. We consider that every movement that does NOT lead a machine to produce a workpiece returns a loss of -1 as the default value, and that every movement leading to the production of a workpiece returns a profit of 10; in this case, only action b2.
We also consider that a machine crash produces a loss of -4, because of the wasted time and the cost of repair, and that a crash has a 5% chance of happening. The resulting reward list is shown in Table 4, where the values marked with * are defaults.
Action set A   Reward set R   Probability set P (%)
a1             -1*            not applicable
b1             -1*            not specified
a2             -1*            not applicable
b2             10             not specified
c1             -4             5
c2             -4             5
r1             -1*            not applicable
r2             -1*            not applicable
Table 4: Rewards for the manufacturing system example.
It is worth remembering that the controllable actions are chosen by the control agent, so there is no need to adopt probabilities for their occurrence, which is shown as not applicable in the table. There are also some uncontrollable actions that are not associated with any known frequency of occurrence; for them, the probability is set as not specified. We are now in a position to apply the RL algorithm.
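As an illustration, the reward and probability lists of Table 4 could be expressed as plain Python dictionaries before being handed to the environment; this dictionary-based representation is an assumption for illustration, not necessarily the exact data structure used by the tool.

# Rewards of Table 4: -1 is the default, b2 yields the production profit,
# and the crash events c1/c2 are penalized.
rewards = {"a1": -1, "b1": -1, "a2": -1, "b2": 10,
           "c1": -4, "c2": -4, "r1": -1, "r2": -1}

# Probabilities are given only for the uncontrollable events whose
# frequency of occurrence is known (here, a 5% chance of crashing).
probabilities = {"c1": 0.05, "c2": 0.05}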
6.1.3 Applying RL and reviewing the results. In this example, we use the Q-learning algorithm to train the environment, considering that an episode ends after the agent performs 60 actions. For the action selection, we use the policy implemented in Algorithm 3 and train the environment for 100 episodes; a minimal sketch of such a training loop is given below. The resulting Q-table is shown in Table 5.
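Under the same assumptions as the previous sketches (the hypothetical env wrapper with possible_actions, controllable_events, and uncontrollable_events attributes, the reward and probability dictionaries defined above, and the controllable_epsilon_greedy function sketched earlier), such a loop could look as follows; the learning rate and discount factor are illustrative choices.

from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = defaultdict(lambda: defaultdict(float))       # Q[state][action]

for episode in range(100):
    state = env.reset()
    for _ in range(60):                           # episode ends after 60 actions
        actions = env.possible_actions(state)     # assumed helper: enabled events in state
        action = controllable_epsilon_greedy(
            state, Q, epsilon, actions, probabilities,
            env.uncontrollable_events, env.controllable_events)
        next_state, reward, done, _ = env.step(action)
        next_actions = env.possible_actions(next_state)
        best_next = max((Q[next_state][a] for a in next_actions), default=0.0)
        # Standard tabular Q-learning update.
        Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
        state = next_state
        if done:
            break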
       a1     a2     b1     b2     c1     c2     r1     r2
St. 0 13.14 - - - - - - -
St. 1 20.69 - - 21.80 - 0.51 - -
St. 2 5.99 - - - - - - 7.24
St. 3 - - 15.74 - 3.90 - - -
St. 4 - - 23.03 24.15 8.99 5.94 - -
St. 5 - - 8.29 - -1.76 - - 12.20
St. 6 - - - - - - 10.58 -
St. 7 - - - 17.83 - -0.76 15.89 -
St. 8 - - - - - - 0.36 1.25
St. 9 17.63 18.61 - - - - - -
St. 10 24.72 - - 26.71 - 5.01 - -
St. 11 8.75 - - - - - - 14.27
St. 12 - - - - - - - -
St. 13 - - - - - - - -
St. 14 - - - - - - - -
St. 15 - - - - - - - -
St. 16 - - - - - - - -
St. 17 - - - - - - - -
Table 5: Q-table for the manufacturing system example.
The symbol "-" marks the transitions that cannot be triggered. This helps to identify the most valuable actions in each state, remembering that the agent can only choose among the controllable actions (a1, a2, r1, and r2).
Figure 12: FSM modeling K, the input to the Gym environment.

Figure 13: Automaton K's state S4 represented in the Gym environment.

According to this Q-table, in state 0 the most valuable action to be chosen is a1. In fact, this is the only possible action: a2 would also be possible, but it has been disabled by control (see the restriction model R). In state 3, there are two possible actions, b1 and c1, but both are uncontrollable and cannot be taken by the control agent, even knowing that action b1 would return a better profit. In state 9, there are two controllable actions to choose from, a1 and a2; action a2 has a higher Q-value because it gives the immediate profit of 10, while choosing a1 will eventually lead to this reward, but not immediately.
We further show, in Table 6, the Q-table generated by the algorithm when the probability of machine 1 crashing (event c1) is set to 100%. That is, action c1 occurs every time the FSM is in a state where it is eligible. In this case, the FSM always follows the path through the states 0, 3, 6, 0, which is in fact the only possible path under this assumption. Note that the table only shows values different from zero in 3 cells, corresponding to the transitions from state 0 to 3, 3 to 6, and 6 to 0. In summary, the table shows that, under this assumption, it does not matter which actions are chosen, as the agent will always incur a loss.
We can conclude that this example was successfully modeled, converted to an MDP environment, and trained via an RL algorithm. Next, we present the second case study.
6.2 Transmitters Sharing a Channel
This example was previously introduced in Section 3.4: two transmitters T1 and T2 share a communication channel C that supports only one communication request at a time, as shown in Figure 14.
Figure 14: Example of a concurrent transmission system (transmitters T1 and T2 sharing the channel C).
6.2.1 Modelling of the system. The step-by-step modelling of this example was already shown in Section 3.4, so here we focus on the final composition K_T in Figure 16. The plant and restriction models are repeated in Figure 15.
By composing G_T1, G_T2, and R, we obtain the FSM K_T = G_T1 ‖ G_T2 ‖ R, with 8 states and 14 transitions, shown in Figure 16.
6.2.2 Conversion from DES to MDP. To convert the system to an MDP, the FSM K_T was exported to an XML structure and parsed through the Gym environment's methods to extract the DES information.
        a1      a2      b1      b2      c1      c2      r1      r2
St. 0 -13.56 - - - - - - -
St. 1 - - - - - - - -
St. 2 - - - - - - - -
St. 3 - - - - -14.69 - - -
St. 4 - - - - - - - -
St. 5 - - - - - - - -
St. 6 - - - - - - -12.61 -
St. 7 - - - - - - - -
St. 8 - - - - - - - -
St. 9 - - - - - - - -
St. 10 - - - - - - - -
St. 11 - - - - - - - -
St. 12 - - - - - - - -
St. 13 - - - - - - - -
St. 14 - - - - - - - -
St. 15 - - - - - - - -
St. 16 - - - - - - - -
St. 17 - - - - - - - -
Table 6: Q-table for the manufacturing system example under the assumption that a break is 100% certain to occur.
Figure 15: Example of a concurrent transmission system: (a) plant models G_T1 and G_T2, with events req_i, tran_i, and ack_i; (b) mutual exclusion restriction R for the channel C, involving tran_1, tran_2, ack_1, and ack_2.
Since the FSM has 8 states, it can be visualized straightforwardly by calling the render function of the Gym environment, which returns the image presented in Figure 16.
Figure 16: Automaton K_T for the transmitters example.
It is worth remembering that black transitions are triggered by controllable events, while red ones are triggered by uncontrollable events.
The list of rewards and triggering probabilities adopted for each action in the system is shown in Table 7. To differentiate the use of the two transmitters, we intentionally assign distinct rewards to the ack signals: ack1 receives a reward of 2, while ack2 receives a reward of 3, indicating that, whenever it has the choice, the RL algorithm should prefer transmitter T2.
Action   Reward   Probability set P (%)
req1     -1*      not applicable
req2     -1*      not applicable
tran1    -1*      not specified
tran2    -1*      not specified
ack1      2       not specified
ack2      3       not specified
Table 7: Rewards adopted for the two transmitters example.
6.2.3 Applying the algorithm and reviewing the results. For this case study, we use the Deep Q-learning algorithm and consider the rewards and transition probabilities from Table 7. The neural network used consists of an embedding layer as input, which encodes the state entering the network into 10 distinct values, followed by 3 fully connected layers with 50 neurons each. The output layer has as many neurons as there are possible actions in the environment, 6 in this case.
The input and output of the Deep Q network are represented in Figure 17: the network receives as input the current state s_i, which can be any number between 0 and 7 (see Fig. 16), and processes it to obtain the Q-values of all actions related to s_i, in this case 6, one per action in the action space.
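A PyTorch sketch of a network with this shape is given below. The class name, the activation functions (ReLU is assumed, since the paper does not specify them here), and the default hyperparameter values are illustrative assumptions.

import torch
import torch.nn as nn

class DeepQNet(nn.Module):
    """Sketch of the described network: an embedding of the 8 discrete states
    into 10 values, three fully connected layers of 50 neurons each, and one
    output neuron per action (6 for the transmitters example)."""

    def __init__(self, n_states=8, embedding_dim=10, hidden=50, n_actions=6):
        super().__init__()
        self.embedding = nn.Embedding(n_states, embedding_dim)
        self.layers = nn.Sequential(
            nn.Linear(embedding_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),       # Q-values, one per action
        )

    def forward(self, state):
        # state: LongTensor of state indices, shape (batch,)
        x = self.embedding(state)
        return self.layers(x)

# Example: Q-values for state s_i = 0
net = DeepQNet()
q_values = net(torch.tensor([0]))   # tensor of shape (1, 6)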
Figure 17: Deep Q system for the transmitters example (input: state s_i; outputs: Q(s_i, req1), Q(s_i, req2), Q(s_i, tran1), Q(s_i, tran2), Q(s_i, ack1), Q(s_i, ack2)).
Upon training for 100 episodes, the resulting Q-table is shown
in Table 8.
       ack1    ack2    req1    req2    tran1    tran2
St. 0 0.01 -0.02 5.28 6.67 0.01 -0.04
St. 1 0.04 -0.00 6.36 0.05 -0.02 7.75
St. 2 -0.04 8.85 6.94 -0.01 0.04 -0.04
St. 3 -0.03 -0.01 -0.04 6.36 5.87 0.02
St. 4 -0.01 -0.05 0.02 0.00 7.46 7.16
St. 5 -0.03 9.29 0.03 0.04 0.02 -0.04
St. 6 6.64 0.00 0.02 6.94 -0.00 -0.02
St. 7 9.67 -0.04 -0.02 -0.05 0.02 -0.03
Table 8: Q-table representing the training for the two transmitters example.
Note that only a few values in the table are not close to zero; the values close to zero correspond to the transitions that the FSM cannot trigger. As the neural network does not distinguish between possible and impossible transitions, it keeps updating its weights to minimize the loss function in Equation 6, which drives these Q-values toward zero, since impossible transitions have value zero in the Q-table.
Also remark that the only state in which the agent chooses between two or more controllable actions is state 0, where it can choose between actions req1 and req2. Since using transmitter 2 leads, a few steps ahead, to the reward of +3 associated with the uncontrollable signal ack2, the Q-value for req2 is higher, indicating that the agent prefers it.
7 FUTURE OPTIMIZATIONS
In this paper, we consider that the probability distribution is uniform among all events. In future studies we intend to adapt these probabilities to other kinds of distributions, such as the exponential distribution: for example, on the first day (episode) a machine may have a 5% chance of crashing, on the second day 10%, and in subsequent days an increasingly higher probability.
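As an illustration only (the paper does not prescribe a specific schedule), such an episode-dependent crash probability could be generated by a simple function:

def crash_probability(episode, p0=0.05, growth=2.0, cap=1.0):
    """Hypothetical schedule: 5% on day 1, 10% on day 2, and so on,
    doubling each episode until it saturates at `cap`."""
    return min(cap, p0 * growth ** (episode - 1))

# crash_probability(1) -> 0.05, crash_probability(2) -> 0.10, crash_probability(5) -> 0.80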
Another important modification is to support parsing not only of XML structures, but also of other Supremica output formats, such as WMOD. This will make it easier for researchers to export any automaton structure to Gym and begin training without worrying about the file format.
There is also the possibility of implementing a neural network with a variable number of neurons in the output layer. In the example of Section 6.2, the implementation of the Deep Q network assumed that all actions can be taken in every state of the environment; when working with FSMs, however, only some actions are possible in a given state, so Table 8 shows that the network wasted some processing and space that is not necessary in this case. Moreover, complex systems with a huge state space demand larger neural networks and may also require more complex structures, such as recurrent LSTM cells.
8 FINAL CONSIDERATIONS
This article aimed to develop a tool that allows problems related to Industry 4.0, whenever they can be modeled as discrete event systems, to be solved using reinforcement learning methods.

The work not only explored the tool itself, but also pointed out the differences between RL and DES environments and focused on resolving them in order to create methods capable of transforming DESs into RL environments. The similarity between the two types of environments allowed the creation of an easy-to-use tool capable of optimizing DES models through training via RL.

The use of RL in DESs has practical appeal, considering that the reward system used in RL algorithms can reflect many aspects of Industry 4.0 systems modeled as DESs, such as the cost of using factory machines and, on the other side, the profit obtained when the machines produce well. RL therefore emerges as an alternative that can anticipate whether or not certain actions are attractive for the factory. The price to be paid is a minor engineering effort: engineers simply have to provide the input model constructed as an FSM and a reward list for each triggered event in the system, with the option of also adding probabilities for uncontrollable events.

Finally, we believe that the paper and the tool have the potential to serve as a foundation for future studies involving RL and DESs, especially in Industry 4.0-aware scenarios.
REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, and others. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. http://tensorflow.org/ Software available from tensorflow.org.
[2] T. Bangemann, M. Riedl, M. Thron, and C. Diedrich. 2016. Integration of Classical Components Into Industrial Cyber-Physical Systems. Proc. IEEE 104, 5 (May 2016), 947–959.
[3] A. V. Bernstein and E. V. Burnaev. 2018. Reinforcement learning in computer vision. In Tenth International Conference on Machine Vision (ICMV 2017), Antanas Verikas, Petia Radeva, Dmitry Nikolaev, and Jianhong Zhou (Eds.), Vol. 10696. International Society for Optics and Photonics, SPIE, 458–464. https://doi.org/10.1117/12.2309945
[4] Greg Brockman, Vicki Cheung, Ludwig Pettersson, and others. 2016. OpenAI Gym. arXiv:1606.01540
[5] Christos G. Cassandras and Stephane Lafortune. 2009. Introduction to Discrete Event Systems. Springer Science & Business Media.
[6] R. Drath and A. Horch. 2014. Industrie 4.0: Hit or Hype? [Industry Forum]. IEEE Industrial Electronics Magazine 8, 2 (June 2014), 56–58.
[7] R. Harrison, D. Vera, and B. Ahmad. 2016. Engineering Methods and Tools for Cyber-Physical Automation Systems. Proc. IEEE 104, 5 (May 2016), 973–985.
[8] Daniel Hein, Stefan Depeweg, Michel Tokic, and others. 2017. A Benchmark Environment Motivated by Industrial Control Problems. (09 2017), 1–8.
[9] L. P. Kaelbling, M. L. Littman, and A. W. Moore. 1996. Reinforcement learning: A survey. Journal of Artificial Intelligence Research 4 (1996), 237–285.
[10] Henning Kagermann, Wolfgang Wahlster, and Johannes Helbig. 2013. Recommendations for implementing the strategic initiative INDUSTRIE 4.0. Final report of the Industrie 4.0 Working Group (April 2013), 1–82.
[11] Henning Kagermann, Wolfgang Wahlster, and Johannes Helbig. 2013. Recommendations for Implementing the Strategic Initiative INDUSTRIE 4.0 – Securing the Future of German Manufacturing Industry. Final Report of the Industrie 4.0 Working Group. acatech – National Academy of Science and Engineering, München. http://forschungsunion.de/pdf/industrie_4_0_final_report.pdf
[12] Kallilmiguel. 2019. kallilmiguel/automata_gym. https://github.com/kallilmiguel/automata_gym
[13] Jens Kober, J. Bagnell, and Jan Peters. 2013. Reinforcement Learning in Robotics: A Survey. The International Journal of Robotics Research 32 (09 2013), 1238–1274. https://doi.org/10.1177/0278364913495721
[14] Maxim Lapan. 2018. Deep Reinforcement Learning Hands-On: Apply Modern RL Methods, with Deep Q-Networks, Value Iteration, Policy Gradients, TRPO, AlphaGo Zero and More. Packt Publishing, Birmingham, UK.
[15] Yuxi Li. 2017. Deep Reinforcement Learning: An Overview. CoRR abs/1701.07274 (2017). arXiv:1701.07274 http://arxiv.org/abs/1701.07274
[16] Y. Liu, Y. Peng, B. Wang, S. Yao, and Z. Liu. 2017. Review on cyber-physical systems. IEEE/CAA Journal of Automatica Sinica 4, 1 (Jan 2017), 27–40.
[17] Jelena Luketina, Nantas Nardelli, Gregory Farquhar, and others. 2019. A Survey of Reinforcement Learning Informed by Natural Language. CoRR abs/1906.03926 (2019). arXiv:1906.03926 http://arxiv.org/abs/1906.03926
[18] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, and others. 2013. Playing Atari with Deep Reinforcement Learning. arXiv:1312.5602
[19] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, and others. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (Feb. 2015), 529–533. https://doi.org/10.1038/nature14236
[20] László Monostori. 2014. Cyber-physical Production Systems: Roots, Expectations and R&D Challenges. Procedia CIRP 17 (2014), 9–13. https://doi.org/10.1016/j.procir.2014.03.115
[21] OpenAI. [n.d.]. A toolkit for developing and comparing reinforcement learning algorithms. https://gym.openai.com/
[22] Adam Paszke, Sam Gross, Francisco Massa, and others. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 8024–8035.
[23] Yassine Qamsane, Mahmoud El Hamlaoui, Tajer Abdelouahed, and Alexandre Philippot. 2018. A Model-Based Transformation Method to Design PLC-Based Control of Discrete Automated Manufacturing Systems. In International Conference on Automation, Control, Engineering and Computer Science. Sousse, Tunisia, 4–11.
[24] P. J. G. Ramadge and W. M. Wonham. 1989. The control of discrete event systems. Proc. IEEE 77, 1 (1989), 81–98. https://doi.org/10.1109/5.21072
[25] Ferdie F. H. Reijnen, Martijn A. Goorden, Joanna M. van de Mortel-Fronczak, and Jacobus E. Rooda. 2020. Modeling for supervisor synthesis – a lock-bridge combination case study. Discrete Event Dynamic Systems 1 (2020), 279–292.
[26] Stuart J. Russell and Peter Norvig. 2009. Artificial Intelligence: A Modern Approach (3rd ed.). Pearson.
[27] André Lucas Silva, Richardson Ribeiro, and Marcelo Teixeira. 2017. Modeling and control of flexible context-dependent manufacturing systems. Information Sciences 421 (2017), 1–14.
[28] David Silver, Aja Huang, Chris J. Maddison, and others. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (Jan. 2016), 484–489. https://doi.org/10.1038/nature16961
[29] Phil Simon. 2013. Too Big to Ignore: The Business Case for Big Data (1st ed.). Wiley Publishing.
[30] Richard S. Sutton. 1991. Dyna, an Integrated Architecture for Learning, Planning, and Reacting. SIGART Bull. 2, 4 (July 1991), 160–163. https://doi.org/10.1145/122344.122377
[31] Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction (2nd ed.). The MIT Press. http://incompleteideas.net/book/the-book-2nd.html
[32] Richard S. Sutton, Andrew G. Barto, et al. 1998. Reinforcement Learning: An Introduction. MIT Press.
[33] Prasad Tadepalli and Dokyeong Ok. 1996. H-learning: A Reinforcement Learning Method to Optimize Undiscounted Average Reward. (03 1996).
[34] Gerald Tesauro. 2002. Programming backgammon using self-teaching neural nets. Artificial Intelligence 134, 1-2 (Jan. 2002), 181–199. https://doi.org/10.1016/s0004-3702(01)00110-2
[35] Christopher J. C. H. Watkins and Peter Dayan. 1992. Q-learning. In Machine Learning. 279–292.
[36] Christopher J. C. H. Watkins and Peter Dayan. 1992. Q-learning. Machine Learning 8, 3-4 (May 1992), 279–292. https://doi.org/10.1007/bf00992698
[37] Michael Wooldridge and Nicholas R. Jennings. 1995. Intelligent agents: theory and practice. The Knowledge Engineering Review 10, 2 (1995), 115–152. https://doi.org/10.1017/S0269888900008122
[38] Kallil M. C. Zielinski, Marcelo Teixeira, Richardson Ribeiro, and Dalcimar Casanova. 2020. Concept and the implementation of a tool to convert industry 4.0 environments modeled as FSM to an OpenAI Gym wrapper. https://github.com/kallilmiguel/automata_gym
[39] Knut Åkesson, Martin Fabian, Hugo Flordal, and Robi Malik. 2006. Supremica – An integrated environment for verification, synthesis and simulation of discrete event systems. In Proceedings of the Eighth International Workshop on Discrete Event Systems (WODES 2006), 384–385. https://doi.org/10.1109/WODES.2006.382401