Control of RTM processes through Deep Reinforcement Learning

Abstract—Resin transfer molding (RTM) is a composite manufacturing process that uses a liquid polymer matrix to create complex-shaped parts. There are several challenges associated with RTM. One of the main challenges is ensuring that the liquid polymer matrix is properly distributed throughout the composite material during the molding process. If the matrix is not evenly distributed, the resulting part may have weak or inconsistent properties. This is the challenge we tackle with the approach presented in this work. We implement an online control using deep reinforcement learning (RL) to ensure a complete impregnation of the reinforcing fibers during the injection phase, by controlling the input pressure on different inlets. This work uses this self-learning paradigm to actively control the injection of an RTM process. This has the advantage of depending on a reward function instead of a mathematical model of the process, as would be required for model predictive control. A reward function is more straightforward to model and can be applied and adapted to more complex problems. RL algorithms have to be trained through many iterations, for which we developed a simulation environment with a distributed and parallel architecture. We show that the presented approach halves the failure rate, from 54 % to 27 %, compared to the same setup with steady parameters.
Simon Stieber, Leonard Heber, Christof Obertscheider, Wolfgang Reif
Institute for Software & Systems Engineering, University of Augsburg, Augsburg, Germany
Email: {simon.stieber, leonard.heber, wolfgang.reif}
University of Applied Sciences Wiener Neustadt, Wiener Neustadt, Austria
Index Terms—Reinforcement Learning, Resin Transfer Molding, Control
Fiber-reinforced plastics (FRP) are composite materials in
which the mechanical properties of conventional plastics are
improved by the introduction of reinforcing fibers. Components,
especially when reinforced with carbon fiber, are very light
and can withstand large forces in the direction of the fibers.
These properties can be exploited to save weight and increase
stiffness compared to conventional components made of steel
or aluminum [1]. Further, they render this material especially
interesting for the aviation and automotive industry since
weight savings lead to lower energy consumption, and thus
FRP are a key to a more environmentally friendly mobility.
The RTM process (resin transfer molding) [2] is a widely
used industrial process for manufacturing such components.
Characteristic is the usage of a closed mold whose shape
resembles the final part. The fibers are used in the form of a textile,
which can, for example, be woven. Multiple layers of textile can be
stacked to build a textile preform, which is placed in the mold.
Next, a liquid polymer matrix, i.e. the resin, is injected into
the mold via several sprues. After the resin has cured, the part
can be removed from the mold. Usually, the process is carried
out in an opaque mold. This makes it easier to construct
the mold, because it has to withstand high pressures during the
resin injection. Yet, especially in experimental work, it can be
desirable to visually track the resin flow. Other approaches and
processes, including the usage of transparent molds and sensor
arrays, will be presented in Section II.
During the injection of the resin, disturbances that stem
from the irregular nature of the textile can occur that lead
to an incomplete impregnation of the textile, which reduces
the stability of the manufactured component. This leads to
a high reject rate, which can make the production of FRP
components uneconomical and ecologically unsound. The border
between already impregnated areas and dry areas is called the
flow front. An irregular spreading of the flow front can lead
to an air entrapment in the worst case, which is called a
dry spot. By controlling the pressure at which the resin is
injected into the mold, the flow front can be influenced in
order to ensure a uniform, complete and fast impregnation
of the textile. One difficulty in designing such a controller is
the nonlinear and complex relationship between the injection
pressure and the spreading of the resin [3]. This complicates
the classical expert-based controller design, which is based on
the formation of a mathematical model of the controlled sys-
tem. Therefore, data-driven approaches using various machine
learning models have been developed in related work [3]–
[5]. Initial successes have been achieved, resulting in a more
regular spreading of the resin.
In this work, we use Reinforcement Learning (RL) to
optimize the injection pressure in an RTM process. A system
trained by an RL algorithm is called an agent. This agent
learns to interact with its environment through trial and error.
In doing so, it is driven by rewards it receives in response to
its actions, which it tries to maximize [6]. The experiments
for this work were performed using a simulation of the RTM
process. Running a real machine would be too expensive
and burdensome to perform the number of runs necessary
to train the algorithms. In addition, a distributed and parallel
architecture was implemented to take advantage of available
computational resources during training. The overall approach
is depicted in Fig. 1.
Fig. 1. Overview of our setup: The agent receives an observation and a reward and chooses an action. The simulation acts as the environment, which takes the action and returns the next observation, containing the flow front image, the fiber volume content (FVC) map, and the pressure image.
Previous work on the optimization of the RTM process can
be divided into passive and active methods. Passive methods
are used to optimize various process parameters in advance,
with no further influence being exerted during ongoing pro-
duction. Szarski et al. [7] used RL to optimize fluid flow in a
process similar to RTM by determining the placement of flow
enhancers a priori.
The active, or online, methods are applied during the process
to control certain parameters based on real-time measurements.
Thus, unforeseen disturbances can be reacted to. The approach
of this work can be classified as an active method, where the
measurement is an image of the flow front, inter alia, and the
variable to be controlled, the actuator, is the pressure profile
applied to the resin inlets. Demirci and Coulter [4] trained
an artificial neural network using a numerical simulation to
predict the flow front position at the next discrete time step
from the flow front and the injection pressure. They defined
an optimal flow front as the control target. The output of the
controller is the pressure profile that minimizes the difference
between the flow front position predicted by the model and
the predefined target flow front. To determine this pressure
profile, they tested several optimization methods, of which the
Downhill Simplex method of Nelder and Mead [8] proved to be
the most efficient [9]. Nielsen and Pitchumani also trained an
artificial neural network to predict the future fluid advancement
by using a numerical simulation [10]. To determine the optimal
pressure profile, they used the Simulated Annealing method,
a heuristic optimization procedure [11]. In another work, they
extended their model to include a fuzzy logic estimate of the
permeability of the textile [12]. In real experimental setups,
the resulting flow front was found to sufficiently approximate
the desired one. Wang et al. [3] designed a model predictive
controller (MPC) with a flow front specification. They use an
autoregressive model with exogenous input (ARX) to account
for the nonlinear characteristics of the RTM process. The
parameters of the model can be identified online, i.e. at run-
time, by a recursive least squares method. The controller was
tested on a rectangular plate component, with an additional
textile layer added to create an obstacle to the flow front. A
response to the disturbance was evident in the applied pressure
profiles. These works have in common that they measure the
performance of their approach by the deviation of
the actual flow front from an optimal flow front, which, for a
linear injection from one side, would be a straight line
perpendicular to the flow direction.
Our work differs from those approaches because model-free
RL approximates functions that map measured values directly
to the next action. Therefore, no model of the process is needed
and no elaborate optimization needs to be performed at each
step. Further, we do not use the straightness of the flow front
as a metric per se, but we optimize to reduce dry spots and
thus reduce the failure rate of the process. The form of the
flow front is a factor in our proposed reward function (cf.
Section IV-C).
Another area of research concerning the RTM process is
online monitoring and analysis. The goal is to obtain as much
information as possible about the running process, even though
it takes place in a closed tool, in order to perform analysis and
optimization. In the implementation of this work, such issues
were rather secondary, since the approach is tested exclusively
in a simulative environment. However, since the motivation
of this work is real-world implementation, a number of pa-
pers dealing with this are presented here. Stieber et al. [13]
developed a concept for building an RTM process, where a
digital twin of the process could be implemented by using a
variety of different sensors and a simulation. The focus was the
in-situ monitoring, using for example pressure sensors inside
the mold. In another work [14], coarse pressure sensor data
providing binary information about the flow front progression
at certain points within the mold was used to generate images
of the flow front using machine learning models, since such images
are not directly obtainable when the tooling is opaque. Those
images were then analyzed for the occurrence of dry spots,
with the goal of reducing the average cycle time of the process
by stopping faulty runs as early as possible. Other works [12],
[15]–[17] focused on the prediction of the material properties
from the resin flow to build additional information on the
process and the finished product. Grössing et al. [18] assessed
how well different simulation programs can reproduce the
resin flow behavior of the RTM process. For comparison with
reality, they used a setup in which the upper half of the mold
is transparent. This allowed pictures of the flow front to be
taken with a camera.
RL is a paradigm of machine learning that involves an agent
that learns by interacting with an environment and receiving
feedback in the form of rewards or penalties. The goal is
to learn an optimal policy, which is a mapping of states
to actions that will maximize the reward over a series of
interactions. A sequence of interactions is called a trajectory or,
if it is of finite length, an episode. RL algorithms define how
experienced state transitions are used to improve the agent’s
policy and can roughly be classified into two groups: model-based
and model-free methods. The algorithms employed here,
A2C and PPO, belong to the model-free methods, learning
policies solely from experienced state transitions and received
rewards while neither having prior knowledge of the environment’s
dynamics nor trying to model them. Actor-critic RL algorithms
combine two early model-free approaches, the value-based
and the policy-based methods. Value-based methods seek to
estimate the long-term value of states, which reduces the
task of finding optimal action sequences to choosing the
action immediately yielding the highest value. In policy-based
methods, the goal is to approximate a policy function that
maps states to actions. One way to achieve this is the usage of
the policy gradient theorem [6], which leads to a differentiable
expression describing the influence of an action on the received
rewards. This enables the use of gradient descent methods
to optimize the policy function based on experienced trajec-
tories. Actor-critic algorithms simultaneously train a policy
and a value function. The prediction of a state’s long-term
value is used to optimize the policy. This leads to a higher
sample efficiency and better convergence properties [19]. The
usage of deep neural networks to approximate those functions
makes those methods viable for complex real-world problems,
because it enables them to interact with high-dimensional
observations, like images, and to cope with nonlinear policies.
They are called deep reinforcement learning (DRL) methods.
Advantage actor-critic (A2C) [20] implements the actor-
critic pattern by using the advantage function as the critic. The
advantage function estimates the advantage that is gained from
taking a certain action in a certain state. By optimizing the
policy with respect to the advantage estimate, the probability
of taking actions that lead to high rewards rises.
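As a rough illustration of the advantage estimate, the following sketch computes one-step advantages A_t = r_t + γ·V(s_{t+1}) − V(s_t) for a short trajectory. The function name, the discount factor, and the numbers are illustrative assumptions and not values from this work; practical A2C implementations typically use n-step or generalized advantage estimates instead.

```python
def one_step_advantages(rewards, values, last_value=0.0, gamma=0.99, terminal=True):
    """Return A_t = r_t + gamma * V(s_{t+1}) - V(s_t) for one trajectory.

    values[t] is the critic's estimate V(s_t). last_value is the estimate for
    the state after the final step; it is ignored if the trajectory ended in a
    terminal state, whose value is zero by definition.
    """
    advantages = []
    for t, r in enumerate(rewards):
        if t + 1 < len(rewards):
            next_value = values[t + 1]
        else:
            next_value = 0.0 if terminal else last_value
        advantages.append(r + gamma * next_value - values[t])
    return advantages


# Hypothetical three-step episode that ends with a large terminal reward.
adv = one_step_advantages(
    rewards=[-1.0, -1.0, 100.0],
    values=[80.0, 90.0, 95.0],
    gamma=1.0,
)
```

A positive advantage means the action turned out better than the critic expected, so the policy update raises its probability; a negative advantage lowers it.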
Proximal policy optimization (PPO) [21] works similarly to
A2C, but introduces a surrogate objective to replace the simple
advantage estimate. It has been found that too large policy
updates can cause instabilities. That led to the development
of trust-region methods that limit the policy update per step,
which results in a more monotonic improvement [22]. PPO
uses a clipping mechanism, which is more efficient, easier to
implement, and more broadly applicable than former trust-region methods.
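To make the clipping mechanism concrete, the sketch below evaluates the clipped surrogate objective of PPO for a single sample, min(ρ·A, clip(ρ, 1−ε, 1+ε)·A), where ρ is the probability ratio between the new and the old policy and A is the advantage estimate. The function name is hypothetical and the clip range ε = 0.2 is a commonly used default, assumed here rather than taken from this work.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO objective: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# A large step towards a good action is capped at (1 + eps) * A ...
capped_gain = clipped_surrogate(ratio=1.5, advantage=2.0)
# ... while a small step inside the clip range is left untouched.
free_gain = clipped_surrogate(ratio=1.1, advantage=2.0)
# For a bad action, lowering the ratio below (1 - eps) earns no further credit.
capped_loss = clipped_surrogate(ratio=0.5, advantage=-2.0)
```

Because the objective stops improving once the ratio leaves the interval [1−ε, 1+ε] in the favorable direction, a single update has no incentive to move the policy far away from the one that collected the data.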
In the following, the experimental setup, including the simulation,
the RL controller and RL hyperparameters, and the
actual experiment plan, is described.
A. Simulative Environment
For our application, as is common for DRL models, training
in a real-world setup is not feasible, since a large number
of iterations is required, which would result in many costly
experiments. Therefore, we used a numerical simulation of
the resin flow to provide a simulative environment resembling
the RTM process. Existing simulation programs did not match
our requirements, so we implemented a simulation based on
RTMSim [23]. The resin flow through a porous medium, i.e.
a textile, can be described with Darcy’s law, which defines the
flow speed as shown in Equation (1).
$$ v = -\frac{K}{\eta} \cdot \nabla p \qquad (1) $$
∇p denotes the pressure gradient between two points, K
describes the permeability of the textile and η is the viscosity
of the resin. To be able to execute comparable experiments,
some assumptions about process parameters had to be made.
Their values were chosen to be constant within realistic
magnitudes but would vary if, for example, other types of
resin or textile were used. We assume η to be 0.1 Pa·s and K
to be isotropic, meaning its value is the same in every flow
direction. The exact value of K is varied between experiments
and will be explained later on.
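For a feeling of the magnitudes involved, the following sketch evaluates Darcy's law for an idealized one-dimensional flow through the part. It assumes a base permeability of 1.464 × 10⁻⁹ m², a pressure difference of 5 bar between inlet and flow front, a flow length of 0.5 m, and a porosity of 1 − FVC = 0.65; the function names and the one-dimensional simplification are assumptions for illustration and not part of the simulation used in this work.

```python
def darcy_velocity(permeability, viscosity, pressure_drop, length):
    """Magnitude of the Darcy velocity for a linear pressure drop over `length`."""
    return permeability / viscosity * pressure_drop / length


def fill_time_1d(permeability, viscosity, pressure_drop, length, porosity):
    """Fill time of 1D flow at constant inlet pressure.

    Integrating Darcy's law for a moving front gives
    t = viscosity * porosity * length**2 / (2 * permeability * pressure_drop).
    """
    return viscosity * porosity * length**2 / (2.0 * permeability * pressure_drop)


K = 1.464e-9        # m^2, permeability of the base textile (assumed exponent)
ETA = 0.1           # Pa*s, resin viscosity
DP = 5.0e5          # Pa, assumed pressure difference of 5 bar
LENGTH = 0.5        # m, side length of the part
POROSITY = 1 - 0.35 # assumed to be 1 - FVC

v = darcy_velocity(K, ETA, DP, LENGTH)
t_fill = fill_time_1d(K, ETA, DP, LENGTH, POROSITY)
```

Under these assumptions, the velocity at the end of the filling is roughly 1.5 cm/s and the undisturbed part fills in about 11 s. The formula also shows why control authority shrinks over time: the pressure gradient, and thus the velocity, decreases as the flow front moves away from the inlets.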
RTMSim applies a finite area method to solve the resin flow
on the whole part, which introduces the need for temporal and
spatial discretization. While the simulation requires comparably
small time steps to yield numerically stable solutions, we
chose a much larger step size of 0.5 s, i.e. a frequency of 2 Hz,
for the RL cycle. This reduces the computational burden,
which is necessary to train agents efficiently. Therefore,
a multitude of simulation steps is executed per RL step. In the
spatial domain, we use a mesh that consists of 1878 triangular
elements and models a planar square part, as depicted
in Fig. 2. The part has a side length of 50 cm and a thickness
of 0.5 cm. Three equally wide and independently controllable
resin inlets are placed on the left side. In order to simulate
irregularities in the textile preform and provoke perturbations
of the flow front, inserts are placed on the part. An insert
is an area whose fiber volume content (FVC), and thus its
permeability, deviates from the standard value, which we
chose to be 35 %, corresponding to K = 1.464 × 10⁻⁹ m²
in the basic textile. In related work, it has become common to
use rectangular inserts to provoke and analyze perturbations of
the flow front [3], [14]. The position, the dimension, and the
degree of FVC deviation determine how hard the control task
is. During training, the placement is drawn from a random
distribution, which is subject to certain constraints in each
experiment that will be explained in Section IV-E.
Fig. 2. (Panels "Inlet Cells" and "FVC Values"; axes: length in cm and width in cm, each from 0 to 50.) Left: Three different inlet cell groups that can be actuated independently are depicted in different colors. Right: FVC contents in different areas of the preform. Patch with higher FVC.
Special requirements that differ from what commonly used simulation
tools such as PAM RTM [24] offer are the possibilities to
change the injection pressure and to obtain the state of the
simulation at any point in simulated time. This ability is crucial
for online control and needs to be possible without terminating
the executing process to achieve high efficiency and stability.
We created a lightweight program by stripping down the
implementation of RTMSim to the specific case we need to
simulate. This enabled us to run multiple parallel instances of
the simulation on a multi-node compute cluster. Distributed across
9 servers, our architecture can provide 279 virtual instances of
the RTM process for an agent to train with, while using one
additional server for mid-training validation. A training run
consists of 2,000,000 steps, which took on average 5 hours
and 29 minutes, including validation. This equals 60,000 to
100,000 filling cycles for agents to gain experience from,
depending on the average episode length of each experiment.
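The relation between these figures can be checked with a few lines of arithmetic. The sketch assumes that the 2,000,000 steps count transitions summed over all parallel environments; the function and variable names are illustrative.

```python
TOTAL_STEPS = 2_000_000            # environment steps per training run
N_ENVS = 279                       # parallel simulation instances
WALL_CLOCK_S = 5 * 3600 + 29 * 60  # 5 h 29 min, including validation


def mean_episode_length(total_steps, n_episodes):
    """Average number of RL steps per filling cycle."""
    return total_steps / n_episodes


longest = mean_episode_length(TOTAL_STEPS, 60_000)    # fewest episodes
shortest = mean_episode_length(TOTAL_STEPS, 100_000)  # most episodes
steps_per_env = TOTAL_STEPS / N_ENVS
throughput = TOTAL_STEPS / WALL_CLOCK_S               # steps per second
```

Under this assumption, the stated range of 60,000 to 100,000 filling cycles corresponds to average episode lengths between 20 and about 33 steps, each environment contributes roughly 7,200 steps, and the whole setup processes on the order of 100 RL steps per second.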
B. RL Controller
The agent interacts with all parallel environments in a
synchronous manner; the collected experiences are batched and
used to update the networks via stochastic gradient descent. The interaction
with every single environment is similar to how an agent would
be integrated into the RTM process as a controller, as
shown in Fig. 1. At each discrete step, the agent receives an
observation and chooses an action. The action consists of three
integer values that control the injection pressure at the three
resin inlets, each of which can be set to one of five discrete equidistant levels
between 0.1 and 5 bar.
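A minimal sketch of how such an action could be translated into inlet pressures is given below. It assumes that each integer is a level index from 0 to 4 and that both end points, 0.1 bar and 5 bar, are included; the names are hypothetical.

```python
P_MIN_BAR = 0.1
P_MAX_BAR = 5.0
N_LEVELS = 5
N_INLETS = 3


def action_to_pressures(action):
    """Map three level indices (0..4) to injection pressures in bar."""
    if len(action) != N_INLETS:
        raise ValueError("expected one level index per inlet")
    step = (P_MAX_BAR - P_MIN_BAR) / (N_LEVELS - 1)
    pressures = []
    for level in action:
        if level not in range(N_LEVELS):
            raise ValueError(f"invalid pressure level: {level}")
        pressures.append(P_MIN_BAR + level * step)
    return pressures


levels = action_to_pressures((0, 1, 2)) + action_to_pressures((3, 4, 4))[:2]
n_joint_actions = N_LEVELS ** N_INLETS
```

With these assumptions, the five levels are 0.1, 1.325, 2.55, 3.775 and 5 bar, and the three inlets together span 125 joint actions.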
We experimented with three different observation spaces to
evaluate which physical quantities yield the most value for
the agent. The considered quantities are the filling state, the
preform FVC and the pressure inside the tool. The filling
state represents the spreading of the resin, which is to be
optimized and therefore included. The flow speed and thus
the flow front is, as stated by Darcy’s law, influenced by
the permeability and the pressure gradient. We substitute the
permeability information with the FVC of the textile, as it is
usually easier to obtain in real-world scenarios and, according
to our assumptions regarding the textile, the only factor influ-
encing the permeability. Instead of a pressure gradient map,
we provide the agents with a simple map of the pressure inside
the mold, which is, again, more realistic to measure in a real
process. In our simulation, all of those quantities are observed
as 50 × 50 pixel grayscale images that display their spatial
distribution. While the FVC remains constant throughout one
filling cycle, the filling state and pressure evolve with time.
The simplest observation space, containing only the flow front
image, is from now on called Ff; adding the FVC map gives
the observation space FfFvc, and adding the pressure image
leads to FfFvcP.
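The three observation spaces can be thought of as image stacks with one, two, or three channels. The sketch below assumes that the maps are stacked along a channel axis, which is a common convention for convolutional policies but is not stated explicitly in the text; the dictionary keys and the function name are hypothetical.

```python
IMAGE_SIZE = 50

# Which physical quantities each observation space contains, in channel order.
OBSERVATION_SPACES = {
    "Ff": ("flow_front",),
    "FfFvc": ("flow_front", "fvc"),
    "FfFvcP": ("flow_front", "fvc", "pressure"),
}


def build_observation(space, maps):
    """Stack the 50 x 50 maps required by `space` into a list of channels."""
    channels = [maps[name] for name in OBSERVATION_SPACES[space]]
    for channel in channels:
        if len(channel) != IMAGE_SIZE or any(len(row) != IMAGE_SIZE for row in channel):
            raise ValueError("every map must be a 50 x 50 image")
    return channels


def blank_image(value=0.0):
    return [[value] * IMAGE_SIZE for _ in range(IMAGE_SIZE)]


maps = {"flow_front": blank_image(), "fvc": blank_image(0.35), "pressure": blank_image()}
shapes = {name: len(build_observation(name, maps)) for name in OBSERVATION_SPACES}
```

Keeping the channel order fixed matters because the first convolutional layer learns separate filters for each channel.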
A filling cycle consists of a series of interactions and ends
when one terminal condition is fulfilled. This can be either the
occurrence of a dry spot or the complete filling of the part,
each of which will trigger a specific reward signal contributing
to the reward function described in Section IV-C. A filling
cycle definitely terminates in finite time, rendering this case
an episodic problem in terms of RL. During training, finished
simulation instances are automatically reset with randomized
initial conditions concerning the preform FVC and thus the
permeability, as described in section IV-A.
C. Reward Function
We designed a reward function that depends on the flow
front image only. The main purpose of our reward function
(Eq. (2)) is to reward the complete filling of the part and
punish the occurrence of dry spots.
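$$ r(o) = a \cdot \frac{\mathit{filled}}{n_\mathrm{steps}} - b \cdot \mathit{dryspot} - \frac{c}{h \cdot (R-2)} \sum_{i=1}^{h} \sum_{j=1}^{R-2} (R - j) \cdot (1 - o_{i,j})^2 \qquad (2) $$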
filled and dryspot each take a value of either 0 or 1 and indicate whether
the respective event has occurred. Both can only be triggered
in terminal states because the environments automatically reset
in either case. The input to the reward function is the flow front
image o, while the notation o_{i,j} refers to the pixel in the i-th
row and j-th column. h is the height of o, which is 50 in our
experiments, and R is the column index of the rightmost pixel
that has been reached by the resin. n_steps counts the number
of steps elapsed since the beginning of the episode, and the
weighting factors have been chosen as a = 3000, b = 100 and
c = 10 in prior experiments. In an episode of usual length,
when n_steps is around 30, the choice of a and b causes the
first two terms to be of the same magnitude.
Apart from the sparse reward mechanism, which only yields
a signal - either positive or negative - in terminal states, we
added an auxiliary goal to guide the agent to the desired
behavior. This introduces the third term, which measures the
flow front uniformity. In our setup, the optimal flow front
would be an orthogonal line moving from left to right. When
evaluating the reward function, we place this target flow front
at R. Then the mean squared error between oand this target,
weighted with each pixel’s distance to R, is calculated, with
the exception that the two columns nearest to the target line
are excluded. Thereby, small irregularities have a comparably
small impact on the reward signal, whereas deeper bulges are
weighted quadratically higher. This motivates the agent to keep
the flow front as even as possible, which is a good step toward
preventing dry spots.
Two mechanisms implicitly add the incentive to finish
episodes quickly. By weighting filled inversely with nsteps,
the agent receives a higher reward at the end of short episodes.
Because the deviation from the target flow front is weighted
negatively, the reward signal is negative for all states, except
for possibly a terminal one. In order to maximize the rewards
accumulated over an episode, the agent should finish the
episode in as few steps as possible, while still seeking to fill
the part completely by avoiding dry spots.
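The structure of this reward can be sketched in a few lines. The code below follows the verbal description: a completion bonus a · filled / n_steps, a dry spot penalty b · dryspot, and a uniformity penalty weighted by c. The exact normalization of the uniformity term is an assumption here (a mean over all pixels left of the target line, excluding the two nearest columns), as are the 0-based indexing and the function name.

```python
def reward(o, filled, dryspot, n_steps, a=3000.0, b=100.0, c=10.0):
    """Three-term reward for a flow front image `o` (rows of values in [0, 1])."""
    h = len(o)
    # R: column index of the rightmost pixel reached by the resin (0-based).
    reached = [j for row in o for j, value in enumerate(row) if value > 0]
    R = max(reached) if reached else 0

    # Sparse terminal terms: bonus for complete filling, penalty for a dry spot.
    terminal = a * filled / max(n_steps, 1) - b * dryspot

    # Uniformity term: squared error to a fully filled region left of the
    # target line at R, weighted by the distance to R. The two columns nearest
    # to the line (distance 0 and 1) are skipped.
    columns = range(0, R - 1)
    n_pixels = h * len(columns)
    if n_pixels == 0:
        return terminal
    weighted_error = sum(
        (R - j) * (1.0 - o[i][j]) ** 2 for i in range(h) for j in columns
    )
    return terminal - c * weighted_error / n_pixels


# Toy 4 x 6 images: a straight front, and one row lagging four pixels behind.
straight = [[1.0] * 6 for _ in range(4)]
lagging = [[1.0, 1.0, 0.0, 0.0, 0.0, 0.0]] + [[1.0] * 6 for _ in range(3)]

r_success = reward(straight, filled=1, dryspot=0, n_steps=30)
r_lagging = reward(lagging, filled=0, dryspot=0, n_steps=10)
```

With a = 3000 and an episode of 30 steps, the completion bonus equals 100 and therefore matches the magnitude of the dry spot penalty b. In the lagging example, only the pixels at distance 2 and 3 from the target line contribute, and the deeper one is weighted more strongly.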
D. RL Parameters
We used the package stable-baselines3 [25], which, inter
alia, provides implementations of A2C and PPO with pre-tuned
hyperparameters¹. We adjusted the parameter n_steps
to 20, which sets the number of steps to run in each environment per policy
update. This improved the agents’ performance, while for the
other parameters, no changes were found to be beneficial.
Regarding the neural network trained by the algorithms, we
used the same architecture in all experiments. Parts of the
network are shared between the policy and the value function.
The shared part consists of three convolutional layers followed
by a feed forward layer of width 128. The first convolutional
layer uses 32 kernels of size 8 × 8 with stride 4, the second 64
kernels of size 4 × 4 with stride 2, and the third 64 kernels of
size 3 × 3 with stride 1. This convolutional network architecture
was used by Mnih et al. [20] in an influential work on the
application of DRL and an implementation is provided by
stable-baselines3. Next, the network splits into two heads, each
containing a feed forward layer of width 32. The output of the
policy contains three values and represents the action of the
agent, while the value network predicts one value. In all layers,
ReLU [26] is used as the activation function and Adam [27] as the optimizer.
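To see how this architecture fits the 50 × 50 pixel observations, the following sketch traces the spatial size of the feature maps through the three convolutional layers. It assumes unpadded convolutions, for which the output size is floor((size − kernel) / stride) + 1; the helper name is hypothetical.

```python
def conv_output_size(size, kernel, stride):
    """Spatial output size of an unpadded convolution."""
    return (size - kernel) // stride + 1


# (number of kernels, kernel size, stride) of the three convolutional layers.
CONV_LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

size = 50
sizes = []
for _, kernel, stride in CONV_LAYERS:
    size = conv_output_size(size, kernel, stride)
    sizes.append(size)

# Number of features entering the shared feed forward layer of width 128.
flattened = CONV_LAYERS[-1][0] * size * size
```

Under this assumption, the feature maps shrink from 50 to 11, 4 and finally 2 pixels per side, so the shared feed forward layer receives 64 · 2 · 2 = 256 features regardless of how many input channels the observation space has.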
E. Series of experiments
Two series of experiments are presented in this paper. Each
series contains six experiments resulting from combining the
three possible observation spaces with the two considered
algorithms. During training, the insert parameters are drawn
from experiment-specific random distributions, which will be
explained in this section. To be able to compare agents within
one series of experiments, we evaluate them on test sets of 100
parts that were created according to the same distribution as
used in training.
¹ The standard hyperparameters of A2C: https://stable-baselines3. and PPO: https:
Table I
                          Slight Pert.       Strong Pert.
# and shape of insert     1 Rect.            1 Rect.
Height of insert in cm    21 ± 1             15 ± 1
Width of insert in cm     16 ± 1             15 ± 1
FVC preform in %          35                 35
Perm. preform in m²       1.464 × 10⁻⁹       1.464 × 10⁻⁹
FVC patch in %            42                 45
Perm. patch in m²         3.969 × 10⁻¹⁰      2.268 × 10⁻¹⁰
% dry spots               0                  54
In the first series of experiments, inserts are placed and varied
in such a way that slight perturbations of the flow front occur
when injecting the fluid with steady pressure. Thus, the flow
front lags behind significantly when it passes an insert, but no
dry spots are formed. Rectangular inserts of height 21 ± 1 cm
and width 16 ± 1 cm are used for this purpose. These are placed
randomly on the component with the restriction that a strip of
5 cm width is excluded from each of the left and right edges.
The FVC of the insert is 42 %, which equals a permeability of
3.969 × 10⁻¹⁰ m². The setup of both experimental campaigns,
especially their differences, is presented in Table I. The results
are presented in Section V-A.
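The placement rule for this first series could be sampled as in the following sketch. The text only gives the size ranges and the excluded 5 cm strips; the uniform distributions, the requirement that the insert lies completely inside the part, the unrestricted vertical position, and all names are assumptions made for illustration.

```python
import random

PART_SIDE_CM = 50.0
EDGE_STRIP_CM = 5.0  # excluded strip at the left and right edges


def sample_slight_insert(rng):
    """Draw one rectangular insert for the slight perturbation series."""
    height = rng.uniform(20.0, 22.0)  # 21 +/- 1 cm
    width = rng.uniform(15.0, 17.0)   # 16 +/- 1 cm
    # Horizontal position: keep the whole insert outside the edge strips.
    x0 = rng.uniform(EDGE_STRIP_CM, PART_SIDE_CM - EDGE_STRIP_CM - width)
    # Vertical position: anywhere, as long as the insert stays on the part.
    y0 = rng.uniform(0.0, PART_SIDE_CM - height)
    return {"x0": x0, "y0": y0, "width": width, "height": height, "fvc": 0.42}


rng = random.Random(0)  # fixed seed so that a test set can be reproduced
test_set = [sample_slight_insert(rng) for _ in range(100)]
```

Drawing the test set from the same sampler with a fixed seed mirrors the evaluation described above, where 100 parts are created according to the training distribution.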
In the second series of experiments, strong perturbations, up
to the formation of dry spots, were provoked. In preliminary
tests, agents tended to have greater control of the flow front in
the first third of the part than in areas far from the resin inlets.
This can be explained by Darcy’s law, according to which the
flow velocity is proportional to the pressure gradient. Close
to the inlets, this can be strong and immediately changed
by applying different pressure values, while the influence
decreases when the flow front is farther away. Therefore, the
inserts are placed only in the left third of the part to test
whether an RL agent is in principle able to prevent the
formation of dry spots. The inserts are of square shape and
have a side length of 15 ± 1 cm. The inserts have an FVC of
45 % and a permeability of 2.268 × 10⁻¹⁰ m². They were
placed 5 cm from the inlets and 15 cm from the outlets,
since perturbations closer to the inlets made it possible to still
change the flow front sufficiently. Additionally, a 5 cm margin
is applied to the upper and lower edges to avoid cases where
a dry spot touches a border.
An uncontrolled injection process leads to strong irregularities
of the flow front as soon as it reaches an insert. In 54 %
of the cases within the test set, this ultimately leads to the
formation of a dry spot, causing the filling cycle to be prematurely
terminated. It is noteworthy that the difference between slight
and strong perturbations is only 3 percentage points in FVC. The amount of
change in FVC necessary to provoke enclosures of dry textile,
i.e. dry spots, was determined experimentally. The results of
these experiments are described in Section V-B.
In the following section, the results of these two series of
experiments are presented to assess whether RL can provide
an advantage in controlling the RTM process.
Table II
Algorithm / Observation Space    Mean Reward per Episode    Mean Length
Uncontrolled                     -107.8                     29.52
PPO/Ff                           -51.2                      30.59
A2C/Ff                           -110.2                     29.77
PPO/FfFvc                        -23.2                      31.53
A2C/FfFvc                        -126.5                     29.96
PPO/FfFvcP                       -22.2                      32.84
A2C/FfFvcP                       -76.7                      32.55
A common metric in RL applications is the average accumu-
lated reward per episode [6]. Another measure is the average
number of steps per episode, i.e., how long a filling cycle
takes on average. The duration in seconds is given by half
the number of steps. From an economic point of view, it is
desirable to achieve the shortest possible cycle time of the
RTM process. However, very short episodes can also mean
that filling cycles were aborted early because a dry spot was
detected. Therefore, when considering the episode length, the
specific data set must be analyzed to determine whether and
how often dry spots occur. In such cases, the adjusted average
episode length can be used, which considers only successful
filling cycles. In addition, if dry spots occur in the data set
used for evaluation, the rate of failed episodes can be used.
While this information is implicitly included in the cumulative
rewards, since the occurrence of a dry spot is penalized by
the reward function, the reward signals are also influenced by
other factors. Therefore, it may be advantageous to explicitly
calculate the dry spot rate, which, in this formulation, shall
be minimized. By comparing the strategies, it is possible
to evaluate which algorithm gives the best result in which
configuration, but there is no indication of whether an overall
advantage can be obtained for the control of the RTM process.
Therefore, the learned strategies are additionally compared
with a constant baseline strategy that applies the maximum
possible injection pressure of 5 bar to each gate in each step.
This corresponds to the uncontrolled version of the RTM
process commonly employed industrially [1].
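The metrics discussed above can be computed from per-episode records as in the following sketch. The record layout (keys `reward`, `length`, `dry_spot`) and the function names are assumptions; the conversion from steps to seconds uses the RL step size of 0.5 s.

```python
def evaluate(episodes, step_seconds=0.5):
    """Aggregate evaluation metrics over a list of episode records."""
    n = len(episodes)
    successful = [e for e in episodes if not e["dry_spot"]]
    adjusted_length = (
        sum(e["length"] for e in successful) / len(successful) if successful else None
    )
    return {
        "mean_reward": sum(e["reward"] for e in episodes) / n,
        "mean_length": sum(e["length"] for e in episodes) / n,
        "dry_spot_rate": (n - len(successful)) / n,
        # Only successful filling cycles count towards the adjusted length.
        "adjusted_mean_length": adjusted_length,
        "adjusted_mean_duration_s": (
            None if adjusted_length is None else adjusted_length * step_seconds
        ),
    }


def relative_reduction(baseline_rate, controlled_rate):
    """Relative decrease of the failure rate with respect to the baseline."""
    return (baseline_rate - controlled_rate) / baseline_rate


# Hypothetical records: two successful cycles and two aborted by a dry spot.
records = [
    {"reward": -20.0, "length": 30, "dry_spot": False},
    {"reward": -150.0, "length": 12, "dry_spot": True},
    {"reward": -30.0, "length": 32, "dry_spot": False},
    {"reward": -140.0, "length": 14, "dry_spot": True},
]
metrics = evaluate(records)
```

In this toy data set, the plain mean length of 22 steps looks attractive only because half of the cycles were aborted early; the adjusted mean length of 31 steps, or 15.5 s, reflects the successful cycles. Applied to the failure rates stated in the abstract, `relative_reduction(0.54, 0.27)` gives a reduction of 50 %.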
A. Slight perturbations
In this series of experiments, minor disturbances of the
flow front occur during an uncontrolled injection due to slight
perturbations of the preform permeability. Table II shows
the average rewards per episode and the average episode
lengths resulting from the different pairings of algorithms and
observation spaces.
For comparison, the values obtained by an uncontrolled
injection are given. Since no dry spots occur in this setup, the
rate of dry spots is not used as a metric. The average episode
length, on the other hand, is used as a quality measure of
an agent without restriction, since episodes are not terminated
prematurely and shorter episodes are desirable. Based on the
metrics shown in Table II on the test data set, it can be
seen that PPO is better suited for this use case than A2C.
While PPO agents achieve a higher average reward than the
uncontrolled injection process in all configurations, the FVC
map seems to be crucial. With access to this information, an
agent achieves a significantly higher reward, while adding the
pressure information provides only a marginal advantage. In
contrast, agents trained by A2C only manage to learn a strategy
that can achieve an advantage over the uncontrolled process
when having access to the full information. Furthermore, it is
noticeable that agents achieving higher rewards usually take
longer to complete an episode. To reach an even flow front,
the agent has to match the flow speed in higher permeability
areas to the lower speed that is possible when impregnating
the lower permeability inserts, which causes the filling cycle
to take longer in general. Conversely, agents behaving
similarly to the uncontrolled process are also nearly as fast.
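The three observation spaces compared here differ only in which maps the agent sees. As an illustration, the sketch below assembles them as named channels. It assumes that the flow front (Ff), the FVC map (Fvc), and the pressure field (P) are 2-D grids of equal shape; the function name and the grid representation are hypothetical and not taken from the paper's implementation.

```python
from typing import Dict, List, Optional

Grid = List[List[float]]  # assumption: each map is a 2-D grid of equal shape


def build_observation(flow_front: Grid,
                      fvc: Optional[Grid] = None,
                      pressure: Optional[Grid] = None) -> Dict[str, Grid]:
    """Collect the maps an agent may observe, keyed by the paper's abbreviations."""
    channels: Dict[str, Grid] = {"Ff": flow_front}
    if fvc is not None:
        channels["Fvc"] = fvc
    if pressure is not None:
        channels["P"] = pressure
    rows, cols = len(flow_front), len(flow_front[0])
    for name, grid in channels.items():
        if len(grid) != rows or any(len(row) != cols for row in grid):
            raise ValueError(f"channel {name} does not match the flow front shape")
    return channels


# Tiny hypothetical 2x2 maps, only to show the three configurations.
ff = [[1.0, 0.0], [1.0, 0.0]]
fvc_map = [[0.3, 0.5], [0.3, 0.3]]
p_field = [[5.0, 0.0], [5.0, 0.0]]
for obs in (build_observation(ff),
            build_observation(ff, fvc_map),
            build_observation(ff, fvc_map, p_field)):
    print("".join(obs))  # name of the observation space
```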
These observations are illustrated below by comparing two
differently trained agents. In Fig. 3, the behaviors of the
PPO/Ff agent and the PPO/FfFvcP agent are compared. For
this purpose, three snapshots of two filling processes, resulting
from applying the respective control strategies to the same test
case, are shown. The agent PPO/FfFvcP has an advantage due
to its knowledge of the local FVC since it can preemptively
steer against flow front disturbances that will happen in later
process stages. In the beginning, agent PPO/FfFvcP applies
the maximum possible value of 5 bar almost constantly to
gate 3, while gradually lower values are selected for gate 2
and gate 1. This is displayed in the action plot in Fig. 3. As
a result, the image taken after 1 s shows a curvature towards
the insert. This causes a negative reward signal, as can be
read from the reward plot. When passing the insert, the flow
front is slightly delayed and is subsequently almost straight.
In consequence, the reward signals received per step are
nearly zero and the cumulative rewards remain almost constant
until the positive signal for finishing the episode successfully
is triggered. Towards the end of the filling process, there
are clear fluctuations in the actions of agent PPO/FfFvcP,
but these have little influence on the flow front since it is
already far away from the resin inlets. Agent PPO/Ff, in
contrast, can only react to disturbances in the flow front that
have already occurred. Since the process reacts sluggishly to
changes in the injection pressure, especially when the flow
front is already more advanced, this agent can often achieve
only slight advantages over the uncontrolled process. Agent
PPO/Ff tries to create a straight flow front from the beginning
by applying uniform, high pressure values. However, a slight
disturbance can already be seen after 1 s, since the flow front
has reached the insert by then. The agent reacts to this by
slightly reducing the pressure at gate 1, which is diagonally
opposite to the disturbance. However, this does not prevent
the flow front from progressing significantly faster on the
upper half of the image. After achieving quite good rewards
in the beginning, the cumulative reward drops deeply due to
the irregularity of the flow front.
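The sluggish reaction to pressure changes at an advanced flow front can be illustrated with textbook one-dimensional Darcy flow. This is an idealization and not the finite-volume simulation used in this work: with inlet pressure p, zero pressure at the front, permeability K, porosity φ, and viscosity μ, the front at position x moves at v = K p / (φ μ x). All material values in the sketch below are assumed for illustration only.

```python
BAR_TO_PA = 1.0e5


def front_speed(pressure_bar: float, front_pos_m: float,
                permeability_m2: float = 1.0e-10,  # assumed value
                porosity: float = 0.5,             # assumed value
                viscosity_pa_s: float = 0.1) -> float:  # assumed value
    """Front speed of 1-D Darcy flow with zero pressure at the flow front."""
    if front_pos_m <= 0.0:
        raise ValueError("front position must be positive")
    return (permeability_m2 * pressure_bar * BAR_TO_PA) / (
        porosity * viscosity_pa_s * front_pos_m)


def speed_change(front_pos_m: float,
                 p_old_bar: float = 5.0, p_new_bar: float = 2.5) -> float:
    """Change in front speed when the inlet pressure is lowered."""
    return front_speed(p_new_bar, front_pos_m) - front_speed(p_old_bar, front_pos_m)


early = speed_change(0.05)  # front 5 cm from the inlet
late = speed_change(0.40)   # front 40 cm from the inlet
print(f"speed change near the inlet:     {early:.5f} m/s")
print(f"speed change far from the inlet: {late:.5f} m/s")
```

In this idealized model, the same pressure step changes the front speed eight times less at 40 cm than at 5 cm, which is consistent with the observation that late corrections have little effect.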
The comparison shows that agent PPO/FfFvcP is able to
act far-sighted due to its additional information and accepts
suboptimal results in the beginning in order to achieve better
ones in the long run. Agent PPO/Ff, on the other hand,
makes optimal decisions at the beginning, but since it does
not know where the insert is located, it falls behind in terms
of cumulative rewards after only a few steps.
Fig. 3. The comparison of the control strategies of agents PPO/Ff and PPO/FfFvcP, using three snapshots. The local FVC of the textile can be seen transparently in the flow front images, with the insert with lower FVC being darker. In addition, the evolution of the cumulative rewards up to the respective time point, as well as the evolution of the pressure values at the individual gates representing the agents’ actions, were visualized. (Plots: pressure in bar and accumulated rewards over time in s, for inlets 1 to 3 of each agent.)
In conclusion, it is possible for RL agents to learn strategies
that perform better than an uncontrolled injection in terms of
the reward function and in the presence of minor disturbances.
The more information is available to the agent, the better
its strategy. Especially the FVC map makes a big difference
because it allows the agent to steer preemptively against future disturbances.
B. Strong perturbations
In this series of experiments, the insert placement and
the level of permeability perturbation cause either strong
irregularities of the flow front or even result in the formation
of a dry spot. During an uncontrolled resin injection, 54% of
the test cases were terminated early due to the detection of
a dry spot. First, general observations from this experimental
campaign are discussed; subsequently, one example run is examined.
Table III lists the results obtained when training the six
pairings of algorithms and observation spaces. The metric of
average cumulative rewards per episode is of limited use. First,
the rewards obtained are generally lower than in the first exper-
iment, which can be explained by the stronger perturbations of
the flow front. Second, within this experiment, the PPO agents
can generate higher rewards than the A2C agents, but in doing
so there appears to be no relationship to the rate of dry spots.
Third, the A2C agents receive significantly lower rewards on
average, in particular even less than the uncontrolled injection.
Nonetheless, agent A2C/FfFvcP is the most successful in terms
of preventing dry spots.
Also, the mean length cannot be used unambiguously to
compare strategies. The uncontrolled process requires the
fewest steps per episode because over half of the
test cases are terminated prematurely, which usually occurs
between step 10 and step 15. The conclusion that longer
episodes indicate better agent behavior is also incorrect. Agent
A2C/FfFvc stands out, producing comparatively long episodes
but still having a dry spot rate similar to that of, e.g., PPO/FfFvc,
whose episodes are on average about 9 steps shorter. Thus,
the episodes of A2C/FfFvc are not longer because it prevents
more dry spots and thus leads more episodes to successful
completion, but because the strategy of A2C/FfFvc chooses
comparatively low pressure values, causing the flow front to
progress more slowly and the episodes to last longer. This has
been found in graphical evaluations of the strategy but is
not shown here in figures for reasons of space and
relevance. Therefore, Table III reports the mean length
excluding episodes that were stopped early as the adjusted mean
length.
Due to the occurrence of dry spots, the dry spot rate is used
as a metric in this experiment. This has the advantage that
preventing dry spots corresponds to a real advantage, while
the rewards obtained are subject to the assumptions of the
reward function.
Table III
Algorithm / Observation Space   Reward per Episode   Mean Length   Dry Spot Rate   Adjusted Mean Length
Uncontrolled                    157.4                21.15         54%             24.91
PPO/Ff                          163.9                23.72         40%             29.27
A2C/Ff                          322.8                29.27         41%             35.76
PPO/FfFvc                       132.8                20.82         37%             24.95
A2C/FfFvc                       275.1                29.37         39%             35.61
PPO/FfFvcP                      138.5                22.67         34%             26.64
A2C/FfFvcP                      221.5                23.82         27%             26.82
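The difference between the mean length and the adjusted mean length can be made concrete with a short sketch. It assumes that each test episode is recorded with its number of steps and a flag for early termination by a dry spot; the `Episode` record and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    length: int     # number of control steps in the episode
    dry_spot: bool  # True if the episode was stopped early by a dry spot


def mean_length(episodes: Sequence[Episode]) -> float:
    """Average length over all episodes, including aborted ones."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.length for e in episodes) / len(episodes)


def adjusted_mean_length(episodes: Sequence[Episode]) -> float:
    """Average length over episodes that were not stopped early."""
    completed = [e.length for e in episodes if not e.dry_spot]
    if not completed:
        raise ValueError("no completed episodes")
    return sum(completed) / len(completed)


# Hypothetical test set: two aborted and two completed episodes.
episodes = [Episode(12, True), Episode(25, False),
            Episode(14, True), Episode(27, False)]
print(mean_length(episodes))           # 19.5
print(adjusted_mean_length(episodes))  # 26.0
```

Aborted episodes pull the plain mean down, which is why a process with many dry spots can look fast; the adjusted value compares only fills that ran to completion.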
All agents achieve a better dry spot rate than the uncon-
trolled process. The trend is that, as in the first series of
experiments, agents with more information can achieve better
results, which means lower dry spot rates in this case. In
contrast to the first series of experiments, the best agents
do not show major disadvantages in terms of filling speed,
yet the uncontrolled process is still the fastest. This becomes
apparent when considering the dry spot rate and the adjusted
mean length together, which shows that agents that can prevent
the most dry spots still have a comparatively high filling
speed. There is no clear trend between the two algorithms;
notably, an agent trained by A2C performs best.
Fig. 4 shows a comparison of the best agent A2C/FfFvcP
and the uncontrolled injection, in which the latter leads to the
formation of a dry spot.
The actions of the uncontrolled agent are plotted in the
action graph for comparison, with the maximum value of
5 bar applied to all inlets at all times. The learned strategy
also behaves almost constantly. At inlet 2, which is on the
horizontal line of the insert, 5 bar is applied, while the value
at the two outer inlets is reduced to half, i.e., 2.5 bar. Due to the
overall lower injection pressure, the regulated flow front moves
slightly slower than the uncontrolled one. In the snapshots,
both flow fronts are approximately equally advanced, although
the paired snapshots are 1.5 s apart. The uncontrolled flow front
advances rapidly at the edges of the component so that the part
delayed by the lower permeability of the insert lags behind.
As soon as the advancing arms of the flow front have passed
the insert, they close up again because the textile has a higher
permeability there and is penetrated more quickly. The delayed
region of the flow front is so far behind that the insert is not
completely impregnated before the flow front closes behind
it. Thus, air entrapment occurs and the episode is aborted.
This can also be seen in the reward graph in Fig. 4, as a
strong negative reward signal is triggered after 6.5 s. The
controlled flow front moves slower, especially at the edges
of the component. Although a slight advance on both sides of
the insert cannot be prevented, the resin has almost completely
penetrated the insert when the two arms merge, as can be seen
in the snapshot after 6.5 s. Thus, a dry spot does not form
and the filling process can be completed.
Fig. 4. Snapshots from a filling cycle with strong perturbations controlled by A2C/FfFvcP. An uncontrolled injection (Comp.) is depicted as a comparison. (Plots: pressure in bar and accumulated rewards over time in s, for inlets 1 to 3 of each process; snapshots at t = 3.5 s, 5 s, 6.5 s, and 8 s.)
In this experiment, it could be shown that RL algorithms
can learn control strategies that reduce the dry spot rate of the
simulated RTM process. In doing so, they do not necessarily
have a disadvantage in filling speed. When provided with
more information, the agents can preemptively steer against
perturbations and thus achieve better results.
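The size of the reduction can be checked directly from the dry spot rates in Table III. The sketch below computes the relative reduction of each agent against the uncontrolled process; the rates are copied from the table, while the function and variable names are illustrative.

```python
# Dry spot rates in percent, copied from Table III.
DRY_SPOT_RATE_PCT = {
    "Uncontrolled": 54,
    "PPO/Ff": 40,
    "A2C/Ff": 41,
    "PPO/FfFvc": 37,
    "A2C/FfFvc": 39,
    "PPO/FfFvcP": 34,
    "A2C/FfFvcP": 27,
}


def relative_reduction(baseline_pct: float, rate_pct: float) -> float:
    """Share of the baseline dry spot rate that a strategy removes."""
    if baseline_pct <= 0:
        raise ValueError("baseline rate must be positive")
    return (baseline_pct - rate_pct) / baseline_pct


baseline = DRY_SPOT_RATE_PCT["Uncontrolled"]
agents = {k: v for k, v in DRY_SPOT_RATE_PCT.items() if k != "Uncontrolled"}
best_agent = min(agents, key=agents.get)

for name, rate in agents.items():
    print(f"{name:12s} {relative_reduction(baseline, rate):.0%}")
print("best:", best_agent)
```

For the best agent, A2C/FfFvcP, the rate drops from 54% to 27%, which is a relative reduction of one half.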
In this work, we showed that RL for the RTM process is possible
and yields better results than an uncontrolled or steadily
parameterized process. Another advantage of the presented
approach is that it does not need a mathematical model of the
process, unlike MPC. Through massively parallel
computation, we were able to train our RL models within
acceptable time limits. We adapted and used a finite-volume
simulation that had all the necessary properties for
RL, such as interrupting and re-parameterizing processes on
the fly, which is not the case for most commercial simulation
software. Our approach is currently constrained to a subset of
RTM processes but can be adjusted to other setups. For that,
the simulation of the process needs to be adapted and possibly
the reward function needs to be adjusted, depending on the
form and other properties of the product. For an application
in the real world, with an RTM machine, several steps would
need to be taken. To produce a component of the same shape
and size as discussed in this paper, a machine with
inlet gates that can be adjusted during the process would be
necessary, as well as a monitoring system that shows the flow
front (e.g., as shown by Stieber et al. [14]) and the pressure
field of the process. Another way to apply this method to a real-world
process would be to use a Vacuum Assisted Resin Infusion
(VARI) process, which usually works with a transparent vacuum
bag as the top half of the mold. Here, the flow front is always
visible and thus need not be obtained through sensors. For
the pressure field, pressure sensors would be necessary in both
cases. After determining the process to use, a model trained
with a matching simulation could be used with our method and
then be re-trained on the real process, making this a Sim-to-Real
Transfer Learning [28] approach. Reshaping the reward
function would be necessary for most new applications of
this method, e.g., for other geometries with desired flow fronts
of a different shape. Additionally, effects such as race-tracking
[16], [29], which occur in real-world scenarios, need to be
considered to adjust the flow-front part of the reward function
for real processes. Another aspect that could be addressed in
future work is the use of additional actuators, such as vents,
which could yield even better results through a wider range
of actuation of the process.
[1] Handbuch Faserverbundkunststoffe/Composites: Grundlagen, Verarbeitung, Anwendungen, 4th ed., ser. Springer eBook Collection Computer Science and Engineering. Wiesbaden: Springer Vieweg, 2013.
[2] David A. Babb, W. Frank Richey, Katherine Clement, Edward R. Peterson, Alvin P. Kennedy, Zdravko Jezic, Larry D. Bratton, Eckel Lan, and Donald J. Perettie, “Resin transfer molding process for composites,” Patent US5 730 922A.
[3] K.-H. Wang, Y.-C. Chuang, T.-H. Chiu, and Y. Yao, “Flow pattern control in resin transfer molding using a model predictive control strategy,” Polymer Engineering & Science, vol. 58, no. 9, pp. 1659–1665, 2018.
[4] H. H. Demirci and J. P. Coulter, “Neural network based control of molding processes,” Journal of Materials Processing and Manufacturing Science, vol. 2, no. 3, pp. 335–354, 1994.
[5] D. Nielsen and R. Pitchumani, “Real time model-predictive control of preform permeation in liquid composite molding processes,” in Proceedings of NHTC’00, 2000. [Online]. Available: htm files/c0003.pdf
[6] R. S. Sutton and A. G. Barto, “Reinforcement Learning: An Introduction,” IEEE Transactions on Neural Networks, vol. 9, no. 5, p. 1054,
[7] M. Szarski and S. Chauhan, “Instant flow distribution network optimization in liquid composite molding using deep reinforcement learning,” Journal of Intelligent Manufacturing, vol. 34, no. 1, pp. 197–218, 2023.
[8] John A. Nelder and Roger Mead, “A simplex method for function minimization,” Computer Journal, vol. 7, pp. 308–313, 1965.
[9] H. H. Demirci and J. P. Coulter, “A comparative study of nonlinear optimization and Taguchi methods applied to the intelligent control of manufacturing processes,” Journal of Intelligent Manufacturing, vol. 7, no. 1, pp. 23–38, 1996.
[10] D. R. Nielsen and R. Pitchumani, “Control of flow in resin transfer molding with real-time preform permeability estimation,” Polymer Composites, vol. 23, no. 6, pp. 1087–1110, 2002.
[11] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, no. 4598, pp. 671–680, 1983.
[12] D. Nielsen and R. Pitchumani, “Closed-loop flow control in resin transfer molding using real-time numerical process simulations,” Composites Science and Technology, vol. 62, no. 2, pp. 283–298, 2002.
[13] S. Stieber, A. Hoffmann, A. Schiendorfer, W. Reif, M. Beyrle, J. Faber, M. Richter, and M. Sause, “Towards Real-time Process Monitoring and Machine Learning for Manufacturing Composite Structures,” in 2020 25th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), vol. 1, 2020, pp. 1455–1458.
[14] S. Stieber, N. Schröter, A. Schiendorfer, A. Hoffmann, and W. Reif, “FlowFrontNet: Improving Carbon Composite Manufacturing with CNNs,” in Machine Learning and Knowledge Discovery in Databases: Applied Data Science Track, ser. Lecture Notes in Computer Science, Y. Dong, D. Mladenić, and C. Saunders, Eds. Cham: Springer International Publishing, 2021, vol. 12460, pp. 411–426.
[15] S. Stieber, N. Schröter, E. Fauster, A. Schiendorfer, and W. Reif, “PermeabilityNets: Comparing Neural Network Architectures on a Sequence-to-Instance Task in CFRP Manufacturing,” in 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 2021, pp. 694–697. [Online]. Available: https:
[16] S. Stieber, N. Schröter, E. Fauster, M. Bender, A. Schiendorfer, and W. Reif, “Inferring material properties from CFRP processes via Sim-to-Real learning,” 2022.
[17] C. González and J. Fernández-León, “A Machine Learning Model to Detect Flow Disturbances during Manufacturing of Composites by Liquid Moulding,” Journal of Composites Science, vol. 4, no. 2, p. 71,
[18] H. Grössing, N. Stadlmajer, E. Fauster, M. Fleischmann, and R. Schledjewski, “Flow front advancement during composite processing: predictions from numerical filling simulation tools in comparison with real-world experiments,” Polymer Composites, vol. 37, no. 9, pp. 2782–2793,
[19] V. Konda and J. Tsitsiklis, “Actor-Critic Algorithms,” in Advances in Neural Information Processing Systems, S. Solla, T. Leen, and K. Müller, Eds., vol. 12. MIT Press, 1999. [Online]. Available:
[20] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous Methods for Deep Reinforcement Learning.” [Online]. Available:
[21] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal Policy Optimization Algorithms.”
[22] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust Region Policy Optimization.” [Online]. Available:
[23] C. Obertscheider and E. Fauster, “Rtmsim - a julia module for filling simulations in resin transfer moulding,” obertscheiderfhwn/RTMsim, 2022.
[24] ESI Group, “Composites Simulation Software,” 01.08.2022. [Online].
[25] Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann, “Stable-Baselines3: Reliable Reinforcement Learning Implementations,” Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021. [Online]. Available:
[26] A. F. Agarap, “Deep Learning using Rectified Linear Units (ReLU).” [Online]. Available:
[27] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization.”
[28] S. Stieber, “Transfer Learning for Optimization of Carbon Fiber Reinforced Polymer Production,” Organic Computing: Doctoral Dissertation Colloquium 2018, pp. 1–12, 2018. [Online]. Available: QPtHaPsUDKnmVMqOK9OVu0- 9jas#v=onepage&q&f=false
[29] S. Bickerton and S. G. Advani, “Characterization and modeling of race-tracking in liquid composite molding processes,” Composites Science and Technology, vol. 59, no. 15, pp. 2215–2229, 1999.