
Iterative Backpropagation Disturbance Observer with Forward Dynamics Model

Takayuki Murooka#1, Masashi Hamaya2, Felix von Drigalski2, Kazutoshi Tanaka2, and Yoshihisa Ijiri2,3

Abstract— The Disturbance Observer (DOB) has been widely used in robotic applications to eliminate various kinds of disturbances. Recently, learning-based DOBs have attracted significant attention because they can deal with complex robotic systems. In this study, we propose the Iterative Backpropagation Disturbance Observer (IB-DOB). IB-DOB learns the forward model with a neural network and calculates disturbances via iterative backpropagations, which behaves like the inverse model. Our method not only improves estimation performance owing to the iterative calculation but can also be applied to both model-free and model-based learning control. We conducted experiments on two manipulation tasks: the cart pole with Deep Deterministic Policy Gradient (DDPG) and the object-pushing task with Deep Model Predictive Control (DeepMPC). Our method demonstrated better task performance than the baselines without DOB and with a DOB using a learned inverse model, even when disturbances of external forces and model errors were applied.

I. INTRODUCTION

For real-world robotic applications in uncertain environments containing unexpected disturbances, robust control methods have been studied for the last four decades [1]. The Disturbance Observer (DOB) is one of the most popular robust control methods owing to its simplicity [1], [2]. It has also been extended to deal with nonlinear systems [3], resulting in wide application to robotic scenarios. However, most of these methods require identification of dynamics models manually designed as rational polynomials. As real-world environments contain many unmodeled effects, the existing methods may suffer from modeling errors.

For robotic manipulation, which involves complex dynamics, learning approaches have attracted significant attention because of the high expressiveness of deep neural networks [4]. We are interested in combining DOB with a learning-based approach to obtain a structure that is both simple and highly expressive. Several previous approaches have learned inverse dynamics models with neural networks to estimate the disturbances [5], [6].

Meanwhile, in this study, we propose the Iterative Backpropagation Disturbance Observer (IB-DOB), which approximates the forward dynamics model with a deep neural network, as shown in Fig. 1.

#Work done at OMRON SINIC X Corp. as a part of an internship.
1Takayuki Murooka is with the Department of Mechano-Informatics, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan. t-murooka@jsk.imi.i.u-tokyo.ac.jp
2Masashi Hamaya, Felix von Drigalski, Kazutoshi Tanaka, and Yoshihisa Ijiri are with OMRON SINIC X Corporation, Hongo 5-24-5, Bunkyo-ku, Tokyo, Japan. {masashi.hamaya, f.drigalski, kazutoshi.tanaka, yoshihisa.ijiri}@sinicx.com
3Yoshihisa Ijiri is with OMRON Corporation, Konan 2-3-13, Minato-ku, Tokyo, Japan. yoshihisa.ijiri@omron.com

Fig. 1. Overview of the proposed IB-DOB method, which compensates for disturbances such as model errors, external forces, and the sim-to-real gap. We propose using a forward model learned with neural networks to estimate disturbances.

IB-DOB estimates disturbances from the errors between the predicted and observed states with an optimization-based method using iterative backpropagations of the forward model. This idea is motivated by recent learning-based manipulation approaches [7], [8], [9] that use iterative backpropagations to obtain optimal actions. We apply this algorithm to disturbance estimation.

Compared to previous methods using learned inverse dynamics models, our method can be more practical for the following two reasons. First, as our method estimates the disturbance via iterative calculation, its estimation errors can be smaller than those of the existing methods, which perform a single calculation with the inverse model. Second, although our method can be applied to both model-free and model-based learning control, it is especially useful in model-based learning settings because the forward model can be shared between disturbance estimation and behavior prediction.

We applied IB-DOB to two learning-based approaches: Deep Deterministic Policy Gradient (DDPG) [10], one of the most popular off-policy model-free deep reinforcement learning algorithms, and Deep Model Predictive Control (DeepMPC) [7], a supervised-learning-based model predictive control method that has been applied to several robotic tasks [8], [9], [11].

We prepared several experimental environments with different kinds of disturbances to evaluate our method. First, we applied disturbances of external forces and model errors to cart-pole and object-pushing tasks in simulation. In addition, for the pushing task, we applied the model learned in simulation to a real-world environment, which introduces model errors. Note that modeling errors can also be considered disturbances [12]. In summary, the main contributions of this study are as follows.

• We propose a novel learning-based method named IB-DOB. It approximates the forward dynamics model with a deep neural network and estimates disturbances from the errors between the predicted and observed states via iterative backpropagations.

• We performed manipulation experiments under various conditions with two types of environments and learning methods while applying disturbances. Results demonstrated that IB-DOB could successfully deal with disturbances such as external forces and model errors.

The remainder of this paper is organized as follows. Section II presents previous work related to our research. We introduce our proposed method in Section III. Section IV describes the experiments. Section V discusses our results, and Section VI provides a conclusion.

II. RELATED WORKS

A. Disturbance observer and its applications

DOB has been widely used in several fields, such as process and mechanical control, where uncertain disturbances such as external forces or model errors must be removed to achieve accurate control [2]. Many variants of DOB exist, such as nonlinear DOB [3] and optimization-based DOB [13], so that it can be applied to various kinds of dynamics models.

These kinds of DOBs have also been widely used in the field of robotics, as real-world robot models are complex and various types of modeling errors occur between the analytical and the actual models. Chen et al. applied a nonlinear DOB to the control of robotic manipulators and reported that it overcame the disadvantages of previous DOBs, which were designed or analyzed only for linear systems [14]. Eom et al. achieved force control of robotic manipulators without force sensors using a DOB [15]. They applied the DOB to force estimation, successfully estimated the external force, and showed that a DOB can be an alternative to force/torque sensors. Huang et al. proposed an adaptive controller for a nonholonomic wheeled mobile robot using a DOB for external disturbances and unknown parameters [16]. Other approaches have been used for a quadrotor [17] and for force estimation in cooperative robots [18], [19] or assistive robots [20].

Unlike these methods, which employed manually designed dynamics models, we aim to leverage learning-based dynamics models to express more complex dynamics systems.

B. Learning-based Disturbance Observer

To compensate for disturbances with unknown and unmodeled effects in complicated robotic applications, learning-based DOB approaches have attracted significant attention.

Li et al. proposed NNDOB, which represents the DOB with a neural network [5]. Its simplest form calculates the estimated disturbance $\hat{d}_t$ as follows:

$$\hat{u}_t = f^{-1}(x_t, x_{t+1}), \quad (1a)$$
$$\hat{d}_t = \hat{u}_t - u_t, \quad (1b)$$

where $f$ is the forward dynamics model ($x_{t+1} = f(x_t, u_t)$), $u_t$ is the action, $\hat{u}_t$ is the estimated action, and $x_t$ is the state.
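For concreteness, the following is a minimal PyTorch sketch of this direct inverse-model DOB (not code from [5] or [6]); `inv_model`, a learned approximation of $f^{-1}$, is an assumed module:

```python
import torch

def direct_dob(inv_model, x_t, x_t1, u_t):
    """Direct DOB (Eq. 1): a single inverse-model pass, no iteration.

    inv_model approximates f^-1, mapping (x_t, x_{t+1}) to the action
    that would have produced the observed transition.
    """
    with torch.no_grad():
        u_hat = inv_model(torch.cat([x_t, x_t1], dim=-1))  # Eq. (1a)
    return u_hat - u_t                                     # Eq. (1b): estimated disturbance
```

This single-pass estimate is the behavior that IB-DOB replaces with an iterative computation through the forward model.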

NNDOB has been extended in various ways; RBFNDOB was proposed to use a radial basis function network instead of a standard neural network [6]. Other learning-based DOBs approximated a part of the inverse model including the disturbance [21] or estimated the disturbance directly as an output of the network [22], [23], [24]. NNDOB and RBFNDOB were also extended to MNNDOB and MRBFNDOB, which can deal with multi-input multi-output systems [25], and these DOBs were empirically shown to outperform a model-based MIMO linear disturbance observer.

In this study, we leverage the forward model instead of the inverse model seen in previous studies.

III. IB-DOB: ITERATIVE BACKPROPAGATION DISTURBANCE OBSERVER

In this section, we explain IB-DOB, a learning-based DOB that utilizes the forward model, drawing on the DOB framework from control theory [2] and backpropagation-based action optimization methods [7], [8], [9].

A. Training of Forward Model

First, we formulate a network to represent the forward model of manipulation. The network input represents the current state and action, and the output represents the next state. Let $x_t$ and $u_t$ be the state and action at time step $t$, respectively; then, the network of the forward model $f$ with weights $\theta$ is formulated as follows:

$$x_{t+1} = f(x_t, u_t; \theta) \quad (2)$$

By collecting a dataset of $x_{t+1}$, $x_t$, and $u_t$, we train the forward model in a manner similar to Stochastic Gradient Descent (SGD) [26], one of the simplest network training algorithms, with learning rate $\epsilon_f$:

$$\theta \leftarrow \theta - \epsilon_f \, \partial J(x^{data}_{t+1}, x^{pred}_{t+1}) / \partial \theta \quad (3a)$$
$$J(x^{data}_{t+1}, x^{pred}_{t+1}) = \frac{1}{2} \| x^{data}_{t+1} - x^{pred}_{t+1} \|^2 \quad (3b)$$

$J$ is the loss function for training the network, and $x^{pred}$ and $x^{data}$ are the predicted and dataset states, respectively. We can easily calculate $\partial J / \partial \theta$ with backpropagation and the chain rule of neural networks.
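As an illustration, a minimal PyTorch sketch of Eqs. (2) and (3) could look as follows; the names `ForwardModel` and `train_step` and the layer configuration are placeholders for illustration, not necessarily the exact setup used in our experiments:

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Forward dynamics model x_{t+1} = f(x_t, u_t; theta) (Eq. 2)."""
    def __init__(self, state_dim, action_dim, hidden=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),  # linear output layer
        )

    def forward(self, x_t, u_t):
        return self.net(torch.cat([x_t, u_t], dim=-1))

def train_step(model, optimizer, x_t, u_t, x_t1_data):
    """One SGD-style update of theta with the loss of Eq. (3b)."""
    x_t1_pred = model(x_t, u_t)
    # 0.5 * ||x_data - x_pred||^2, summed over state dims, averaged over batch
    loss = 0.5 * ((x_t1_data - x_t1_pred) ** 2).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()          # dJ/dtheta via backpropagation (Eq. 3a)
    optimizer.step()
    return loss.item()
```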

B. Disturbance Estimation

The aim of IB-DOB is to estimate the disturbance $d_t$ using the forward model described as a neural network. This is formulated as the following optimization problem:

$$\hat{u}_t = \operatorname{argmin}_{u_t} \, J(x^{obs}_{t+1}, x^{pred}_{t+1}) \quad (4a)$$
$$\text{s.t.} \quad J(x^{obs}_{t+1}, x^{pred}_{t+1}) = \frac{1}{2} \| x^{obs}_{t+1} - x^{pred}_{t+1} \|^2 \quad (4b)$$

Fig. 2. IB-DOB combined with the policy of a learning-based approach. The disturbances are estimated via iterative backpropagation with the forward model.

$J$ is the cost function to estimate the action $\hat{u}$, and $x^{pred}$ and $x^{obs}$ are the predicted and observed states, respectively. This calculation is conducted at time step $t+1$ after observing $x^{obs}_{t+1}$. The formulation represents a nonlinear optimization problem, thus requiring a prohibitive amount of time to solve accurately. Therefore, we solve this problem approximately using sequential optimization with an update rate $\epsilon_u$ as follows:

$$g = \partial J(x^{obs}_{t+1}, x^{pred}_{t+1}) / \partial u_t \quad (5a)$$
$$\hat{u}_t \leftarrow \hat{u}_t - \epsilon_u \, g / \|g\|_2 \quad (5b)$$

$\hat{u}_t$ is initialized with $u_t$, and $\|\cdot\|_2$ represents the L2 norm. $\partial J(x^{obs}_{t+1}, x^{pred}_{t+1}) / \partial u_t$ can be calculated via the chain rule through backpropagation of the network with the loss $J(x^{obs}_{t+1}, x^{pred}_{t+1})$. The forward propagation to predict $x^{pred}_{t+1}$ and the update of the action with the gradient of the cost function via backpropagation in Eq. (5) are repeated several times so that the estimated action converges. Finally, the disturbance at time step $t$ is estimated as:

$$\hat{d}_t \leftarrow \hat{u}_t - (u_t - k_d \hat{d}_{t-1}), \quad (6)$$

where $k_d$ is a coefficient for the smoothing filter to avoid noisy estimation.
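A minimal PyTorch sketch of this estimation loop (Eqs. (4)-(6)) is shown below; the default values of the iteration count, update rate, and smoothing coefficient are illustrative assumptions rather than reported settings:

```python
import torch

def estimate_disturbance(model, x_obs_t, x_obs_t1, u_t, d_prev,
                         n_update=30, eps_u=0.05, k_d=0.9):
    """Estimate the disturbance at time step t via iterative backpropagation.

    model approximates the forward dynamics f(x_t, u_t). eps_u and k_d
    are assumed values; n_update corresponds to N^u_update.
    """
    u_hat = u_t.clone().detach().requires_grad_(True)     # initialize u_hat with u_t
    for _ in range(n_update):
        x_pred_t1 = model(x_obs_t, u_hat)                 # forward propagation
        loss = 0.5 * ((x_obs_t1 - x_pred_t1) ** 2).sum()  # Eq. (4b)
        g, = torch.autograd.grad(loss, u_hat)             # Eq. (5a): dJ/du via backprop
        with torch.no_grad():
            u_hat -= eps_u * g / (g.norm() + 1e-8)        # Eq. (5b): normalized update
    return (u_hat - (u_t - k_d * d_prev)).detach()        # Eq. (6): smoothed estimate
```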

C. IB-DOB with Learning-based Approaches

IB-DOB is calculated independently of how the action is generated, so it can be applied to any learning-based policy obtained from model-free or model-based learning control. Let $\pi(x_t)$ be the policy of a learning-based approach; then, the action is calculated as follows:

$$u_t = \pi(x_t) \quad (7)$$

Then, we update the action with the estimated disturbance as follows:

$$u_t \leftarrow u_t - k_d \hat{d}_{t-1} \quad (8)$$

The calculation process is described in Fig. 2, and we organize the whole process of IB-DOB with a learning-based policy in Algorithm 1.

Algorithm 1 Calculation of Action and Disturbance
Require: Policy $\pi(x)$, e.g. learned by DDPG or DeepMPC
  $(x^{obs}_{t+1}, x^{obs}_t, u_t) \leftarrow$ observe
  $\hat{u}_t \leftarrow u_t$   ▷ initialize $\hat{u}_t$
  for $k = 1, 2, \ldots, N^u_{update}$ do
    $x^{pred}_{t+1} \leftarrow f(x^{obs}_t, \hat{u}_t)$
    $\hat{u}_t \leftarrow \mathrm{UpdateControlInput}(x^{obs}_{t+1}, x^{pred}_{t+1}, \hat{u}_t)$   ▷ Eq. (5)
  end for
  $\hat{d}_t \leftarrow \hat{u}_t - (u_t - k_d \hat{d}_{t-1})$   ▷ Eq. (6)
  $u_{t+1} \leftarrow \pi(x^{obs}_{t+1})$   ▷ Eq. (7)
  $u_{t+1} \leftarrow u_{t+1} - k_d \hat{d}_t$   ▷ Eq. (8)

$N^u_{update}$ is the number of times the forward and backpropagation are repeated to estimate the action within a control cycle.
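To make the integration concrete, a hypothetical control loop implementing Algorithm 1 might look as follows; the `env`/`policy` interface, the function name `run_with_ib_dob`, and the hyperparameter values are assumptions for illustration, and `estimate_disturbance` is the sketch from Section III-B:

```python
import torch

def run_with_ib_dob(env, policy, model, action_dim, horizon=200, k_d=0.9):
    """One rollout following Algorithm 1 with a generic env/policy interface."""
    d_hat = torch.zeros(action_dim)                 # no disturbance estimate yet
    x_obs = env.reset()
    u = policy(x_obs) - k_d * d_hat                 # Eq. (7), then compensation (Eq. 8)
    for _ in range(horizon):
        x_obs_next = env.step(u)                    # apply compensated action, observe
        # inner loop of Algorithm 1: iterative backprop through the forward model
        d_hat = estimate_disturbance(model, x_obs, x_obs_next, u, d_hat, k_d=k_d)
        x_obs = x_obs_next
        u = policy(x_obs) - k_d * d_hat             # compensate the next action (Eq. 8)
    return d_hat
```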

IV. EXPERIMENTS

To validate our method, we performed simulation and real-robot experiments. The goal of the experiments was to investigate whether our method could estimate the disturbances more accurately, leading to better performance than the baselines without DOB or with a DOB learned with the inverse model. We tackled cart-pole and object-pushing tasks in simulation. Then, for the object-pushing task, we tested the model learned in simulation in a real-world environment.

A. Cart Pole

We conducted experiments on the cart-pole task in simulation using CartPole-v0 in OpenAI Gym [27]. We extended it such that the action can be calculated continuously, as shown in Fig. 3 (a). The action represents the horizontal force required to move the cart, and the state represents the position of the cart and the angle of the pole, together with their derivatives. First, we trained a learning-based policy and the forward model simultaneously over 1000 episodes; then, we tested the performance of cart-pole stabilization. For the learning-based policy, we adopted DDPG [10], as it is a widely used method for reinforcement learning.

We compared three types of methods:
1) DDPG with no disturbance compensation (w/o DOB),
2) DDPG with a direct DOB using the inverse model as formulated in Eq. (1), typically used in previous studies [5], [6] (w/ Direct DOB),
3) DDPG with IB-DOB (w/ IB-DOB).

We employed a DDPG implementation based on PyTorch [28] and extended it for DOB. We designed two fully-connected layers of 50 units for both the policy and value networks. ReLU activations were used after each hidden layer; the output layer of the value network consisted of linear units, while the policy network had tanh output units. The forward model of IB-DOB and the inverse model of Direct DOB had three fully-connected layers of 50 units with ReLU activations, except for the output layer. These three networks were trained with the ADAM optimizer [29].

Fig. 3. Experimental setup: (a) the cart pole; (b) pushing objects in simulation, where the object is pushed forward toward the goal; (c) pushing objects in the real world, with six motion-capture cameras and three markers.

TABLE I
DISTURBANCES OF EXTERNAL FORCES FOR CART POLE

                  amplitude [N]   width [s]
External Force1        15            1
External Force2        25            0.2
External Force3        20            0.6

TABLE II
DISTURBANCES OF MODEL ERRORS FOR CART POLE

               cart mass [kg]   pole mass [kg]   pole length [m]
Model Error1        4.5              3.0               0.3
Model Error2       10.5              0.5               0.4
Model Error3        4.5              6.0               0.4

We set the number of iterations of the forward and backpropagation, $N^u_{update}$, to 100.

We evaluated six types of disturbances: three external forces and three model errors, as shown in Tables I and II. To simulate external forces, we applied a pushing force to the cart as a step function, varying its magnitude and width. To simulate model errors, we changed the cart mass, pole mass, and pole length.

We show the mean squared errors of the pole angle in Table III, as well as graphs of the state error, disturbance, and estimated disturbance over time in Fig. 4. In almost all cases, w/ IB-DOB performed favorably compared to the other methods.

B. Pushing Objects in Simulation

We conducted simulation experiments on the object-

pushing task using Gazebo [30] as shown in Fig. 3 (b). The

action urepresents the position to push the edge of the object

which is normalized from 0 to 1, while the state represents

the two-dimensional pose of the object (x, y, ϕ)as shown in

Fig. 5 (a). The aim of this task is to push the object forward

until the xposition of the object reaches the goal. First,

we collected the dataset of {(xτ+1,xτ,uτ)}τ, and trained a

learning-based policy and the forward model simultaneously;

then, we conducted simulations on pushing objects. For the

learning-based policy, we adopted an effective supervised-

TABLE III
MEAN SQUARED ERROR BETWEEN REFERENCE AND OBSERVED STATE FOR CART POLE EXPERIMENTS (×10−2)

Method            w/o DOB   w/ Direct DOB   w/ IB-DOB
Origin              0.30         0.29          0.30
External Force1    33.30         1.49          1.49
External Force2   130.46        79.16         45.09
External Force3    94.96         8.11          5.74
Model Error1        1.48         1.24          1.14
Model Error2        2.46         1.68          1.59
Model Error3        1.90         1.33          1.29

We calculated the optimal action by minimizing the cost function, which was the difference between the next predicted state $x^{pred}_{t+1}$ and its reference state $x^{ref}_{t+1}$. This optimization problem can be formulated as follows:

$$\min_{u_t} \, J(x^{ref}_{t+1}, x^{pred}_{t+1}) \quad (9a)$$
$$\text{s.t.} \quad x^{pred}_{t+1} = f(x_t, u_t) \quad (9b)$$
$$J(x^{ref}_{t+1}, x^{pred}_{t+1}) = \frac{1}{2} \| x^{ref}_{t+1} - x^{pred}_{t+1} \|^2 \quad (9c)$$

This was also solved by iterative forward and backpropagation as in Eq. (5). The number of iterations was 50.
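Under the same assumptions as the sketches in Section III, the DeepMPC action optimization of Eq. (9) can be written as gradient descent on the action through the shared forward model; the function name `deepmpc_action` and the step size `eps` are illustrative choices, not reported values:

```python
import torch

def deepmpc_action(model, x_t, x_ref_t1, u_init, n_iter=50, eps=0.05):
    """Optimize the action toward a reference state (Eq. 9) by iterative
    forward/backpropagation, reusing the forward model shared with IB-DOB."""
    u = u_init.clone().detach().requires_grad_(True)
    for _ in range(n_iter):
        x_pred_t1 = model(x_t, u)                         # Eq. (9b)
        loss = 0.5 * ((x_ref_t1 - x_pred_t1) ** 2).sum()  # Eq. (9c)
        g, = torch.autograd.grad(loss, u)
        with torch.no_grad():
            u -= eps * g / (g.norm() + 1e-8)              # gradient step on the action
    return u.detach()
```

Sharing one forward model between this action optimization and IB-DOB's disturbance estimation is what makes the method particularly convenient in model-based settings.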

For comparison, we used the same three types of methods (w/o DOB, w/ Direct DOB, and w/ IB-DOB) as in the cart-pole experiments.

To implement DeepMPC, we used a forward model consisting of two fully-connected layers of 50 units with ReLU activations, except for the output layer. This forward model was also used for IB-DOB. We set the number of iterations of the forward and backpropagation, $N^u_{update}$, to 30 so that the calculation time did not exceed the control period.

We evaluated four types of disturbances: two for external forces and two for model errors. To simulate external forces, we pushed the object during control by adding a random torque and force. To simulate model errors, we changed the center of gravity and shape of the object. The details of the training and test objects are shown in Fig. 5 (b).

Fig. 4. State error, disturbance, and estimated disturbance over time in the cart-pole experiments for (a) External Force2, (b) External Force3, and (c) Model Error3, comparing w/o DOB, w/ Direct DOB, and w/ IB-DOB. Our method showed smaller state errors than the baselines without DOB and with Direct DOB in most of the disturbance conditions.

Fig. 5. Details of the pushed objects: (a) the object-pushing setup with object pose $(x, y, \phi)$; (b) the objects in simulation (Origin, Model Error1, Model Error2); (c) the objects in the real world (Origin, Model Error1, Model Error2, Model Error3). The black and white dots are the centers of gravity, and the red arrow describes the pushing action.

TABLE IV
MEAN SQUARED ERROR BETWEEN REFERENCE AND OBSERVED STATE FOR PUSHING OBJECTS IN SIMULATION (×10−4)

Method            w/o DOB         w/ Direct DOB    w/ IB-DOB
Origin             1.74 ± 0.18     1.21 ± 0.74     1.03 ± 0.35
External Force1    1.68 ± 0.16     1.06 ± 0.56     0.686 ± 0.19
External Force2   71.26 ± 20.61   70.46 ± 43.56   58.37 ± 22.83
Model Error1      68.67 ± 37.08   12.54 ± 5.97     8.03 ± 5.36
Model Error2       4.35 ± 0.60     3.78 ± 1.62     1.60 ± 0.80

First, we trained our model with the Origin object. Then, we tested under the Origin, External Force1 (random force), and External Force2 (random torque) conditions, and with the Model Error1 (shifted center of mass) and Model Error2 (different shape) objects.

We show the mean squared errors and standard deviations of the object angle $\phi$ from the reference value (0 degrees) over five trials in Table IV. The variance of the errors likely arose from the latency of the Gazebo simulator's heavy calculation processes. Under all conditions, w/ IB-DOB performed favorably compared to the other methods.

C. Pushing Objects in the Real World

Finally, we conducted experiments on the object-pushing task with a robot in the real world. The details of the object-pushing task and its policy, the forward model, the comparative methods, and the calculation process were the same as those of the object-pushing simulation experiments. We trained the forward model in simulation and then tested it in the real world. We exploited the model errors between the simulation and the real world as the disturbance, since quantitative evaluation of external-force disturbances is difficult in the real world.

TABLE V
MEAN SQUARED ERROR BETWEEN REFERENCE AND OBSERVED STATE FOR PUSHING OBJECTS IN REAL WORLD (×10−4)

Method          w/o DOB        w/ Direct DOB   w/ IB-DOB
Origin           2.92 ± 0.51    2.54 ± 0.94     2.37 ± 0.49
Model Error1    15.60 ± 1.17   13.49 ± 2.90    11.51 ± 1.40
Model Error2    13.79 ± 4.73   11.83 ± 6.27     6.10 ± 2.28
Model Error3     4.42 ± 0.53    4.55 ± 0.81     3.03 ± 0.72

We used a motion-capture system to detect the object pose, as shown in Fig. 3 (c).

After training the policy and the forward model in simulation as in the object-pushing simulation experiments, we conducted experiments on pushing objects in the real world using DeepMPC with and without disturbance compensation. In addition, we evaluated three types of model-error disturbances by changing the center of gravity and the shape of the object, as shown in Fig. 5 (c).

The mean squared errors and standard deviations of the object angle $\phi$ from the reference value (0 degrees) over five trials are shown in Table V. Under all conditions, w/ IB-DOB performed favorably compared to the other methods.

V. DISCUSSIONS

We demonstrated that IB-DOB estimated the disturbance and increased the accuracy of learning-based approaches. It dealt with a wider range of disturbances, such as external forces and model errors, than the previous studies.

To compare IB-DOB with the methods used in previous studies [5], [6], we evaluated the direct DOB formulated in Eq. (1). As shown in the experimental tables, in some cases the direct DOB performed nearly as well as IB-DOB, although in other cases its performance was not as favorable. In addition, the standard deviation of the state error for the direct DOB is significantly larger than that of IB-DOB. We interpret this to suggest that the direct DOB can sometimes estimate the disturbance correctly, but not always.

Although our method demonstrated its effectiveness, several challenges remain. First, although our method was evaluated with several force disturbances such as constant and step inputs, we need to test it with noisier and faster disturbances. Moreover, we will investigate in what kinds of settings our method works better by increasing the number of test environments. Second, we need to address other environments with higher-dimensional systems. However, as previous approaches using iterative backpropagation [7], [8], [9] could perform complicated tasks, our method should work well even in such complex systems. Although ensuring stability in a fully data-driven domain is one of the grand challenges, we will explore how to analyze the robustness and stability of our method.

VI. CONCLUSION

In this study, we proposed a learning-based disturbance observer named IB-DOB to eliminate disturbances such as external forces and model errors in complex dynamics systems. Instead of estimating disturbances with an inverse dynamics model, it utilizes the forward model and iterative backpropagation. We evaluated our method in two manipulation tasks with two learning-based approaches. Results demonstrated that IB-DOB accurately estimated disturbances and achieved more accurate manipulation than both conventional learning-based DOB approaches using the inverse model and methods without any DOB.

REFERENCES

[1] Emre Sariyildiz, Roberto Oboe, and Kouhei Ohnishi. Disturbance observer-based robust control and its applications: 35th anniversary overview. IEEE Transactions on Industrial Electronics, Vol. 67, No. 3, pp. 2042–2053, 2019.
[2] Wen-Hua Chen, Jun Yang, Lei Guo, and Shihua Li. Disturbance-observer-based control and related methods—an overview. IEEE Transactions on Industrial Electronics, Vol. 63, No. 2, pp. 1083–1095, 2015.
[3] Xinkai Chen, Satoshi Komada, and Toshio Fukuda. Design of a nonlinear disturbance observer. IEEE Transactions on Industrial Electronics, Vol. 47, No. 2, pp. 429–437, 2000.
[4] Julian Ibarz, Jie Tan, Chelsea Finn, Mrinal Kalakrishnan, Peter Pastor, and Sergey Levine. How to train your robot with deep reinforcement learning: lessons we have learned. The International Journal of Robotics Research, 2021.
[5] Juan Li, Xisong Chen, Shihua Li, Jun Yang, and Jiancheng Zhou. NNDOB-based composite control for binary distillation column under disturbances. In IEEE Chinese Control Conference, pp. 592–597, 2013.
[6] Juan Li, Shihua Li, and Xisong Chen. Adaptive speed control of a PMSM servo system using an RBFN disturbance observer. Transactions of the Institute of Measurement and Control, Vol. 34, No. 5, pp. 615–626, 2012.
[7] Ian Lenz, Ross A. Knepper, and Ashutosh Saxena. DeepMPC: Learning deep latent features for model predictive control. In Robotics: Science and Systems, 2015.
[8] Daisuke Tanaka, Solvi Arnold, and Kimitoshi Yamazaki. EMD Net: An encode–manipulate–decode network for cloth manipulation. IEEE Robotics and Automation Letters, Vol. 3, No. 3, pp. 1771–1778, 2018.
[9] Takayuki Murooka, Kei Okada, and Masayuki Inaba. Diabolo orientation stabilization by learning predictive model for unstable unknown-dynamics juggling manipulation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 9174–9181, 2020.
[10] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[11] Kento Kawaharazuka, Kei Tsuzuki, Shogo Makino, Moritaka Onitsuka, Koki Shinjo, Yuki Asano, Kei Okada, Koji Kawasaki, and Masayuki Inaba. Task-specific self-body controller acquisition by musculoskeletal humanoids: application to pedal control in autonomous driving. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 813–818, 2019.
[12] Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In International Conference on Machine Learning, pp. 2817–2826, 2017.
[13] Addisu Tesfaye, Ho Seong Lee, and Masayoshi Tomizuka. A sensitivity optimization approach to design of a disturbance observer in digital motion control systems. IEEE/ASME Transactions on Mechatronics, Vol. 5, No. 1, pp. 32–38, 2000.
[14] Wen-Hua Chen, Donald J. Ballance, Peter J. Gawthrop, and John O'Reilly. A nonlinear disturbance observer for robotic manipulators. IEEE Transactions on Industrial Electronics, Vol. 47, No. 4, pp. 932–938, 2000.
[15] Kwang Sik Eom, Il Hong Suh, Wan Kyun Chung, and S.-R. Oh. Disturbance observer based force control of robot manipulator without force sensor. In IEEE International Conference on Robotics and Automation, Vol. 4, pp. 3012–3017, 1998.
[16] Dawei Huang, Junyong Zhai, Weiqing Ai, and Shumin Fei. Disturbance observer-based robust control for trajectory tracking of wheeled mobile robots. Neurocomputing, Vol. 198, pp. 74–79, 2016.
[17] Wei Dong, Guo-Ying Gu, Xiangyang Zhu, and Han Ding. High-performance trajectory tracking control of a quadrotor with disturbance observer. Sensors and Actuators A: Physical, Vol. 211, pp. 67–77, 2014.
[18] Shirin Yousefizadeh and Thomas Bak. Nonlinear disturbance observer for external force estimation in a cooperative robot. In IEEE International Conference on Advanced Robotics, pp. 220–226, 2019.
[19] Akiyuki Hasegawa, Hiroshi Fujimoto, and Taro Takahashi. Filtered disturbance observer for high backdrivable robot joint. In Annual Conference of the IEEE Industrial Electronics Society, pp. 5086–5091, 2018.
[20] Barkan Ugurlu, Masayoshi Nishimura, Kazuyuki Hyodo, Michihiro Kawanishi, and Tatsuo Narikiyo. Proof of concept for robot-aided upper limb rehabilitation using disturbance observers. IEEE Transactions on Human-Machine Systems, Vol. 45, No. 1, pp. 110–118, 2014.
[21] Junhai Huo, Tao Meng, and Zhonghe Jin. Adaptive attitude control using neural network observer disturbance compensation technique. In IEEE International Conference on Recent Advances in Space Technologies, pp. 697–701, 2019.
[22] Haibin Sun and Lei Guo. Neural network-based DOBC for a class of nonlinear systems with unmatched disturbances. IEEE Transactions on Neural Networks and Learning Systems, Vol. 28, No. 2, pp. 482–489, 2016.
[23] Bu Xuhui, Hou Zhongsheng, Yu Fashan, and Fu Ziyi. Model free adaptive control with disturbance observer. Journal of Control Engineering and Applied Informatics, Vol. 14, No. 4, pp. 42–49, 2012.
[24] Juan Li, Shihua Li, Shengquan Li, Xisong Chen, and Jun Yang. Neural-network-based composite disturbance rejection control for a distillation column. Transactions of the Institute of Measurement and Control, Vol. 37, No. 9, pp. 1146–1158, 2015.
[25] Juan Li, Shihua Li, and Shengquan Li. The design of robust MIMO neural network disturbance observer for multi-variable system. In IEEE Chinese Control Conference, pp. 2379–2383, 2014.
[26] Léon Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pp. 421–436. 2012.
[27] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[28] pytorch-ddpg-naf. https://github.com/ikostrikov/pytorch-ddpg-naf.
[29] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[30] Nathan Koenig and Andrew Howard. Design and use paradigms for Gazebo, an open-source multi-robot simulator. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2149–2154, 2004.