Multimodal Failure Prediction for Vision-based Manipulation Tasks
with Camera Faults
Yuliang Ma1, Jingyi Liu1, Ilshat Mamaev2, and Andrey Morozov1
Abstract: Modern Industrial Cyber-Physical Systems (ICPS) are becoming increasingly complex in both behavior and structure, which raises the likelihood of faults, errors, and failures. In addition, various environmental factors have an uncertain impact on the safety and reliability of ICPS and can cause economic losses or even hazardous events. Fault detection helps ICPS prevent dangerous events by flagging anomalies, but not every fault leads to a failure, owing to the fault tolerance mechanisms of the system. It is therefore unwise to shut down the system whenever an anomaly is detected. In practical applications, users need to know what consequences may follow a fault alert; this guides the decision on the next action (terminate or continue the task). In this paper, we propose a multimodal framework that achieves proactive failure detection for a vision-based manipulator operating with faulty images. Specifically, we randomly inject different types of faults into images (e.g., noise and blur) and observe the failure scenarios (e.g., pick failure, place failure, and collision) that can occur during a pick-and-place task. A FAULT-to-FAILURE dataset is built via extensive fault injection experiments. Finally, our multimodal framework is trained to proactively evaluate whether a certain faulty image could lead to a manipulation failure. Source: MultimodalFailurePrediction.
I. INTRODUCTION
Industrial Cyber-Physical Systems (ICPS) play an important role in smart factories and manufacturing automation [1]. Dense hardware and software integration facilitates the computation, communication, and control capabilities of ICPS, enabling interaction between the cyber and physical worlds. However, such extensive integration of internal components also significantly increases the behavioral and structural complexity of ICPS. In addition, the complex and varying environment increases the uncertainty of the physical world. Therefore, different types of faults or errors are likely to occur in ICPS, and some of them can lead to failures that may damage systems or even harm humans. One promising way to prevent dangerous events is Deep Learning-based Anomaly Detection (DLAD) [2]. A typical example of DLAD for time-series data is the prediction-based approach, which uses currently observed sensor data (e.g., joint position, joint velocity, IMU data) to predict the next value; an anomaly is then flagged if the residual between the predicted value and the actual measurement exceeds a certain threshold.
This work is funded by the Ministry of Science, Research and Arts of the Federal State of Baden-Württemberg through the projects within the InnovationCampus Future Mobility (ICM).
1 The authors are with the Institute of Industrial Automation and Software Engineering, University of Stuttgart, 70550 Stuttgart, Germany. {yuliang.ma, jingyi.liu, andrey.morozov}@ias.uni-stuttgart.de
2 The author is with Karlsruhe University of Applied Sciences and with Proximity Robotics & Automation GmbH, Germany. mamaev@proximityrobotics.com
Fig. 1. The fault-error-failure chain adopted from [4]. In a typical vision-based manipulation task, various components (camera, object detection module, path planner, actuator, etc.) generate and deliver heterogeneous services (e.g., images, coordinates, and trajectories) to accomplish the task. However, once an error occurs in one component, the manipulation outcome is uncertain.
However, fault tolerance mechanisms allow ICPS to continue performing their tasks normally even when a fault or an error occurs. Hence, it is not a wise choice to terminate the task whenever a fault alert is received. According to the definition of risk analysis in [3], a risk analysis should address the following three questions: 1) What can happen? 2) How likely is it to happen? 3) If it does happen, what are the consequences? In this paper, we focus on the third question: if a component has faults or errors, what subsequent consequences (success or failure) could the given task face?
Fig. 1 illustrates the fault propagation process in a system. Following the definitions of faults, errors, and failures in [4], we summarize the basic features of these three concepts and illustrate them with examples as follows:
• Fault: An active fault is the origin of an error, e.g., environmental factors cause the camera to become contaminated or damaged.
• Error: The deviation between the correct service and the incorrect service is called an error, e.g., a damaged camera sends an abnormal image to the subsequent object detection module. This error may lead to a failure.
• Failure: A failure happens when the delivered service deviates from the correct service, e.g., the object detection module receives faulty images and outputs incorrect object locations. A failure can in turn deliver an error to the subsequent component (e.g., the task execution module of a manipulator) and cause another failure.
For a system, the attributes of the fault or error (e.g., location,
Fig. 2. Real-world scenario and manipulation failures: successful execution, inaccurate placing, grasp failure, and collision.
type, magnitude, and timing) strongly influence the likelihood of a failure. Anomaly detection methods can recognize such faults or errors by evaluating deviations in sensor data, but they do not explain whether the detected fault will have a negative impact on the task or system. An extreme example is a large-scale ICPS that receives numerous alerts from the anomaly detector while the effects of the underlying faults remain unclear.
Motivated by this background, in this paper we focus on assessing the potential effect that different faults could have on the given task, rather than on anomaly detection alone. To this end, we propose a multimodal framework to predict 1) whether there is an image fault, and 2) whether a detected fault could lead to an execution failure. As such, the proactive failure detection task is naturally cast as a classification problem.
In this work, we choose the Franka Emika Panda manipulator, a typical ICPS application, as our robotic platform for conducting fault injection experiments and data collection. The manipulator is equipped with a wrist-mounted FRAMOS D435e depth camera and uses the MoveIt path planner. We employ the Robot Operating System (ROS) as the middleware. The manipulation task is defined as a ‘Color Match Game’ scenario in which the manipulator picks a cube and places it onto the plate with the corresponding color. Since images are not always clear in practical applications [5], we inject faulty images into this manipulation task and observe the resulting failures. Fig. 2 shows the real-world setup and the actual failure scenarios we observed. The proposed framework predicts the probability of failure based on the faulty images and fused information. Our main contributions can be summarized as follows:
• A novel multimodal framework based on a deep neural network is proposed. The framework effectively fuses the original faulty images with other information (e.g., object detection results, planned trajectory) and proactively predicts the probability of failure before the task is executed. Compared with state-of-the-art failure detection methods, our framework requires fewer sensors and achieves better performance (F1-score) as well.
• A ROS-based image fault injection method is proposed, which can inject two types of faults: noise and blur. Each fault type has parameters that define the fault magnitude. The method can be easily deployed in both simulation (Gazebo) and the real world.
• A FAULT-to-FAILURE dataset is published. The dataset records multimodal information (faulty images, normal images, planned paths, object detection results, etc.) from the real world. To the best of our knowledge, there is no publicly released dataset regarding the error propagation problem of manipulators.
II. RELATED WORKS
A. ROS-based Fault Injection
For safety-critical robotic applications, fault injection is a promising risk analysis method, and various notable works on ROS-based fault injection have been presented. For navigation tasks, a fault injection campaign is carried out with a ROS/Gazebo simulation in [6] to test the proposed fault tolerance strategy. In [7], an application-aware resilience analysis framework, MAVFI, is proposed; it supports fault injection via Linux system calls and the ROS communication protocol. ROS-based fault injection methods also exist for robotic manipulation tasks. In [8] and [9], cyber-physical attacks are demonstrated by injecting faults into the control system of teleoperated surgical robots. In our previous work [10], configurable time-series faults were injected into a manipulator, and we found that not every fault leads to a manipulation failure. We therefore continue to investigate the error propagation problem for a vision-based manipulator in this work.
B. Multimodal-based Failure Detection
Failure detection focuses on determining whether the task execution of a robot has failed; in some literature, execution monitoring is used interchangeably with failure detection. Inceoglu et al. introduce a series of works on multimodal failure detection for manipulators. The work in [11] presents FINO-Net, which fuses RGB images, depth images, and audio readings. This end-to-end framework gives a binary failure detection result (i.e., success or failure) during different manipulation tasks. As an extension, they further investigate a version of FINO-Net that classifies different failure modes (e.g., collision, missing the target, overturning) during execution [12]. Additionally, for a robot-assisted feeding task, an LSTM-based variational autoencoder is adopted to fuse multimodal inputs including images, microphone readings, and joint states [13], [14]; anomalous behaviors of the manipulator can be detected via the fusion of this high-dimensional data. In other works, [15] and [16], multimodal inputs are used to detect book manipulation failures and grasp failures, respectively. These methods are reactive: if a manipulation failure has happened or is happening, the current observation from multiple sensors deviates from past correct patterns. Although such methods can capture these sensory deviations and detect failures, the failure has already occurred. As such, reactive failure detection is not an effective way to prevent the failure and protect the system.
Another promising direction is proactive failure detection. This approach combines the current sensory observation and the planned actions to predict the probability of future failures. For robot navigation and autonomous driving, researchers have proposed several related works. Ji et al. propose the PAAD network to proactively detect anomalous behaviors of a field robot platform [17]. The multimodal network fuses information from the planned path, RGB images, and 2D point clouds to predict the probability of failure (e.g., collision or no collision) over the next several time steps. This predictive model has shown dependable failure detection performance in a real robot navigation task in a complex and cluttered field environment. Similar work to PAAD is proposed in LaND [18] and BADGR [19]: conditioned on a sequence of future control actions, the neural network takes an RGB image as input and predicts the probability of collision within a prediction horizon. Despite the comprehensive exploration of proactive failure detection in the mobile robot domain, we consider the same research question for a manipulator from the error propagation perspective. Our starting point differs from the above approaches, but we borrow ideas from both. We focus on proactively predicting the execution consequence when the manipulator has an error. In this way, faults and errors that potentially threaten the system can be recognized to ensure safety, while harmless ones can be ignored to preserve working efficiency.
III. METHOD
Our framework aims to alleviate the conflict between safety and efficiency. For a complex system running in real time, an anomaly detector ensures safety but lacks an assessment of whether the detected fault could cause a failure, while frequent downtime for inspections reduces efficiency and increases running costs. Therefore, a proactive failure predictor is developed to evaluate the probability of failure occurrence once a fault is detected. As a case study, a typical workflow for a vision-based manipulator includes several basic steps: 1) Image Acquisition; 2) Feature Extraction; 3) Object Detection; 4) Path Planning; and 5) Task Execution. The proposed multimodal framework fuses the faulty image (from step 1), the object detection result (from step 3), and the planned path information (from step 4) to predict the probability of manipulation failure. Since the framework gives its prediction before the execution starts (i.e., before the abnormal behavior occurs), we call it a ‘proactive’ framework. We describe the problem definition, data collection, and multimodal framework architecture in the following sections.
TABLE I
FAULT PARAMETERS

Fault type    Attribute (parameter)    Set notation
Blur          Kernel size              K = {k | 50 ≤ k ≤ 70, k ∈ ℕ}
Noise         Variance                 V = {v | 1.5 ≤ v ≤ 3.0}
Fig. 3. Faulty images and object detection results for a normal image, blurred images with K = 50, 60, 70, and noisy images with V = 1.5, 2.0, 2.5, 3.0.
A. Problem Definition
In this work, the failure detection task is considered as a binary classification problem with multimodal input. A function can be defined as:

$$\hat{y}_r = f(x_r^1, x_r^2, x_r^3, \dots, x_r^m), \quad r \in \mathbb{N} \tag{1}$$

where $r$ is the round index of executions, $m \in \{1, 2, 3, \dots\}$ is the modality index, and $\hat{y} \in \{\mathrm{success}, \mathrm{failure}\}$ is the predicted class label. Proactive failure detection is achieved by learning a function $f(\cdot)$ that maps all required multimodal data to the label.
Fig. 4. Data collection process. For each round of the pick-and-place task, only one type of image fault is injected. During this process, the faulty image, masked image, and waypoints are recorded into the dataset.
Fig. 5. Framework architecture overview of the real-time implementation. The framework has two phases in online applications: an anomaly detection phase and a failure prediction phase. The anomaly detector checks every acquired image and gives a fault report (abnormal or normal). The failure predictor works once a faulty image is detected: the faulty image, masked image, and planned waypoints are fed to the predictor, which gives a failure prediction report (success or fail).
B. Data Collection
We inject faults into the robotic system by manipulating the original, fault-free image. Two typical image faults are considered: Noise and Blur. We define the fault space using the parameters listed in Table I, and representative faulty samples are shown in Fig. 3. The fault parameters are described as follows (a minimal injection sketch follows this list):
• Blur can occur due to vibration in the surrounding environment of the manipulator. We inject blurred images by manipulating a configurable Kernel size parameter.
• Noise can result from electronic interference or environmental factors. We inject Gaussian noise into the original image's pixel values by randomly changing the Variance.
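To make the two fault models concrete, the following is a minimal injection sketch using OpenCV and NumPy. The box-blur kernel, the scaling of the noise variance relative to the normalized pixel range, and the synthetic placeholder frame are illustrative assumptions; they are not taken from the released fault injector.

```python
import cv2
import numpy as np

def inject_blur(image_bgr, kernel_size):
    """Blur fault: kernel_size plays the role of K in Table I (50-70)."""
    return cv2.blur(image_bgr, (kernel_size, kernel_size))

def inject_noise(image_bgr, variance):
    """Noise fault: zero-mean Gaussian noise; variance plays the role of V in
    Table I (1.5-3.0). The scaling w.r.t. the [0, 1] pixel range is assumed."""
    img = image_bgr.astype(np.float32) / 255.0
    img += np.random.normal(0.0, np.sqrt(variance), img.shape)
    return (np.clip(img, 0.0, 1.0) * 255).astype(np.uint8)

# One fault type per pick-and-place round, as in the data collection loop.
raw = np.full((480, 640, 3), 127, dtype=np.uint8)   # placeholder frame
faulty = (inject_blur(raw, np.random.randint(50, 71))
          if np.random.rand() < 0.5
          else inject_noise(raw, np.random.uniform(1.5, 3.0)))
```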
Fig. 4 shows the data collection process in the simulator. During this process, the faulty image $x_i$ and the masked image $x_m$ are obtained as 480 × 640 RGB images. The masked image is generated using HSV boundaries and the OpenCV library. The planned trajectory $x_p$ is a series of waypoints consisting of the expected position of each joint; this trajectory is generated by the onboard MoveIt path planning library using the Rapidly-exploring Random Trees (RRT) algorithm. All these data can be obtained before the manipulator starts moving. The ground truth $y$ is a binary label based on manual observation; in our case, we determine $y$ via the final distance $d$ between the central point of the cube and the central point of the plate.
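The masked image and the distance-based ground truth can be sketched as follows. The HSV boundaries, the use of pixel distance, and the success threshold are assumptions made for illustration; the exact values used in the dataset are not claimed here.

```python
import cv2
import numpy as np

# Hypothetical HSV range for one cube/plate color and a hypothetical success
# threshold in pixels; the real boundaries and threshold are assumptions.
LOWER_HSV = np.array([0, 120, 70])
UPPER_HSV = np.array([10, 255, 255])
SUCCESS_DIST_PX = 25

def color_mask(image_bgr, lower=LOWER_HSV, upper=UPPER_HSV):
    """Binary mask of one color region, as used to build the masked image x_m."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    return cv2.inRange(hsv, lower, upper)

def masked_image(image_bgr):
    mask = color_mask(image_bgr)
    return cv2.bitwise_and(image_bgr, image_bgr, mask=mask)

def centroid(mask):
    m = cv2.moments(mask)
    return None if m["m00"] == 0 else np.array([m["m10"] / m["m00"],
                                                m["m01"] / m["m00"]])

def label_outcome(cube_mask, plate_mask):
    """Binary ground truth y from the final cube-to-plate distance d."""
    c, p = centroid(cube_mask), centroid(plate_mask)
    if c is None or p is None:
        return "failure"
    return "success" if np.linalg.norm(c - p) < SUCCESS_DIST_PX else "failure"
```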
C. Multimodal Framework Architecture
Fig. 5 shows the overview of the multimodal framework. The framework consists of two working modules: the anomaly detection (AD) module and the failure prediction (FP) module. The input of the AD module is the RGB image only, either normal or abnormal. Based on the AD result, the FP module estimates the probability of failure and decides whether to terminate or continue the task. The FP module takes the heterogeneous services generated by each component as inputs. In general, it fuses faulty images (from the RGB camera), masked images (from the object detection module), and planned waypoints (from the path planner) in parallel at the feature level. Based on these observations, the FP module predicts whether a detected faulty image could lead to a manipulation failure.
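The online interplay of the two modules can be summarized in a short decision sketch. The module interfaces, function names, and the 0.5 decision threshold are placeholders rather than the released implementation.

```python
def handle_frame(image, masked_image, waypoints,
                 anomaly_detector, failure_predictor, p_fail_threshold=0.5):
    """Two-phase online logic: AD checks the raw image; FP is triggered only
    when the image is flagged as faulty and decides whether to stop the task."""
    # Phase 1: anomaly detection on the RGB image only.
    if not anomaly_detector.is_anomalous(image):
        return "continue"        # normal image: execute the planned task

    # Phase 2: failure prediction from the fused modalities.
    p_fail = failure_predictor.predict_proba(image, masked_image, waypoints)
    return "terminate" if p_fail >= p_fail_threshold else "continue"
```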
1) Autoencoder-based Anomaly Detector (AEAD): We utilize an autoencoder network for unsupervised image anomaly detection. First, we sampled 5000 normal images, with the locations of all cubes and plates randomized within a predefined area for each sample. All normal images are then compressed and used for training. During training, the Mean Squared Error (MSE) loss over pixel-wise values is adopted to optimize the reconstruction performance. Finally, the trained AEAD is deployed as an anomaly detector: it reconstructs every input image and calculates the reconstruction error. If the error is higher than a certain threshold θ, the AEAD raises a fault flag indicating that the input image is an anomaly. The encoder and decoder have a symmetrical structure consisting of three convolutional blocks; in each block, a max pooling layer is employed after the convolution operation.
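A minimal PyTorch sketch of such an autoencoder-based detector is given below. The three convolution-plus-max-pooling blocks, the symmetric decoder, and the MSE criterion follow the description above, while the channel widths and the threshold value are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AEAD(nn.Module):
    """Autoencoder-based anomaly detector: three conv blocks with max pooling,
    mirrored by a decoder. Channel widths here are illustrative assumptions."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 2, stride=2), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def is_anomalous(model, image, threshold=0.01):
    """Raise a fault flag if the pixel-wise MSE reconstruction error exceeds
    the threshold (the value of theta here is an assumption)."""
    with torch.no_grad():
        return nn.functional.mse_loss(model(image), image).item() > threshold

# Training uses only fault-free images, e.g.:
#   loss = nn.functional.mse_loss(model(batch), batch); loss.backward()
```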
2) Multimodal Failure Predictor (MFP): The MFP has two branches: an image branch and a waypoints branch. In the image branch, faulty images and masked images are processed by ImgCNN and MaskCNN, respectively. Both feature maps are flattened after the convolution operations, and a concatenation operation fuses the data from the camera (faulty images) and the object detection module (masked images). The other branch processes the planned waypoints generated by the onboard path planner. Before the task is executed, the waypoints of the pick and place actions are already computed; this is the key that allows the planned path information to be fused for proactive prevention of manipulation failures. However, for a vision-based manipulation task an Inverse Kinematics (IK) solver is often used, and it can generate multiple solutions for the same expected end-effector coordinates. In other words, the planned trajectory (i.e., the waypoints) of each joint may vary greatly even though the detected object coordinates are the same. As such, using the noisy waypoints directly as input is inefficient for training. Inspired by the idea of the bird's eye view (BEV), we introduce a Waypoints2Img module that preprocesses the input waypoints and transforms them into a projected image from the camera's perspective.
Specifically, the last set of joint positions in the planned waypoints is first transformed into the end-effector coordinates $(x, y, z)$ in the world frame using the following equation:

$${}^{0}T_{E}(q) = {}^{0}T_{1}(q)\,{}^{1}T_{2}(q)\cdots{}^{6}T_{E}(q) = \begin{bmatrix} R & t \\ 0\;0\;0 & 1 \end{bmatrix} \tag{2}$$

where ${}^{i-1}T_{i}$ is the transformation matrix of frame $i$ expressed in frame $i-1$, $q$ represents the planned path (the expected joint angles), and the end-effector coordinates $(x, y, z)$ are contained in $t$. The BEV action point images are then obtained via the coordinate transformation between the camera frame and the end-effector frame.
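A compact PyTorch sketch of the feature-level fusion in the MFP is shown below. Only the overall structure (parallel ImgCNN/MaskCNN/PointCNN branches, concatenation, and a fully connected head producing a success/failure prediction) follows the description above; the branch depths, feature sizes, and the reuse of one small CNN template per modality are assumptions for illustration.

```python
import torch
import torch.nn as nn

def small_cnn(in_ch):
    """Placeholder convolutional branch; the real layer sizes are not claimed."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),   # 32 * 4 * 4 = 512 features
    )

class MFP(nn.Module):
    """Multimodal failure predictor sketch: parallel branches for the faulty
    RGB image, the masked image, and the projected pick/place point images
    (from Waypoints2Img), fused by concatenation before an FC head."""
    def __init__(self):
        super().__init__()
        self.img_cnn = small_cnn(3)      # faulty RGB image
        self.mask_cnn = small_cnn(3)     # masked image from object detection
        self.pick_cnn = small_cnn(1)     # projected pick action point image
        self.place_cnn = small_cnn(1)    # projected place action point image
        self.head = nn.Sequential(nn.Linear(4 * 512, 128), nn.ReLU(),
                                  nn.Linear(128, 2))  # success / failure logits

    def forward(self, img, mask, pick_img, place_img):
        z = torch.cat([self.img_cnn(img), self.mask_cnn(mask),
                       self.pick_cnn(pick_img), self.place_cnn(place_img)], dim=1)
        return self.head(z)

# Dummy usage with 480x640 inputs:
logits = MFP()(torch.rand(1, 3, 480, 640), torch.rand(1, 3, 480, 640),
               torch.rand(1, 1, 480, 640), torch.rand(1, 1, 480, 640))
```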
IV. EXPERIMENTS
We collected 4000 real-world samples, with each fault type injected equally often. Table II shows the failure distribution under the given fault parameter configuration. To test the performance of our framework on different fault types, we built three datasets: Blur, Noise, and Combination. Each dataset is split into train (80%) and test (20%) sets.
A. Baselines and Evaluation Metrics
To the best of our knowledge, multimodal proactive failure detection methods have been widely explored and are well
TABLE II
FAILURE DISTRIBUTION

Fault type    #Success / #Failure
Blur          1725 / 275
Noise         1320 / 680
established in the navigation tasks of mobile robots and autonomous vehicles, whereas failure detection for manipulators mostly remains reactive. Our work, in contrast, considers the error propagation problem from faults to manipulation failures. Against this background, we compare the performance of the proposed framework with the following baseline methods:
• Cui et al.: A multimodal convolutional neural network that takes RGB images and the robot's actions (velocities, accelerations, etc.) as input and predicts the autonomous vehicle's behavior. For the comparison, we only replace the original vehicle actions with the manipulator's last waypoint from the two planned actions (pick and place).
• PAAD: A multimodal fusion framework that fuses trajectory images, RGB images, and LiDAR observations. Through feature-level fusion, the PAAD network proactively predicts the future probabilities of failure along a sequence of trajectory points for mobile field robots. In our case, we replace the trajectory image with the pick and place point images, replace the LiDAR observations with masked images, and set the prediction horizon to one (either success or failure in one execution round).
• FINO-Net: A reactive failure detection framework based on multimodal sensory observations for various manipulation tasks. FINO-Net takes RGB images, depth images, and audio readings as input and detects failures during execution. To enable proactive failure prediction, we replace the depth images with masked images, and its audio branch processes the planned waypoint data of the pick and place actions instead.
Note that FINO-Net also considers the temporal correlations among the input frames/audio via a ConvLSTM module to achieve consecutive failure detection. In our case, however, we map a discrete input to a discrete execution result: a faulty image and its subsequent impact are treated as an independent event within one execution round, with no influence on other rounds. As such, we remove the recurrent module of FINO-Net in the experiments. For a fair comparison, we adopt the evaluation metrics used by the above methods: Precision, Recall, and F1-score.
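For reference, the three metrics for this binary task can be computed as follows; treating 'failure' as the positive class is an assumption about the labeling convention, and the label values are illustrative only.

```python
from sklearn.metrics import precision_recall_fscore_support

# Per-round labels, with 1 = failure (positive class); values are illustrative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1)
print(f"Precision {precision:.2f}  Recall {recall:.2f}  F1 {f1:.2f}")
```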
B. Experimental Results
Fig. 6 and Table III report the failure detection performance of the different models. In the horizontal comparison of detection performance across datasets, all models show a considerably higher F1-score on the Blur dataset than on the Noise dataset. This means that, under the fault parameter configuration defined in Table I, failures caused by blur faults are more likely to be detected. We argue that the potential reason is the different degree of data structuring in the two datasets. In our case study, a severely blurred image often results in the cube not being detected (as shown in Fig. 3); the manipulator then moves to a manually predefined pick location for safety reasons, and as a result the pick target is missed. In the Blur dataset, both the object detection results
Fig. 6. Confusion matrices on the Blur, Noise, and Combination datasets for a) FINO-Net, b) PAAD, c) Cui et al., and d) ours.
TABLE III
FAILURE DETECTION PERFORMANCE WITH DIFFERENT METHODS

Model        | Blur (Precision / Recall / F1)  | Noise (Precision / Recall / F1) | Combination (Precision / Recall / F1) | Inf. Time (msec)
FINO-Net     | 94.52 / 99.40 / 96.90           | 84.19 / 81.92 / 83.04           | 60.17 / 98.90 / 74.82                 | 334.37
PAAD         | 100.00 / 95.86 / 97.88          | 79.05 / 89.69 / 84.03           | 96.33 / 89.61 / 92.85                 | 175.60
Cui et al.   | 98.85 / 98.56 / 98.70           | 67.59 / 92.43 / 78.08           | 92.50 / 86.85 / 89.59                 | 105.54
Our method   | 100.00 / 98.02 / 99.00          | 87.75 / 87.75 / 87.75           | 94.50 / 93.72 / 94.11                 | 13.38
and planned paths are relatively more structured and more distinguishable, whereas the Noise dataset is less structured, making the prediction relatively more difficult. On the Combination dataset, FINO-Net shows a decrease in F1-score compared to the Noise dataset, while the other three models show an increase; this suggests that the latter three models adapt better to the more complex dataset.
In terms of the vertical comparison among all models, our proposed framework outperforms the other state-of-the-art methods with a slightly superior F1-score. On the Blur dataset, our model achieves the best Precision (on par with PAAD), which means the proposed framework is better able to identify faults that do not necessarily cause task failures. From the efficiency perspective, our model is thus more capable of avoiding unnecessary fault alarms than the other baselines, whereas FINO-Net has the best Recall and detects more failures. In general, our model leads on F1-score and clearly outperforms the other two proactive failure detection models (Cui et al. and PAAD) in overall performance. On the Noise dataset, Cui et al. achieves the highest Recall but the lowest Precision, so it raises the most false alarms; PAAD shows the opposite tendency, achieving higher Precision than Cui et al. while missing more true failures. We argue that it is difficult for the aforementioned models to trade off the false-positive and false-negative rates, which decreases either working efficiency or safety. FINO-Net also achieves a competitive F1-score but has a relatively higher false-positive rate than our model. The follow-up experiments are conducted on the more complex Combination dataset, where our model achieves the best F1-score. Although FINO-Net attains a high Recall, it is less able to identify faults that do not lead to an execution failure, as reflected by its low Precision. Note that, compared with the original FINO-Net and PAAD designs, our framework does not need extra sensors (audio or LiDAR) to estimate the probability of failure; we obtain good results with a simpler setup.
Besides the models' detection performance, we also investigate the offline inference time of all models on the Combination dataset. We argue that a shorter inference time is beneficial for online failure prediction, and the offline inference speed can serve as a reference for online processing. In the experiment, the batch size is 64 and the inference time is measured per batch. The experiment is repeated 100 times on an Nvidia GeForce RTX 3050 Laptop GPU and the average inference time
TABLE IV
ABLATION STUDY RESULTS

Model                  Precision   Recall   F1-score
RGB image only         93.17       90.02    91.56
RGB + Masked image     91.67       93.86    92.75
RGB + Point image      93.17       94.27    93.71
w/o Waypoints2Img      93.17       92.24    92.70
Ours                   94.50       93.72    94.11
is calculated. As shown in Table III, our model is substantially faster than the other three baseline models.
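The offline timing procedure described above can be reproduced with a sketch like the following; it assumes PyTorch models on a CUDA GPU, and the warm-up loop is an added assumption not taken from the paper.

```python
import time
import torch

def mean_inference_time_ms(model, batch_inputs, n_runs=100, device="cuda"):
    """Average per-batch inference time over n_runs forward passes."""
    model = model.to(device).eval()
    batch_inputs = [x.to(device) for x in batch_inputs]
    with torch.no_grad():
        for _ in range(10):                 # warm-up (assumed, not from the paper)
            model(*batch_inputs)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model(*batch_inputs)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs * 1000.0

# Example: a batch of 64 samples, as in the paper's measurement setup.
# t_ms = mean_inference_time_ms(MFP(), [torch.rand(64, 3, 480, 640), ...])
```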
C. Ablation Study
An ablation study is further conducted to investigate the contribution of the different components of our model. We choose the Combination dataset as the reference; the ablated versions of the proposed model are: 1) RGB image only: only the faulty image branch is used to extract the observation features; 2) RGB + Masked image: only the ImgCNN and MaskCNN pipelines are used; 3) RGB + Point image: only the ImgCNN and PointCNN pipelines are used; 4) w/o Waypoints2Img: the PointCNN module is replaced with a fully connected layer that processes the planned waypoints directly. Table IV summarizes the ablation study results. The results indicate that each design choice has a positive effect on achieving a higher F1-score. Note that our model outperforms the model without the Waypoints2Img module on all metrics, which indicates that using action point images as input instead of the 'noisy' waypoint data contributes to better overall performance.
V. CONCLUSION
In this work, we presented a novel multimodal framework based on a deep neural network to implement proactive failure detection for a vision-based manipulation task. Through feature-level fusion of object detection results and planned paths, our method can successfully evaluate the probability of execution failure when a degraded image is received. The proposed framework achieved a higher F1-score on each dataset compared with the other state-of-the-art methods. In addition, our framework requires fewer sensors and has a faster inference speed.
The next step is to extend the current FAULT-to-FAILURE dataset by introducing more types of faults. At the same time, the generalization performance of the model is also an interesting question for us. Due to the rarity of real-world failure samples, especially in safety-critical applications, we believe that it is both meaningful and challenging to test and improve failure detection in other complex scenarios.
REFERENCES
[1] Baheti, Radhakisan, and Helen Gill. "Cyber-physical systems." The Impact of Control Technology 12.1 (2011): 161-166.
[2] Luo, Yuan, et al. "Deep learning-based anomaly detection in cyber-physical systems: Progress and opportunities." ACM Computing Surveys (CSUR) 54.5 (2021): 1-36.
[3] Kaplan, Stanley, and B. John Garrick. "On the quantitative definition of risk." Risk Analysis 1.1 (1981): 11-27.
[4] Avizienis, Algirdas, et al. "Basic concepts and taxonomy of dependable and secure computing." IEEE Transactions on Dependable and Secure Computing 1.1 (2004): 11-33.
[5] Pei, Y., Y. Huang, Q. Zou, X. Zhang, and S. Wang. "Effects of image degradation and degradation removal to CNN-based image classification." IEEE Transactions on Pattern Analysis and Machine Intelligence 43.4 (2021): 1239-1253, doi: 10.1109/TPAMI.2019.2950923.
[6] Favier, Anthony, et al. "A hierarchical fault tolerant architecture for an autonomous robot." 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W). IEEE, 2020.
[7] Hsiao, Yu-Shun, et al. "MAVFI: An end-to-end fault analysis framework with anomaly detection and recovery for micro aerial vehicles." 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2023.
[8] Alemzadeh, Homa, et al. "Targeted attacks on teleoperated surgical robots: Dynamic model-based detection and mitigation." 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE, 2016.
[9] Li, Xiao, et al. "A hardware-in-the-loop simulator for safety training in robotic surgery." 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2016.
[10] Ma, Yuliang, Philipp Grimmeisen, and Andrey Morozov. "Case study: ROS-based fault injection for risk analysis of robotic manipulator." 2023 IEEE 19th International Conference on Automation Science and Engineering (CASE). IEEE, 2023.
[11] Inceoglu, Arda, et al. "FINO-Net: A deep multimodal sensor fusion framework for manipulation failure detection." 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021.
[12] Inceoglu, Arda, Eren Erdal Aksoy, and Sanem Sariel. "Multimodal detection and classification of robot manipulation failures." IEEE Robotics and Automation Letters (2023).
[13] Park, Daehyung, Yuuna Hoshi, and Charles C. Kemp. "A multimodal anomaly detector for robot-assisted feeding using an LSTM-based variational autoencoder." IEEE Robotics and Automation Letters 3.3 (2018): 1544-1551.
[14] Park, Daehyung, Hokeun Kim, and Charles C. Kemp. "Multimodal anomaly detection for assistive robots." Autonomous Robots 43 (2019): 611-629.
[15] Thoduka, Santosh, Juergen Gall, and Paul G. Plöger. "Using visual anomaly detection for task execution monitoring." 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021.
[16] Gohil, Priteshkumar, Santosh Thoduka, and Paul G. Plöger. "Sensor fusion and multimodal learning for robotic grasp verification using neural networks." 2022 26th International Conference on Pattern Recognition (ICPR). IEEE, 2022.
[17] Ji, Tianchen, et al. "Proactive anomaly detection for robot navigation with multi-sensor fusion." IEEE Robotics and Automation Letters 7.2 (2022): 4975-4982.
[18] Kahn, Gregory, Pieter Abbeel, and Sergey Levine. "LaND: Learning to navigate from disengagements." IEEE Robotics and Automation Letters 6.2 (2021): 1872-1879.
[19] Kahn, Gregory, Pieter Abbeel, and Sergey Levine. "BADGR: An autonomous self-supervised learning-based navigation system." IEEE Robotics and Automation Letters 6.2 (2021): 1312-1319.