A Multimodal Perception System for Detection of
Human Operators in Robotic Work Cells*
Marco Costanzo, Giuseppe De Maria, Gaetano Lettera, Ciro Natale and Dario Perrone
Abstract: Workspace monitoring is a critical hw/sw component of modern industrial work cells or in service robotics scenarios, where human operators share their workspace with robots. Reliability of human detection is a major requirement not only for safety purposes but also to avoid unnecessary robot stops or slowdowns in case of false positives. The present paper introduces a novel multimodal perception system for human tracking in shared workspaces based on the fusion of depth and thermal images. A machine learning approach is pursued to achieve reliable detection performance in multi-robot collaborative systems. Robust experimental results are finally demonstrated on a real robotic work cell.
I. INTRODUCTION
The paper proposes a sensor fusion strategy which combines depth and thermal images to robustly detect human operators, e.g., in industrial work cells or in professional service robotics scenarios, and realize safe human-robot collaboration (HRC) tasks. The main focus of the current safety regulations is operator safety during industrial robotic operations. The safety standards for these applications are laid out by the International Organization for Standardization (ISO) 10218-1 [1], 10218-2 [2] and by the ISO Technical Specification (TS) 15066 [3]. Four types of collaborative scenarios are identified, which are addressed in post-collision and pre-collision schemes [4]. Industrial safety requirements do not permit the use of post-collision systems, because the physical impact between the robot and the human operator occurs before the complete stop of the machinery. In contrast, a pre-collision scheme makes use of appropriate exteroceptive sensors to detect humans and prevent collisions. A Speed and Separation Monitoring (SSM) scenario requires that the robot speed be monitored according to the separation distance between the robot and the human operator. In this paper, the SSM scenario has been selected with the aim of maximizing the production time in industrial work cells or in any professional service task while preserving human safety at the same time.
However, it combines two uncertain worlds: the distance monitoring, which requires an accurate and robust human detection algorithm, and the robot speed monitoring, which should be reactive and efficient.

*This work was supported by the European Commission under the H2020 Framework Cleansky 2 JTI – project LABOR, ID n. 785419 https://www.labor-project.eu.
M. Costanzo, G. De Maria, G. Lettera, C. Natale and D. Perrone are with Dipartimento di Ingegneria, Università degli Studi della Campania Luigi Vanvitelli, 81031 Aversa, Italy. E-mail: marco.costanzo@unicampania.it, giuseppe.demaria@unicampania.it, gaetano.lettera@unicampania.it, ciro.natale@unicampania.it, dario.perrone1@studenti.unicampania.it
© 2019 IEEE. DOI: https://doi.org/10.1109/SMC.2019.8914519
Distance monitoring can be solved through motion capture systems, range sensors or artificial vision systems [5]. Nevertheless, localizing human operators robustly is not an easy task. It is often necessary to fuse several sensors with different properties. Originally developed for military applications, thermal cameras provide relevant data to perform breast cancer diagnosis [6], infrastructure and electrical systems monitoring, gas or liquid detection [7], and inspection and control tasks in industrial applications [8]. Thermal cameras are ideal for finding objects of a certain temperature: human detection and tracking (HDT) fits this case well, as the body temperature is about 37 °C. Unfortunately, thermal cameras do not provide depth information, which is necessary to correctly compute the separation distance between the operator and the robot and apply the current regulations. The main difficulty of fusing spatial and thermal images is that a correspondence between the pixels of the two images needs to be found. Similar sensor fusion approaches for indoor human detection combine RGB data with depth information [9], using the Histogram of Oriented Gradients (HOG) proposed in [10] together with a depth feature that describes the self-similarity of the image. Different strategies are based on Convolutional Neural Networks (CNNs), widely used for object recognition [11] and human detection [12]. A CNN-based RGB-D human detector exploiting the depth information to develop a region of interest (ROI) selection method is proposed in [13]. The fusion of thermal and spatial information has gained attention in the last few years, especially in fields where the spatial data are used as the main source of information [14], but nowadays there are no standardized methods to robustly combine them.
State-of-the-Art (SoA) publications propose several methods to deal with HRC safety: a volumetric representation of the areas occupied by operators and by the robot has been studied in [15] to stop the robot when these areas overlap; similarly, [16] proposes a potential field method to generate a collision-free path. A further approach is presented in [17], where a safety index is modeled to modify the robot trajectories and preserve the cooperative task. Many of these approaches rely on evasive actions to increase safety. However, in industrial settings, it is generally recommended to follow the robot's predefined path without deviations.
This research paper tackles the HRC problem by introducing a novel approach to robustly detect human operators in collaborative work cells through a multimodal perception system aimed at minimizing false positives to avoid unnecessary robot stops. The approach guarantees human safety in general multi-robot scenarios. In fact, the algorithm allows computing the minimum separation distance between every human operator and every robot within the collaborative work cell, following the line of the current regulations. The applicability of the approach in the manufacturing industry has been achieved not by modifying the robot's predefined path but by scaling down the robot trajectory only when indispensable, thus maximizing the production time even in the presence of humans.
II. HUMAN-ROBOT SEPARATION DISTANCE
This section proposes a novel point-cloud based methodology to compute the separation distance between a moving robot and the closest human operator. Strong emphasis has been devoted to the proposed HDT strategy, which provides a robust algorithm to correctly detect human operators in an HRC scenario in real time. According to the current ISO standards, the robot speed must be modulated, slowing down the pre-programmed trajectory when a dangerous situation for the human operator occurs.
A. Experimental setup and camera calibration
Nowadays, thermal imaging provides relevant data for specific robotic applications that require thermal information. When combining two or more sources of acquisition, the resulting multi-sensor system has to be extrinsically calibrated to find the relative pose between the adopted sensors. This step can be performed by using a calibration target. This section explains the two methods developed for camera calibration. The objective is to obtain the pose of the two cameras as accurately as possible, both for the success of the subsequent merging step and for the final separation distance computation. Two steps have been necessary.
The first one reliably localizes the pose of the depth camera with respect to the robot base frame. In the literature, this problem is solved by different calibration procedures, especially for object recognition applications. Their typical target is to recognize objects located at about 0.5 m from the camera. On the contrary, in the proposed experimental application, the robot and the operators work at about 2.5 m from the camera. To obtain the desired accuracy, the novelty of the depth camera extrinsic calibration procedure consists in exploiting sphere tracking, as detailed below.
The experimental setup of this work is shown in Fig. 1 and consists of two cameras rigidly attached to each other. They have been arranged so that their optical axes are aligned. The adopted cameras have different fields of view (FOVs), which implies that some depth pixels (Microsoft Kinect v1, focal length: 6.1 mm, FOV: 57°×45°, image size: 640×480) fall outside the thermal image (Optris PI 450, focal length: 15 mm, FOV: 38°×29°, spectrum: 7.5 to 13 µm, image size: 382×288) and are not used in the merging step (see Section II-C.1). The goal of the extrinsic calibration is to obtain an accurate identification of the camera poses, which guarantees the minimum error when the two camera views are merged.
Fig. 1. The experimental perception system composed of a depth camera
(Microsoft Kinect v1) and a thermal camera (Optris PI 450).
A 3D tracking technique has been developed to calibrate the depth sensor, by tracking a polystyrene sphere of 0.12 m diameter. The red sphere has been mounted at the robot end effector, so as to match the center of the sphere with the end-effector frame origin. The developed procedure uses the M-estimator SAmple Consensus (MSAC) algorithm [18] (an extension of the better known RANdom SAmple Consensus (RANSAC) algorithm [19]) to find a sphere which satisfies a radius constraint and provides its geometric model. The robot has been positioned at specific configurations, which allow the target to be correctly distinguished within the camera view. From the robot joint states, the forward kinematics immediately computes the position of the red sphere. At the same time, the developed procedure acquires the depth image, converts it into point-cloud data (PCD) [20] and estimates the target model. The method is iterated so as to cover the entire collaborative workspace and to minimize the calibration error. Finally, the transformation matrix T^r_d between the robot frame Σ_r and the depth camera frame Σ_d has been evaluated through the optimization of a cost function that combines the corresponding data.
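As an illustration, the following is a minimal sketch of how such a pose could be estimated once corresponding sphere centers are available in the two frames (from MSAC sphere fitting in the depth frame and from the forward kinematics in the robot frame). A closed-form least-squares fit (Kabsch/Umeyama) is used here as one possible cost-function minimization; the paper does not specify the optimizer, so the function and variable names are illustrative.

```python
# Minimal sketch (not the authors' exact routine): estimate T^r_d from
# corresponding 3D sphere centers expressed in the depth-camera frame
# (MSAC fit) and in the robot frame (forward kinematics).
import numpy as np

def estimate_rigid_transform(p_cam, p_rob):
    """Return a 4x4 matrix T such that p_rob ~= T @ [p_cam; 1].

    p_cam, p_rob: (N, 3) arrays of corresponding sphere centers.
    """
    c_cam = p_cam.mean(axis=0)
    c_rob = p_rob.mean(axis=0)
    H = (p_cam - c_cam).T @ (p_rob - c_rob)      # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                     # enforce a proper rotation
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = c_rob - R @ c_cam
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

# usage sketch:
# T_r_d = estimate_rigid_transform(centers_depth, centers_robot)
# residuals = np.linalg.norm(
#     centers_robot - (centers_depth @ T_r_d[:3, :3].T + T_r_d[:3, 3]), axis=1)
```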
After the depth camera calibration, the extrinsic calibration of the thermal camera with respect to the depth camera has been solved. In [21] and [22] a thermal camera and a depth camera are calibrated by using a perforated grid placed in front of the sensors. Those procedures assume that the target is located close to the sensor lenses, since the objective is to practically solve the calibration, but the approach proved unsuitable for the application scenario at hand. The adopted solution consists in using three spheres attached to a flat cardboard support and heated so as to be distinguishable by both the depth and the thermal cameras.
To obtain an estimation of the transformation matrix T^d_t between the depth camera frame Σ_d and the thermal camera frame Σ_t, the spheres have been moved inside the collaborative workspace by placing the support in 10 configurations at distances from the camera in the range where the human operator is expected to act during the collaborative task. At every acquisition, the calibration target has been suitably heated to be detectable by both cameras. The coordinates p^d_k = [x^d_k  y^d_k  z^d_k]^T of the k-th center of the target sphere have been directly calculated from the depth image, while the corresponding thermal point coordinates have been calculated from the thermal image, assuming the distance from the lens equal to the depth value, i.e., z^t_k = z^d_k and

x^t_k = (a_k - c_{xt}) z^t_k / f_{xt},    (1)
y^t_k = (b_k - c_{yt}) z^t_k / f_{yt},    (2)

where a_k and b_k are the pixel coordinates of the sphere center in the thermal image, c_{xt}, c_{yt} are the pixel coordinates of the thermal image center and f_{xt}, f_{yt} are the focal lengths expressed in pixel-related units. Finally, the transformation matrix T^d_t has been estimated by minimizing a cost function that combines the corresponding data.
Note that the intrinsic calibrations have been performed by using common patterns, i.e., a chessboard pattern for the depth camera and a heated circular-grid pattern for the thermal camera. These procedures return the intrinsic calibration matrices of the cameras, i.e., the parameters c_{xt}, c_{xd}, c_{yt}, c_{yd}, f_{xt}, f_{xd}, f_{yt}, f_{yd} (the index d refers to the depth camera), which are needed to compute the PCD and (1)-(2).
B. Segmentation pipeline
The basic assumption of the proposed segmentation algorithm (blue pipeline in Fig. 2) is to process exclusively the information related to the dynamic objects present in the observed scene. Since any point-cloud based strategy is a computationally heavy operation, a Background Segmentation step has been initially adopted to subtract the static environment.
Section II-A describes the experimental setup in which the
cameras monitor the surroundings of the manipulator and the
robot kinematic chain is fully visible, as shown in Fig. 2A.
While the collaborative workspace is observed, the robot
executes its task, thus becoming a dynamic entity. Therefore,
the package Real-time URDF Filter [23] has been integrated
at the beginning of the pipeline to distinguish the depth pixels
belonging to the robot model from those belonging to other
dynamic entities and assign them a Not-a-Number (NaN)
value, see Fig. 2B.
The background filtering has been developed through an efficient algorithm that performs the subtraction of a stored background at pixel level: 50 frames of the static background are initially captured and the mean value of each pixel is stored in a memory area. At every acquisition, the static frame is subtracted from the current frame, as shown in Fig. 2C. The depth image is then converted to PCD and a uniform sampling filter is applied to make the algorithm more reactive by reducing the cloud density. Finally, the detection of dynamic entities is executed through a PCD Clustering step, which processes the point-cloud scene and provides as many clusters as dynamic areas detected in the foreground. The Euclidean cluster extraction method is performed to distinguish all the clusters in the collaborative workspace. Figure 2D shows two detected dynamic entities visualized in RViz together with the robot model. To compensate for the sensor measurement noise, which could sometimes produce false clusters, a first constraint is enforced by defining a minimum cardinality that the foreground areas should have in order to be large enough to represent a human entity. However, the final decision on whether a cluster is a real human entity is made by the novel HDT algorithm described in Section II-C.
The Human Validation step waits for the cluster check from
the HDT pipeline, which executes the human detection as
explained in Section II-C.
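A minimal sketch of this segmentation step is given below, assuming depth frames are available as NumPy arrays with robot pixels already set to NaN by the URDF filter. DBSCAN is used as a stand-in for PCL's Euclidean cluster extraction, and all thresholds and names are illustrative.

```python
# Minimal sketch of the background segmentation of Section II-B.
import numpy as np
from sklearn.cluster import DBSCAN

def build_background(frames):
    """Average the 50 static frames pixel-by-pixel (ignoring NaNs)."""
    return np.nanmean(np.stack(frames), axis=0)

def dynamic_clusters(depth, background, intrinsics, tol=0.10,
                     cluster_dist=0.08, min_points=300):
    fx, fy, cx, cy = intrinsics
    # foreground = pixels that deviate from the stored background
    mask = np.abs(depth - background) > tol
    mask &= np.isfinite(depth)
    v, u = np.nonzero(mask)
    z = depth[v, u]
    # back-project foreground pixels to a point cloud, cf. (3)-(4)
    pts = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=1)
    pts = pts[::4]                      # uniform sub-sampling for speed
    labels = DBSCAN(eps=cluster_dist, min_samples=10).fit_predict(pts)
    clusters = [pts[labels == k] for k in set(labels) if k != -1]
    # discard clusters too small to represent a human entity
    return [c for c in clusters if len(c) >= min_points]
```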
C. Human Detection and Tracking
The proposed human detection approach makes use of a Convolutional Neural Network (CNN), whose innovative input is obtained by combining depth and thermal images. The two sources of information have been processed through a simple and intuitive pixel-by-pixel technique, never presented elsewhere. This choice prevented the use of pre-existing data-sets and pre-trained CNNs, so the CNN had to be trained with a novel, multi-sensory data-set.
A depth camera is suitable for different kinds of environments because of its adaptability to different lighting conditions: in this work it is adopted to compute the minimum distance between the human operator and the robot in order to apply the regulations, but a depth image also allows distinguishing and localizing human surface shapes and their volume, so it is appropriate for the HDT problem. On the other hand, the thermal camera distinguishes temperatures and is ideal for finding objects of a specific temperature, such as the human body, whose temperature is about 37 °C.
CNN solutions which process only thermal images can often be insufficient in applications where large hot objects, e.g., tools used in the manual task, or, more generally, small temperature gradients are present. On the contrary, CNNs trained to detect human shapes in pure depth images can confuse human operators with objects of similar shape, e.g., a plastic mannequin, a coat rack or a lamp. Therefore, merging depth and thermal images makes the HDT approach more robust, avoiding false positives and making correct decisions about human classification. Moreover, the proposed CNN strategy also allows correctly localizing the human operators in the observed scene. This information is then sent to the segmentation pipeline to select the human cluster closest to the robot and continue the computation of the separation distance from the manipulator. In these terms, thermal imaging is used as information complementary to the spatial data and represents a vision strategy that goes beyond SoA human detection approaches based on background subtraction.
1) Depth-Thermal mapping: The extrinsic calibrations explained in Section II-A are a first step towards a correct mapping, that is, finding matches between the depth image and the thermal image. Since the adopted cameras have different FOVs and resolutions, the resulting map size must correspond to the smaller one. According to the experimental setup shown in Fig. 1, the mapping step builds a 382×288 matrix, the size of the thermal image, as shown in Fig. 3.

Fig. 2. Implemented human detection and tracking pipeline: the background segmentation (first three blue labels) processes the depth image to subtract the static environment and to highlight only dynamic shapes. At the same time, the depth image and the corresponding thermal image are merged into a single RGB channel (green and red labels, respectively). The obtained images are then combined through a mapping matrix to reliably localize workers (orange labels) and distinguish them from non-human clusters (human validation step); eventually, the minimum distance between the closest human operator and the robot is computed.

Fig. 3. The algorithm for mapping depth and thermal images finds pixel-by-pixel matches between the two images: the result is a 382×288 matrix.
The mapping step has been solved through a pixel-by-pixel procedure: the pixel of the depth image with indices (m, n) contains the depth value z^d_{m,n}, which can be used to compute the corresponding Cartesian point coordinates p^d_{m,n} = [x^d_{m,n}  y^d_{m,n}  z^d_{m,n}]^T, similarly to (1)-(2),

x^d_{m,n} = (m - c_{xd}) z^d_{m,n} / f_{xd},    (3)
y^d_{m,n} = (n - c_{yd}) z^d_{m,n} / f_{yd}.    (4)

The Cartesian point is then expressed with reference to the thermal camera frame through the relation

[p^t_{m,n}; 1] = T^t_d [p^d_{m,n}; 1].    (5)

Using the intrinsic parameters of the thermal camera, the pixel indices (a, b) of the point p^t_{m,n} in the thermal image are finally computed by inverting (1)-(2). If they are contained in the FOV of the thermal image, the corresponding depth pixel indices (m, n) are written into the mapping matrix at the indices (a, b); otherwise, they are discarded because they fall outside the mapping image size. Note that, if the observed object is far enough, the mapping matrix does not depend on the distance. Thus the mapping matrix can be computed offline by using a fixed value of z^d_{m,n} compatible with the working area, hence saving computational load.
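A minimal sketch of the offline construction of the mapping matrix, following (3)-(5) and the inversion of (1)-(2), is given below. The intrinsic parameters and the extrinsic matrix T^t_d are assumed to come from the calibration of Section II-A; the fixed depth z0 and all names are illustrative.

```python
# Minimal sketch of the offline depth-to-thermal mapping matrix.
import numpy as np

def build_mapping_matrix(depth_size, thermal_size, K_d, K_t, T_t_d, z0=2.5):
    """Return an (Ht, Wt, 2) array storing, for each thermal pixel (a, b),
    the corresponding depth-pixel indices (m, n), or -1 if unmapped."""
    (Wd, Hd), (Wt, Ht) = depth_size, thermal_size
    fxd, fyd, cxd, cyd = K_d
    fxt, fyt, cxt, cyt = K_t
    mapping = -np.ones((Ht, Wt, 2), dtype=int)
    for n in range(Hd):
        for m in range(Wd):
            # (3)-(4): back-project the depth pixel at the fixed depth z0
            p_d = np.array([(m - cxd) * z0 / fxd, (n - cyd) * z0 / fyd, z0, 1.0])
            # (5): express the point in the thermal camera frame
            x, y, z, _ = T_t_d @ p_d
            # invert (1)-(2): project into the thermal image plane
            a = int(round(fxt * x / z + cxt))
            b = int(round(fyt * y / z + cyt))
            if 0 <= a < Wt and 0 <= b < Ht:
                mapping[b, a] = (m, n)   # pixels outside the FOV are discarded
    return mapping
```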
2) Sensor fusion: Multimodal sensor images can be combined through different image fusion techniques, which work at different merging levels: pixel, signal, feature or symbol level. This paper provides an image fusion algorithm at the pixel level, but it represents a novel approach with respect to the most widely used pixel-level image fusion algorithms [24], which never merge depth and thermal information.

The first requirement of the proposed sensor fusion approach is to preserve all valid and useful information from the two sources to be combined, while not introducing distortions. For the purpose of this work, the depth image and the corresponding thermal image have been merged to provide an enhanced single view of the scene with extended information content, through the mapping matrix of Section II-C.1. The proposed approach, referred to as the RGB Mapping Approach (RGB-MA), consists in defining the intensities of empty RGB channels. This is also because, to the best of the authors' knowledge, CNNs work better with RGB images. The strength of RGB-MA is that it assigns the same priority to the input sources. The result is no longer a grayscale image, as a depth image alone would be, or a weighted-average image which assigns different priorities to the sources, but an RGB image where the depth data have been mapped on the green channel (see Fig. 2H) and the temperature values have been mapped on the red channel (see Fig. 2K). Specifically, the original depth sensor value, s^d, and the corresponding temperature sensor value, s^t, have to be normalized into the interval [0, 1] (see Fig. 2G and J). To do this, minimum and maximum variability ranges of the source values have been defined for both the thermal, min_t, max_t, and the depth, min_d, max_d, cameras. They do not correspond to the ranges in the sensor technical specifications, but they have been chosen according to the values detectable in the considered workspace. More in detail, the detectable depth values lie between 0.30 m and 4.0 m, while the detectable temperature values are within the range [0, 50] °C, which is suitable for any type of human detection task.
The color information inserted into the specific channel of the (i, j)-th pixel of the output image must be mapped to 8 bits. The R (red) value is computed by acquiring s^t_{i,j} from the thermal image as

R_{i,j} = round( 255 (s^t_{i,j} - min_t) / (max_t - min_t) );    (6)

the G (green) value is computed by acquiring s^d_{m,n} from the depth image, where m and n are contained in the (i, j)-th value of the mapping matrix (Section II-C.1),

G_{i,j} = round( 255 (s^d_{m,n} - min_d) / (max_d - min_d) );    (7)

the B (blue) value of the resulting image is always zero.
The result is shown in Fig. 2L. Note that the proposed image fusion technique leaves a free channel that could be used for a further input source. Section IV reports the results of the approach for a typical SSM scenario.
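A minimal sketch of the RGB-MA fusion of (6)-(7) is given below, assuming the mapping matrix of the previous step and the variability ranges reported in the text; missing depth values are handled here by treating them as the maximum range, which is an assumption not stated in the paper.

```python
# Minimal sketch of the RGB Mapping Approach (RGB-MA) of (6)-(7):
# thermal values fill the red channel, mapped depth values fill the
# green channel, blue stays at zero.
import numpy as np

def fuse_rgb_ma(depth, thermal, mapping, d_range=(0.30, 4.0), t_range=(0.0, 50.0)):
    Ht, Wt = thermal.shape
    rgb = np.zeros((Ht, Wt, 3), dtype=np.uint8)
    min_t, max_t = t_range
    min_d, max_d = d_range
    # (6): red channel from normalized temperatures
    rgb[..., 0] = np.round(255 * np.clip((thermal - min_t) / (max_t - min_t), 0, 1))
    # (7): green channel from normalized depths, via the mapping matrix
    valid = mapping[..., 0] >= 0
    m, n = mapping[..., 0][valid], mapping[..., 1][valid]
    s_d = np.nan_to_num(depth[n, m], nan=max_d)   # assume missing depth = far
    rgb[..., 1][valid] = np.round(255 * np.clip((s_d - min_d) / (max_d - min_d), 0, 1))
    return rgb
```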
3) CNN for Human Detection: To assess the effectiveness of the fused images of Section II-C.2, YOLOv3 [25] has been used for real-time human detection. The selected framework is an off-the-shelf SoA 2D object detector pre-trained on ImageNet [26] and fine-tuned on the MS-COCO [27] data-set. It is an extremely fast and accurate object detection system, designed to detect semantic objects of a certain class, e.g., humans, buildings and cars, in RGB images. Nowadays, there are no neural networks trained on combined images such as those proposed by this paper, so the YOLOv3 CNN model has been re-trained to adapt the detection system to the Depth-Thermal (D-T) images. The following steps have been executed:
- definition of a Human class;
- exclusion of the YOLOv3 pre-trained classes from the prediction;
- building of the training data-set by acquiring frames from the D-T video stream;
- manual labelling of each frame;
- retraining of the YOLOv3 CNN weights.
After the training step, the CNN has been applied to the real-time D-T video stream to obtain the human prediction. A bounding box is estimated around each detected human and its coordinates are finally sent to the point-cloud pipeline, as shown in Fig. 2M. Note that both the training and the prediction processes have a high computational cost and have been executed on a suitable GPU (NVIDIA Titan V).
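As an illustration, a minimal inference sketch over fused D-T frames using OpenCV's DNN module is shown below. The configuration and weight file names are hypothetical placeholders for the retrained single-class model, the thresholds are illustrative, and the paper does not state which inference runtime was actually used.

```python
# Minimal sketch: running a retrained single-class YOLOv3 on a fused D-T frame.
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("dt_yolov3.cfg", "dt_yolov3.weights")  # hypothetical files
layer_names = net.getLayerNames()
out_layers = [layer_names[i - 1] for i in net.getUnconnectedOutLayers().flatten()]

def detect_humans(dt_image, conf_thr=0.5, nms_thr=0.4):
    h, w = dt_image.shape[:2]
    blob = cv2.dnn.blobFromImage(dt_image, 1 / 255.0, (416, 416), swapRB=False, crop=False)
    net.setInput(blob)
    boxes, scores = [], []
    for out in net.forward(out_layers):
        for det in out:                       # det = [cx, cy, bw, bh, obj, class score]
            score = float(det[5])             # single "Human" class
            if score * det[4] > conf_thr:
                cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
                boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
                scores.append(score)
    keep = cv2.dnn.NMSBoxes(boxes, scores, conf_thr, nms_thr)
    return [boxes[i] for i in np.array(keep).flatten()]
```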
D. Human-Robot separation distance
Once the predicted bounding boxes are sent to the seg-
mentation pipeline, each cluster is verified: each point of
the cluster is transformed into depth pixel coordinates (by
inverting (3)-(4)). Since the bounding box is expressed in
the thermal image plane, the selected pixel is converted
into depth image coordinates through the mapping matrix
(Section II-C.1). If at least 50% of the cluster points belong
to a bounding box, the cluster is labeled as human and passes
the check. Figure 2E shows two clusters: the red human
operator, which is correctly detected by the CNN, and the
yellow plastic mannequin, which is correctly not labeled as
human.
The Human Validation check is a fundamental step to compute the correct separation distance between human operators and the robot and to apply the current regulations of industrial robotic applications. Therefore, the last step of the segmentation pipeline (Fig. 2F) identifies the nearest pair of points, one belonging to the robot (P_R) and the other belonging to the operator (P_H), that minimize the distance, i.e.,

P_H ∈ H, P_R ∈ R  |  d(P_H, P_R) ≤ d(P'_H, P'_R)  ∀ P'_H ∈ H, ∀ P'_R ∈ R,    (8)

where d(·,·) is the Euclidean distance between two points, and H and R represent the sets of all points that belong to the operator and to the robot, respectively.
While P_H is detected through the HDT strategy, a robot modeling method has been implemented to detect P_R. To take into account the link volumes and not only specific points, the proposed solution uses spheres, as in [28] and [29], to model the robot links. The kinematic chain has been padded with 12 dummy frames to cover the robot homogeneously, creating a virtual safety sphere of 0.10 m diameter around each of them. Therefore, the pair of closest points can be immediately identified: the algorithm calculates the distance between all points of the verified cluster point clouds and the origin of every robot frame. The robot point P_R lies on the closest virtual sphere, along the line connecting its origin with P_H.
This step strongly justifies the choice of a point-cloud based approach. In fact, it provides satisfactory accuracy and precision: it allows tracking humans even when they are not completely visible in the camera view, unlike common skeleton-based techniques; human operators do not need to face the camera, because the point-cloud based approach will recognize them anyway; more detailed body parts can also be detected, e.g., an elbow, the head, a hand, the chin or the chest. Figure 4 shows the results: the developed CNN distinguishes human operators belonging to the Human class (in red) from other clustered objects (in yellow), i.e., a plastic mannequin and a chair, which are not labeled as humans and are not considered for the safety separation distance computation, even if they are possibly closer to the robot. Note that the closest human cluster is selected in case of multiple human operators.

Fig. 4. Identification of the minimum distance points between the whole robot (yellow sphere) and the closest human operator (purple sphere).
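A minimal sketch of this distance computation is given below, assuming the validated human cluster points and the origins of the (dummy) robot link frames are available as arrays; the sphere radius follows the 0.10 m diameter mentioned in the text, and all names are illustrative.

```python
# Minimal sketch of the minimum separation distance of Section II-D,
# with robot links modelled by safety spheres centered at the link frames.
import numpy as np

def min_separation_distance(human_points, sphere_centers, sphere_radius=0.05):
    """human_points: (N, 3) validated human cluster points.
    sphere_centers: (M, 3) origins of the robot (dummy) link frames.
    Returns (d, P_H, P_R) as in (8), with P_R on the closest sphere surface."""
    diff = human_points[:, None, :] - sphere_centers[None, :, :]   # (N, M, 3)
    dist = np.linalg.norm(diff, axis=2)                            # center distances
    i, j = np.unravel_index(np.argmin(dist), dist.shape)
    p_h = human_points[i]
    center = sphere_centers[j]
    direction = (p_h - center) / max(dist[i, j], 1e-9)
    p_r = center + sphere_radius * direction        # closest point on the sphere
    d = max(dist[i, j] - sphere_radius, 0.0)
    return d, p_h, p_r
```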
III. TRAJECTORY SCALING
Industrial SSM scenarios allow the robot system and the
human operator to move concurrently in the collaborative
workspace. Risk reduction is achieved by maintaining at least
the minimum protective separation distance, S, between the
operator and the robot [3], assumed constant here. During
robot motion, the robot system never gets closer to the
operator than S. When the separation distance, d, is less than
S, the robot system stops, before it can impact the operator.
When the operator moves away from the robot system, it can
resume the motion automatically while maintaining at least
the protective separation distance.
The proposed strategy ensures human-robot coexistence according to the methodology of the standard regulations, introducing a low-speed mode which reduces the robot nominal velocity when the operator is inside a hazardous workspace. The approach preserves the robot task efficiency by using a time-scaling method to change the robot operating speed without introducing acceleration discontinuities.
A. Single robot work cell
With reference to a single robot case, a typical industrial pre-programmed task, T, is composed of N positions \tilde{q}_i, associated with velocities \dot{\tilde{q}}_i, accelerations \ddot{\tilde{q}}_i and time instants \tilde{t}_i. Typically, the pre-programmed joint positions have to be interpolated according to the sampling time T_c required by the robot. In this work a quintic interpolation is used, i.e., the planned interpolated trajectory is

\tilde{q}_h = p_5(t_h; T),    (9)
\dot{\tilde{q}}_h = p_4(t_h; T),    (10)
t_{h+1} = t_h + T_c,    (11)

where t_h is the h-th discrete time instant, p_4 is the derivative of the polynomial p_5, and \tilde{q}_h and \dot{\tilde{q}}_h are the planned joint position and velocity at time t_h, respectively.

Fig. 5. Relation between the computed separation distance d (x axis) and the scale factor k (y axis).
The proposed algorithm modulates the robot speed by scaling the time with a safety scale factor k, which can assume values in the interval [0, 1]. The scale factor is related to d (Section II-D) as shown in Fig. 5. When d is below the danger distance d_d = S, k is 0 and the robot stops. When the distance d is beyond the warning distance d_w > S, i.e., d > d_w, the robot can move at full speed to improve the production time. Between d_d and d_w the function in Fig. 5 varies smoothly to avoid acceleration discontinuities.

Practically, the trajectory is scaled by computing (9) with a scaled time τ_h, i.e.,

q_h = p_5(τ_h; T),    (12)
τ_{h+1} = τ_h + k T_c,    (13)

where q_h is the actual joint command at time t_h. Obviously, the joint command q_h, as well as the scaled time τ_h, are generated with sampling time T_c. The effect of this approach is the actual scaling of the joint velocities. In fact, using (13),

\dot{τ} ≈ (τ_{h+1} - τ_h) / T_c = k.    (14)

By differentiating (12), the following equation demonstrates that the velocity is scaled by the safety factor k:

\dot{q}_h = p_4(τ_h; T) k.    (15)

This approach guarantees that the task T remains the same in position but, simultaneously, the resulting velocity is scaled according to k. When the operator is about to enter a dangerous situation, the robot operates at diminished capacity with limited velocity, according to human-robot collaboration norms, until the restoration of safety conditions. Experimental results are shown in Section IV.
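A minimal sketch of this scaling law is given below. The smoothstep blend between d_d and d_w is one possible smooth profile, since the paper only requires the function of Fig. 5 to vary smoothly, and the quintic evaluators p5/p4 are assumed to be available as callables; all names are illustrative.

```python
# Minimal sketch of the trajectory scaling of Section III-A:
# a smooth scale factor k(d) and the scaled time update (13).

def scale_factor(d, d_d, d_w):
    """k = 0 for d <= d_d, k = 1 for d >= d_w, smooth in between."""
    if d <= d_d:
        return 0.0
    if d >= d_w:
        return 1.0
    s = (d - d_d) / (d_w - d_d)
    return 3 * s**2 - 2 * s**3          # smoothstep: one way to avoid acceleration jumps

def step_scaled_trajectory(tau, d, d_d, d_w, Tc, p5, p4):
    """One control cycle: returns the joint command, velocity and the new scaled time."""
    k = scale_factor(d, d_d, d_w)
    q_cmd = p5(tau)                     # (12): position along the nominal path
    qd_cmd = p4(tau) * k                # (15): velocity scaled by k
    tau_next = tau + k * Tc             # (13): scaled time update
    return q_cmd, qd_cmd, tau_next
```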
Fig. 6. Two cases of human false positives: a dummy labeled as human
in the depth CNN approach (left); a hot moving robot labeled as human in
the temperature CNN approach (right).
B. Multi-robot work cell
The strategy discussed so far can be easily extended to multi-robot work cells. Care must be taken to distinguish the case of independent robots from that of cooperating robots.

If the work cell is composed of robots which execute independent tasks, each robot must be slowed down according to the separation distance from its closest human operator. Thus, the whole pipeline proposed in this paper is executed for each robot and each robot speed is scaled independently of the others. This solution has a positive impact on the production time because it reduces the speed only of the robots involved in dangerous situations.

On the contrary, if the work cell is composed of cooperating robots, e.g., an assembly line or the transportation of a commonly held object as in [17], applying the strategy to the single robots independently can compromise the task execution. In this case, the perception pipeline must be executed for each robot up to the computation of the scaling factors. Then, the scale factor corresponding to the most dangerous robot is applied to all the configuration variables of the whole robotic system to preserve the cooperative task. Suitable danger metrics can be defined to make this decision, e.g., the scale factor of the robot closest to the human can be selected.
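A compact sketch of this policy is shown below; the function name and arguments are illustrative.

```python
# Minimal sketch of the multi-robot policy of Section III-B: independent
# robots each keep their own scale factor, while cooperating robots all
# adopt the factor of the most dangerous (closest) robot.
def select_scale_factors(k_per_robot, cooperating):
    if cooperating:
        k_shared = min(k_per_robot)     # most dangerous robot dominates
        return [k_shared] * len(k_per_robot)
    return list(k_per_robot)            # each robot scaled independently
```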
IV. EXPERIMENTAL RESULTS
The section shows a complete experiment that represents
a typical SSM collaborative scenario to describe the advan-
tages of the approach. Emphasis has been devoted to the
proposed DT-CNN to better highlight the performance.
A. Performances of DT-CNN
To test the performance of the Sensor Fusion approach,
two others CNNs have been trained based on the depth (D-
CNN) and temperature (T-CNN) information, respectively.
The networks have been trained and tested with the same
input data-set. Note that the training phase needed about
1000 training samples to reach the presented performance.
This observation represents another great advantage of using
the proposed approach if compared with many pre-trained
CNNs, which usually needed tens of thousands of images to
be trained.
The Mean Average Precision (mAP) has been adopted as a metric to measure the accuracy of each CNN [30].
TABLE I
CNNs TESTING RESULTS

CNN      mAP %   False Positives %   False Negatives %
D-CNN    65.09   36.87               17.53
T-CNN    62.76   64.35                4.37
DT-CNN   57.54    2.47               11.85
Fig. 7. Experiment: an operator enters the shared workspace while the robot is moving. The top plot shows the estimated robot-operator distance d, the dangerous distance d_d and the warning distance d_w. The bottom plot shows the trajectory scaling factor k adopted to scale the robot velocity.
Considering the bounding boxes returned by the prediction and the ground truth, the estimation of the mAP is based on the calculation of various metrics such as precision, recall and Intersection over Union (IoU). Since the mAP does not consider false positives and false negatives, the percentage of erroneous detections has been estimated as an additional metric.
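For reference, a minimal sketch of the IoU between a predicted and a ground-truth box, the quantity used together with precision and recall in the mAP evaluation, is shown below (boxes in (x, y, w, h) form; names are illustrative).

```python
# Minimal sketch of the Intersection over Union (IoU) between two boxes.
def iou(box_a, box_b):
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    ix = max(0.0, min(xa + wa, xb + wb) - max(xa, xb))   # overlap width
    iy = max(0.0, min(ya + ha, yb + hb) - max(ya, yb))   # overlap height
    inter = ix * iy
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0
```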
Comparing the results of Table I, the DT-CNN provides an mAP slightly lower than the other two methods, since the mAP is computed by considering only true positives, whereas its percentage of false positives is considerably lower than the others. In all cases the percentage of false negatives remains low. The high percentage of false positives of the single-source approaches arises in cases where hot objects (T-CNN) or objects with shapes comparable to human ones (D-CNN) are confused with a human (Fig. 6).
B. Complete experiment
To evaluate the combined approach (human detection and trajectory scaling), an SSM scenario has been experimentally tested. A video of the experiment is available at https://youtu.be/BrcvKmSiR9Q. A collaborative industrial manufacturing operation has been considered. The robot executes a pre-planned task at a given nominal speed. Suddenly, an operator enters the robot workspace to perform some manual operations (see the accompanying video). Results are shown in Fig. 7. At about 15 s the operator enters the collaborative workspace and the system starts to measure the separation distance d (blue line). When d goes below the warning distance d_w, the trajectory scaling factor k becomes lower than 1 and the robot reduces its velocity without changing its path. In some intervals of the experiment d goes below the dangerous distance d_d, thus k becomes 0 and the robot stops.
In the accompanying video the DT-CNN has also been compared with the networks that use single sources (depth or thermal). It is clear that the DT-CNN ensures better human detection, minimizing the false positives.
This experiment demonstrates how the proposed approach
is able to automatically detect human operators in collabora-
tive workspaces and modulate the robot velocity according
to the current regulations.
V. CONCLUSIONS
This work shows a multimodal perception system based on a thermal and a depth camera, adopted to detect human operators in general multi-robot work cells. The cameras have been rigidly coupled and calibrated. Results show that the calibration error is small enough to implement a new sensor fusion technique that processes the camera images and robustly detects humans in the observed scene. The fusion technique consists in defining an RGB image by combining depth and thermal images on two channels. The fused image is then processed by a CNN specifically trained to detect humans in the workspace. The approach based on the fused images has been demonstrated to be more efficient than single-source perception data in a real collaborative scenario. Future developments will be devoted to devising an SSM strategy where, rather than being assumed constant, the minimum protective distance is computed based on the actual robot and human velocities estimated through the same perception data used for distance computation. A risk analysis based on the real velocity information can lead to the definition of a less conservative minimum protective distance, hence maximizing productivity.
REFERENCES
[1] “Robots and robotic devices - Safety requirements for industrial robots. Part 1: Robots,” International Organization for Standardization, Technical report, 2011.
[2] “Robots and robotic devices - Safety requirements for industrial robots. Part 2: Robot systems and integration,” International Organization for Standardization, Technical report, 2011.
[3] “Robots and robotic devices - Collaborative robots,” International Organization for Standardization, Technical report, 2016.
[4] J. Heinzmann and A. Zelinsky, “Quantitative safety guarantees
for physical human-robot interaction,” The International Journal of
Robotics Research, vol. 22, no. 7-8, pp. 479–504, jul 2003.
[5] F. Flacco, T. Kroger, A. D. Luca, and O. Khatib, “A depth space
approach to human-robot collision avoidance,” in 2012 IEEE Interna-
tional Conference on Robotics and Automation. IEEE, may 2012.
[6] N. Arora, D. Martins, D. Ruggerio, E. Tousimis, A. J. Swistel, M. P. Osborne, and R. M. Simmons, “Effectiveness of a noninvasive digital infrared thermal imaging system in the detection of breast cancer,” The American Journal of Surgery, vol. 196, no. 4, pp. 523–526, oct 2008.
[7] M. Vollmer and K.-P. Möllmann, Infrared Thermal Imaging. Wiley-VCH Verlag GmbH & Co. KGaA, dec 2017.
[8] H. Kaplan, Practical Applications of Infrared Thermal Sensing and
Imaging Equipment. SPIE Publications, 2007.
[9] B. Li, H. Jin, Q. Zhang, W. Xia, and H. Li, “Indoor human detection
using RGB-d images,” in 2016 IEEE International Conference on
Information and Automation (ICIA). IEEE, aug 2016.
[10] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). IEEE, 2005, pp. 886–893, vol. 1.
[11] M. Liang and X. Hu, “Recurrent convolutional neural network for
object recognition,” in 2015 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR). IEEE, jun 2015.
[12] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” CoRR, vol. abs/1312.6229, 2013. [Online]. Available: http://arxiv.org/abs/1312.6229
[13] K. Zhou and A. Paiement, “Detecting humans in rgb-d data with
cnns,” in 15th IAPR International Conference on Machine Vision
Applications (MVA), Nagoya University, Nagoya, Japan. IEEE, 2017.
[14] S. Vidas, P. Moghadam, and M. Bosse, “3d thermal mapping of
building interiors using an RGB-d and thermal camera,” in 2013 IEEE
International Conference on Robotics and Automation. IEEE, may
2013.
[15] P. Rybski, P. Anderson-Sprecher, D. Huber, C. Niessl, and R. Sim-
mons, “Sensor fusion for human safety in industrial workcells,” in
2012 IEEE/RSJ International Conference on Intelligent Robots and
Systems. IEEE, oct 2012.
[16] P. Zhang, P. Jin, G. Du, and X. Liu, “Ensuring safety in human-robot
coexisting environment based on two-level protection,” Industrial
Robot: An International Journal, vol. 43, no. 3, pp. 264–273, may
2016.
[17] M. Lippi and A. Marino, “Safety in human-multi robot collaborative
scenarios: a trajectory scaling approach,” IFAC-PapersOnLine, vol. 51,
no. 22, pp. 190–196, 2018.
[18] S. Choi, T. Kim, and W. Yu, “Performance evaluation of ransac
family,” in Procedings of the British Machine Vision Conference 2009.
British Machine Vision Association, 2009.
[19] C. Papazov and D. Burschka, “An efficient ransac for 3d object
recognition in noisy and occluded scenes,” in Computer Vision – ACCV
2010. Springer Berlin Heidelberg, 2011, pp. 135–148.
[20] R. B. Rusu and S. Cousins, “3d is here: Point cloud library (PCL),” in
IEEE International Conference on Robotics and Automation. IEEE,
may 2011.
[21] T. Luhmann, J. Piechel, and T. Roelfs, “Geometric calibration of ther-
mographic cameras,” in Thermal Infrared Remote Sensing. Springer
Netherlands, 2013, pp. 27–42.
[22] J. Rangel and S. Soldan, “3d thermal imaging: Fusion of thermog-
raphy and depth cameras,” in Proceedings of the 2014 International
Conference on Quantitative InfraRed Thermography. QIRT Council,
2014.
[23] N. Blodow, “Realtime URDF filter,” 2012. [Online]. Available: http://github.com/blodow/realtime_urdf_filter
[24] B. Yang, Z. liang Jing, and H. tao Zhao, “Review of pixel-level image
fusion,” Journal of Shanghai Jiaotong University (Science), vol. 15,
no. 1, pp. 6–12, feb 2010.
[25] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” CoRR, vol. abs/1804.02767, 2018. [Online]. Available: http://arxiv.org/abs/1804.02767
[26] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,
“ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09,
2009.
[27] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Computer Vision – ECCV 2014. Springer International Publishing, 2014, pp. 740–755.
[28] S. I. Choi and B. K. Kim, “Obstacle avoidance control for redun-
dant manipulators using collidability measure,” in Proceedings 1999
IEEE/RSJ International Conference on Intelligent Robots and Systems.
Human and Environment Friendly Robots with High Intelligence and
Emotional Quotients (Cat. No.99CH36289). IEEE, 1999.
[29] P. Bosscher and D. Hedman, “Real-time collision avoidance algorithm
for robotic manipulators,” in 2009 IEEE International Conference on
Technologies for Practical Robot Applications. IEEE, 2009, pp. 113–
121.
[30] W. Su, Y. Yuan, and M. Zhu, “A relationship between the average
precision and the area under the roc curve,” in Proceedings of the 2015
International Conference on The Theory of Information Retrieval,
ser. ICTIR ’15. New York, NY, USA: ACM, 2015, pp. 349–352.
[Online]. Available: http://doi.acm.org/10.1145/2808194.2809481
... To satisfy the requirements of the standard ISO/TS 15066, all the parameters of Equation (1) were considered, assuming V h and V s as constant, even though the speed modulation system proposed in [37] is able to optimize these values, relying on a risk analysis performed online through a fuzzy inference system. Figure 2b shows an example of the end effector normalized speed trend as function of the distance between the robot and the human hands (Figure 2a) when these last are detected according to the multimodal system proposed in [38]. It is worth to note that the robot has not a 1 or 0 condition of speed, but its speed is modulated according to the distance from the worker and even the minimum separation distance S is adjusted in real time again on the basis of the actual robot and human relative positions and velocities (see [38] for more details). ...
... Figure 2b shows an example of the end effector normalized speed trend as function of the distance between the robot and the human hands (Figure 2a) when these last are detected according to the multimodal system proposed in [38]. It is worth to note that the robot has not a 1 or 0 condition of speed, but its speed is modulated according to the distance from the worker and even the minimum separation distance S is adjusted in real time again on the basis of the actual robot and human relative positions and velocities (see [38] for more details). ...
... To satisfy the requirements of the standard ISO/TS 15066, all the parameters of Equation (1) were considered, assuming Vh and Vs as constant, even though the speed modulation system proposed in [37] is able to optimize these values, relying on a risk analysis performed online through a fuzzy inference system. Figure 2b shows an example of the end effector normalized speed trend as function of the distance between the robot and the human hands (Figure 2a) when these last are detected according to the multimodal system proposed in [38]. It is worth to note that the robot has not a 1 or 0 condition of speed, but its speed is modulated according to the distance from the worker and even the minimum separation distance S is adjusted in real time again on the basis of the actual robot and human relative positions and velocities (see [38] for more details). ...
Article
Full-text available
In current industrial systems, automation is a very important aspect for assessing manufacturing production performance related to working times, accuracy of operations and quality. In particular, the introduction of a robotic system in the working area should guarantee some improvements, such as risks reduction for human operators, better quality results and a speed increase for production processes. In this context, human action remains still necessary to carry out part of the subtasks, as in the case of composites assembly processes. This study aims at presenting a case study regarding the reorganization of the working activity carried out in workstation in which a composite fuselage panel is assembled in order to demonstrate, by means of simulation tool, that some of the advantages previously listed can be achieved also in aerospace industry. In particular, an entire working process for composite fuselage panel assembling will be simulated and analyzed in order to demonstrate and verify the applicability and effectiveness of human–robot interaction (HRI), focusing on working times and ergonomics and respecting the constraints imposed by standards ISO 10218 and ISO TS 15066. Results show the effectiveness of HRI both in terms of assembly performance, by reducing working times and ergonomics—for which the simulation provides a very low risk index.
... Data from each modality are applied individually or combined with others. In 374 articles, im-age modality was present in 247, which were image data individually used in 129 articles [49], [51], [52], [54], [56], [57], [59]- [61], [63], [ [123], [128], [130], [134], [135], [137]- [149], [151], [152], [154], [155], [157], [158], [160], [162], [172]- [174], [176], [178], [191], [194]- [196], [198], [202], [203], [209], [210], [212], [214], [215], [217], [221], [222], [224], [225], [228], [247], [256], [267], [268], [272], [291], [307], [308], [310]- [312], [317], [319]- [323], [327]- [329], [333], [337], [339], [341], [343]- [348], [350], [359], [369], [372], [373], [397], [400], [402], [414], [416] and 118 articles combined with other types. Video modality was found in a total of 45 articles, with 11 papers [165], [218], [269], [330], [378], [380]- [382], [387], [388], [390] separately and 34 combined with data of different modalities. ...
... A total of 212 articles related to fusion learning were encountered. Of 155 articles, 99 were model-agnostic, where 62 pertained to early [55], [56], [58], [59], [62], [ [119], [120], [133], [141], [142], [166], [173], [207], [213], [240], [242], [250], [252], [254], [258], [259], [270], [271], [280], [282], [299], [303], [305]- [307], [313], [320], [324], [326], [330], [334], [337], [347], [349], [357], [359], [364], [367], [381], [382], [384], [391], [393], [397], [405], [406], 23 pertained to late [114], [127], [136], [153], [161], [167], [174], [180], [181], [218], [241], [256], [264], [276], [279], [308], [322], [360], [365], [366], [387], [389], [392] and 14 pertained to hybrid [182], [189], [253], [296], [310]- [312], [315], [316], [318], [319], [323], [325], [380]. In all, 56 model-based studies were discovered, with 46 relating to VOLUME 4, 2016 ...
Article
Full-text available
Multimodal machine learning (MML) is a tempting multidisciplinary research area where heterogeneous data from multiple modalities and machine learning (ML) are combined to solve critical problems. Usually, research works use data from a single modality, such as images, audio, text, and signals. However, real-world issues have become critical now, and handling them using multiple modalities of data instead of a single modality can significantly impact finding solutions. ML algorithms play an essential role by tuning parameters in developing MML models. This paper reviews recent advancements in the challenges of MML, namely: representation, translation, alignment, fusion and co-learning, and presents the gaps and challenges. A systematic literature review (SLR) applied to define the progress and trends on those challenges in the MML domain. In total, 1032 articles were examined in this review to extract features like source, domain, application, modality, etc. This research article will help researchers understand the constant state of MML and navigate the selection of future research directions.
... While vision is the most important modality in robot perception, the advantages of analyzing other modalities such as tactile or audio are explored further in recent years [1], [2], [3], [4]. Especially through the combination in multimodal perception, additional sensory modalities are valuable. ...
Conference Paper
Full-text available
The key role of tactile sensing for human grasping and manipulation is widely acknowledged, but most industrial robot grippers and even multi-fingered hands are still designed and used without any tactile sensors. While the basic design principles for resistive or capacitive sensors are well known, several factors keep tactile sensing from large-scale deployment-high sensor costs, short lifespan, poor reliability, difficult production processes, a lack of suitable software and tools for system integration, and the unique requirement for tactile sensors to conform to application-specific shapes. In this work, we describe a very simple but efficient approach to design low-cost resistive matrix sensors, where sensor layout and geometry, taxel-size, and measurement sensitivity can be customized over a wide range. Sensor assembly needs nothing more than a hobby cutting plotter for precise cutting of aluminum tape and Velostat foils, as well as adhesive plastic tape. Our electronics combines transimpedance amplifiers with common Arduino microcontrollers, supporting standard communication protocols, and using either cabled or wireless data transfer to the host. We present three different application examples and sketch our ROS software for sensor calibration and visualization. All parts of our project, including detailed building instructions, bill-of-materials, electronics, and firmware are available open-source.
... Furthermore, a complete assembly cycle requires to perform the mentioned tasks on thousands of holes per aircraft, thus tight time constraints are usually imposed on each single operation to keep the production rates high. Also, since in most cases the human intervention is still required, the automated workcell must be adapted for human cooperation [1,4,5]. In particular, a higher degree of safety is necessary, avoiding dangerous robot movements and configurations. ...
Preprint
Full-text available
Aerospace production volumes have increased over time and robotic solutions have been progressively introduced in the aeronautic assembly lines to achieve high-quality standards, high production rates, flexibility and cost reduction. Robotic workcells are sometimes characterized by robots mounted on slides to increase the robot workspace. The slide introduces an additional degree of freedom, making the system kinematically redundant, but this feature is rarely used to enhance performances. The paper proposes a new concept in trajectory planning, that exploits the redundancy to satisfy additional requirements. A dynamic programming technique is adopted, which computes optimized trajectories, minimizing or maximizing the performance indices of interest. The use case is defined on the LABOR (Lean robotized AssemBly and cOntrol of composite aeRostructures) project which adopts two cooperating six-axis robots mounted on linear axes to perform assembly operations on fuselage panels. Considering the needs of this workcell, unnecessary robot movements are minimized to increase safety, the mechanical stiffness is maximized to increase stability during the drilling operations, collisions are avoided, while joint limits and the available planning time are respected. Experiments are performed in a simulation environment, where the optimal trajectories are executed, highlighting the resulting performances and improvements with respect to non-optimized solutions.
... Furthermore, a complete assembly cycle requires to perform the mentioned tasks on thousands of holes per aircraft, thus tight time constraints are usually imposed on each single operation to keep the production rates high. Also, since in most cases the human intervention is still required, the automated workcell must be adapted for human cooperation [1,4,5]. In particular, a higher degree of safety is necessary, avoiding dangerous robot movements and configurations. ...
Article
Full-text available
Aerospace production volumes have increased over time and robotic solutions have been progressively introduced in the aeronautic assembly lines to achieve high-quality standards, high production rates, flexibility and cost reduction. Robotic workcells are sometimes characterized by robots mounted on slides to increase the robot workspace. The slide introduces an additional degree of freedom, making the system kinematically redundant, but this feature is rarely used to enhance performances. The paper proposes a new concept in trajectory planning, that exploits the redundancy to satisfy additional requirements. A dynamic programming technique is adopted, which computes optimized trajectories, minimizing or maximizing the performance indices of interest. The use case is defined on the LABOR (Lean robotized AssemBly and cOntrol of composite aeRostructures) project which adopts two cooperating six-axis robots mounted on linear axes to perform assembly operations on fuselage panels. Considering the needs of this workcell, unnecessary robot movements are minimized to increase safety, the mechanical stiffness is maximized to increase stability during the drilling operations, collisions are avoided, while joint limits and the available planning time are respected. Experiments are performed in a simulation environment, where the optimal trajectories are executed, highlighting the resulting performances and improvements with respect to non-optimized solutions.
... Munaro et al. estimated the human's pose based on a color camera with a depth sensor and analyzed the images to generate features that are used to learn the model of a SVM [30]. Recent work in CNNs inspired Costanzo et al. to use it for human detection in industrial environments and robotic work cells [31]. Due to the advanced performance in accuracy and the option that features can be learned through the CNN, greatly simplifying the process of ML. ...
Conference Paper
Abstract—The reliability and robustness of systems using machine learning to detect humans is of high importance for the safety of workers in a shared workspace. Developments such as deep learning are advancing rapidly, supporting the field of robotics through increased perception capabilities. An early detection of humans will support robot behavior to reduce downtime or system stoppages due to unsafe proximity between humans and robots. In this work, we present an industry-oriented experimental setup, in which humans and robots share the same workplace. We have created our own dataset to detect humans wearing different clothing. We evaluate Faster R-CNN and SSD which are state-of-the-art detectors on two different camera viewpoints. In addition, this paper elaborates on the requirements for validating the safety of such a system to be used in industrial safety applications. Index Terms—Artificial Intelligence, Human Detection, Industrial Safety Standards, Machine Learning, Robotics
Article
Full-text available
Robotic vision for human-robot interaction and collaboration is a critical process for robots to collect and interpret detailed information related to human actions, goals, and preferences, enabling robots to provide more useful services to people. This survey and systematic review presents a comprehensive analysis on robotic vision in human-robot interaction and collaboration over the last 10 years. From a detailed search of 3850 articles, systematic extraction and evaluation was used to identify and explore 310 papers in depth. These papers described robots with some level of autonomy using robotic vision for locomotion, manipulation and/or visual communication to collaborate or interact with people. This paper provides an in-depth analysis of current trends, common domains, methods and procedures, technical processes, data sets and models, experimental testing, sample populations, performance metrics and future challenges. This manuscript found that robotic vision was often used in action and gesture recognition, robot movement in human spaces, object handover and collaborative actions, social communication and learning from demonstration. Few high-impact and novel techniques from the computer vision field had been translated into human-robot interaction and collaboration. Overall, notable advancements have been made on how to develop and deploy robots to assist people.
Article
This article investigates the problem of controlling the speed of robots in collaborative workcells for automated manufacturing. The solution is tailored to robotic cells for cooperative assembly of aircraft fuselage panels, where only structural elements are present and robots and humans can share the same workspace, but no physical contact is allowed unless it happens at zero robot speed. The proposed approach addresses the problem of satisfying the minimal set of requirements of an industrial human–robot collaboration (HRC) task: precision and reliability of human detection and tracking in the shared workspace, and correct robot task execution with minimum cycle time while assuring safety for human operators. These requirements often conflict with each other. The former concerns not only safety but also the need to avoid unnecessary robot stops or slowdowns in case of false-positive human detections. The latter, according to the current regulations, concerns the need to compute the minimum protective separation distance between the human operator and the robots and to adjust their speed when dangerous situations occur. This article proposes a novel fuzzy inference approach that controls robot speed to enforce safety while maximizing the productivity of the robot by minimizing cycle time as well. The approach is supported by a sensor fusion algorithm that merges the images acquired from different depth sensors with those obtained from a thermal camera, using a machine learning approach. The methodology is validated in two experiments: the first at lab scale and the second performed on a full-scale robotic workcell for cooperative assembly of aeronautical structural parts.
Note to Practitioners: This article discusses a way to handle human safety specifications versus production requirements in collaborative robotized assembly systems. State-of-the-art (SoA) approaches cover only a few aspects of both human detection and robot speed scaling. The present research work proposes a complete pipeline that starts from a robust human tracking algorithm and scales the robot speed in real time. An innovative multimodal perception system composed of two depth cameras and a thermal camera monitors the collaborative workspace. The speed scaling algorithm is designed to account for different human behaviors in less risky or more dangerous situations, guaranteeing both operator safety and minimum production time, with the aim of better profitability and efficiency for collaborative workstations. The algorithm estimates the operator's intention for real-time computation of the minimum protective distance according to the current safety regulations. The robot speed is changed smoothly for the psychological comfort of operators, in the case of both single and multiple workers. The result is a complete system, easily implementable on a standard industrial workcell.
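The speed-scaling logic can be pictured with a deliberately simplified sketch in the spirit of Speed and Separation Monitoring: compute a protective separation distance from operator and robot speeds plus stopping and uncertainty terms, then scale the commanded speed with the measured distance. All constants below are illustrative, and the fuzzy inference of the cited article is not reproduced.

```python
# Simplified Speed and Separation Monitoring sketch: protective separation
# distance plus a speed-scaling factor. Constants are illustrative; the fuzzy
# inference of the cited article is not reproduced here.
def protective_distance(v_h, v_r, t_reaction=0.1, t_stop=0.3, stop_dist=0.2,
                        intrusion=0.1, uncertainty=0.1):
    s_h = v_h * (t_reaction + t_stop)   # operator displacement while the system reacts and stops
    s_r = v_r * t_reaction              # robot displacement during the reaction time
    return s_h + s_r + stop_dist + intrusion + uncertainty

def speed_scale(distance, v_h, v_r_max):
    """Scaling factor in [0, 1] applied to the commanded robot speed."""
    s_p = protective_distance(v_h, v_r_max)
    if distance <= s_p:
        return 0.0                      # operator inside the protective distance: stop
    margin = 2.0 * s_p                  # heuristic: start slowing within twice s_p
    return min(1.0, (distance - s_p) / (margin - s_p))

print(speed_scale(distance=1.5, v_h=1.6, v_r_max=1.0))
```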
Article
Full-text available
In this paper, a strategy to handle human safety in a multi-robot scenario is devised. In the presented framework, robots are in charge of performing any cooperative manipulation task, parameterized by a proper task function. The devised architecture answers the increasing demand for strict cooperation between humans and robots, since it equips a general multi-robot cell with the feature of making robots and humans work together. Human safety is handled by defining a safety index which depends on both the relative position and the velocity of the human operator with respect to the robots. The multi-robot task trajectory is then properly scaled in order to ensure that the safety index never falls below a given threshold, which can be set in worst-case conditions according to a minimum allowed distance. Simulation results are presented in order to prove the effectiveness of the approach.
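A minimal sketch of such a distance- and velocity-dependent safety index, and of scaling a nominal velocity so that motion slows as the index approaches a threshold, is given below; gains and thresholds are placeholders rather than the values used in the paper.

```python
# Illustrative safety index combining human-robot distance and approach
# velocity, used to scale a nominal joint velocity; thresholds are placeholders.
import numpy as np

def safety_index(p_rel, v_rel, d0=1.0, k_v=0.5):
    """Higher is safer. p_rel: human position relative to the robot; v_rel: its derivative."""
    d = np.linalg.norm(p_rel)
    closing = max(0.0, -np.dot(v_rel, p_rel) / (d + 1e-9))   # > 0 while the two approach
    return d / d0 - k_v * closing

def scaled_velocity(qdot_nominal, index, index_min=0.2, index_safe=1.0):
    """Linearly reduce the nominal velocity as the index drops toward index_min."""
    alpha = float(np.clip((index - index_min) / (index_safe - index_min), 0.0, 1.0))
    return alpha * np.asarray(qdot_nominal)

qdot = scaled_velocity([0.3, -0.1, 0.2],
                       safety_index(np.array([0.8, 0.2, 0.0]), np.array([-0.4, 0.0, 0.0])))
print(qdot)
```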
Article
Full-text available
Purpose: The purpose of this paper is to provide a novel methodology based on two-level protection for ensuring the safety of a moving human who enters the robot's workspace, which is significant for dealing with the problem of human security in a human-robot coexisting environment.
Design/methodology/approach: In this system, anyone who enters the robot's working space is detected by using the Kinect, and their skeletons are calculated by the interval Kalman filter in real time. The first-level protection is mainly based on the prediction of the human motion, which uses a Gaussian mixture model and Gaussian mixture regression. However, even in cases where the prediction of human motion is incorrect, the system can still safeguard the human by enlarging the initial bounding volume of the human as the second-level early warning area. Finally, an artificial potential field with some additional avoidance strategies is used to plan a path for the robot manipulator.
Findings: Experimental studies on the GOOGOL GRB3016 robot show that the robot manipulator can accomplish the predetermined tasks by circumventing the human, and the human does not feel in danger.
Originality/value: This study presented a new framework for ensuring human security in a human-robot coexisting environment, and thus can improve the reliability of human-robot cooperation.
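The artificial-potential-field step can be illustrated with a short sketch that adds a Khatib-style repulsive velocity, active only within a given radius of the nearest human keypoint, to an attractive term toward the goal; gains and radii below are illustrative assumptions, not the paper's tuning.

```python
# Minimal artificial-potential-field sketch: attractive velocity toward the goal
# plus a Khatib-style repulsive term near the closest human keypoint.
# Gains and the influence radius are illustrative placeholders.
import numpy as np

def apf_velocity(x, x_goal, x_obstacle, k_att=1.0, k_rep=0.05, rho0=0.6):
    v_att = -k_att * (x - x_goal)                 # attractive term
    d = np.linalg.norm(x - x_obstacle)
    if 1e-6 < d < rho0:
        # repulsive gradient, active only within the influence radius rho0
        v_rep = k_rep * (1.0 / d - 1.0 / rho0) / d**2 * (x - x_obstacle) / d
    else:
        v_rep = np.zeros_like(x)
    return v_att + v_rep

v = apf_velocity(np.array([0.5, 0.0, 0.4]),       # current end-effector position
                 np.array([0.8, 0.2, 0.4]),       # goal
                 np.array([0.2, 0.3, 0.4]))       # nearest human keypoint
print(v)
```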
Article
We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320x320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 mAP@50 in 51 ms on a Titan X, compared to 57.5 mAP@50 in 198 ms by RetinaNet, similar performance but 3.8x faster. As always, all the code is online at https://pjreddie.com/yolo/
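As a hedged usage sketch, YOLOv3 weights exported in Darknet format can be run through OpenCV's DNN module as below; the configuration and weight file names are assumed to be available locally, and class index 0 corresponds to "person" in the standard COCO name list.

```python
# Hedged usage sketch: running Darknet-format YOLOv3 weights through OpenCV's
# DNN module and keeping "person" detections (COCO class 0). The cfg/weights
# file names are assumed to be available locally.
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
img = cv2.imread("frame.png")                     # placeholder input frame
blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

h, w = img.shape[:2]
for out in outputs:
    for det in out:
        class_scores = det[5:]
        cls = int(np.argmax(class_scores))
        conf = float(class_scores[cls])
        if cls == 0 and conf > 0.5:               # class 0 = "person"
            cx, cy, bw, bh = det[0:4] * np.array([w, h, w, h])
            print("person at", int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh),
                  "conf", round(conf, 2))
```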
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
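A minimal modern re-implementation of this pipeline, assuming scikit-image and scikit-learn and placeholder 128x64 crops, looks roughly as follows; with 8x8 cells and 2x2 blocks the descriptor has the familiar 3780 dimensions.

```python
# Sketch of HOG features plus a linear SVM in the spirit of the Dalal-Triggs
# detector; data, labels and parameters are placeholders.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(window):
    """window: 128x64 (height x width) grayscale crop with values in [0, 1]."""
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")   # 3780-D descriptor

# Placeholder training crops: 10 "person" and 10 "background" windows.
X_crops = [np.random.rand(128, 64) for _ in range(20)]
y = np.array([1] * 10 + [0] * 10)

X = np.stack([hog_features(c) for c in X_crops])
clf = LinearSVC(C=0.01).fit(X, y)
print(clf.decision_function(X[:3]))    # signed detection scores for three windows
```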
Conference Paper
Recently, RGB-D sensors such as the Kinect and Xtion have received considerable attention since they provide depth images that are robust to light variations in the environment. They are mainly used for human-computer interaction, surveillance and so on. In this paper, we concentrate on indoor human detection using RGB-D images. Some RGB image based features, such as the histogram of oriented gradients (HOG) and local binary patterns (LBP), are first briefly introduced. Then, a new depth feature that describes the self-similarity of an image is proposed. Finally, a combination of them is utilized to detect people. This scheme can efficiently describe humans in the indoor environment. Extensive experiments demonstrate that the proposed scheme achieves promising detection accuracies of 99.28%, 95.48% and 99.91%, respectively, on three different collected RGB-D data sets.
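The feature-combination idea can be sketched by concatenating HOG and LBP-histogram features from the intensity channel with a simple depth histogram; note that the plain depth histogram below merely stands in for the paper's self-similarity descriptor, and all parameters are assumptions.

```python
# Sketch of the RGB-D feature combination: HOG and an LBP histogram from the
# intensity image, concatenated with a plain depth histogram (a stand-in for
# the paper's self-similarity depth feature); all parameters are placeholders.
import numpy as np
from skimage.feature import hog, local_binary_pattern

def rgbd_descriptor(gray, depth):
    f_hog = hog(gray, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    lbp = local_binary_pattern((gray * 255).astype(np.uint8), P=8, R=1, method="uniform")
    f_lbp, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    f_depth, _ = np.histogram(depth, bins=16, range=(0.0, 5.0), density=True)  # metres
    return np.concatenate([f_hog, f_lbp, f_depth])

desc = rgbd_descriptor(np.random.rand(128, 64), 2.0 + np.random.rand(128, 64))
print(desc.shape)
```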
Conference Paper
For similar evaluation tasks, the area under the receiver operating characteristic curve (AUC) is often used by researchers in machine learning, whereas the average precision (AP) is used more often by the information retrieval community. We establish some results to explain why this is the case. Specifically, we show that, when both the AUC and the AP are rescaled to lie in [0,1], the AP is approximately the AUC times the initial precision of the system.
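As a practical aside, both measures are available in scikit-learn, so they can be compared directly on the same ranked scores; the snippet below also reports a top-k precision as one informal proxy for the "initial precision" mentioned in the abstract, not its exact definition.

```python
# Computing both ranking metrics with scikit-learn on the same synthetic scores.
# The top-k precision printed last is only an informal proxy for the "initial
# precision" discussed in the cited paper, not its exact definition.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)                                  # relevance labels
scores = np.where(y == 1, rng.normal(1.0, 1.0, 2000), rng.normal(0.0, 1.0, 2000))

auc = roc_auc_score(y, scores)
ap = average_precision_score(y, scores)
top_k = np.argsort(-scores)[:50]
print(f"AUC={auc:.3f}  AP={ap:.3f}  precision@50={y[top_k].mean():.3f}")
```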
Book
This new up-to-date edition of the successful handbook and ready reference retains the proven concept of the first, covering basic and advanced methods and applications in infrared imaging from two leading expert authors in the field. All chapters have been completely revised and expanded and a new chapter has been added to reflect recent developments in the field and report on the progress made within the last decade. In addition there is now an even stronger focus on real-life examples, with 20% more case studies taken from science and industry. For ease of comprehension the text is backed by more than 590 images which include graphic visualizations and more than 300 infrared thermography figures. The latter include many new ones depicting, for example, spectacular views of phenomena in nature, sports, and daily life.