
NimbRo Picking: Versatile Part Handling for Warehouse Automation


Abstract

Part handling in warehouse automation is challenging if a large variety of items must be accommodated and items are stored in unordered piles. To foster research in this domain, Amazon holds picking challenges. We present our system which achieved second and third place in the Amazon Picking Challenge 2016 tasks. The challenge required participants to pick a list of items from a shelf or to stow items into the shelf. Using two deep-learning approaches for object detection and semantic segmentation and one item model registration method, our system localizes the requested item. Manipulation occurs using suction on points determined heuristically or from 6D item model registration. Parametrized motion primitives are chained to generate motions. We present a full-system evaluation during the APC 2016 and component-level evaluations of the perception system on an annotated dataset.
Max Schwarz, Anton Milan, Christian Lenz, Aura Muñoz, Arul Selvam Periyasamy,
Michael Schreiber, Sebastian Schüller, and Sven Behnke
Bin-picking problems arise in a wide range of applications,
from industrial automation to personal service robots. In
the case of warehouse automation, the problem setting has
unique properties: While the surrounding environment is usu-
ally very structured—boxes, pallets and shelves—the sheer
number and diversity of objects that need to be recognized
and manipulated pose daunting challenges.
In July 2016, Amazon held the second Amazon Picking
Challenge (APC), which provided a platform for comparing
state-of-the-art solutions and new developments in bin
picking and stowing applications. The challenge consisted of
two separate tasks, where contestants were required to pick
twelve specified items out of chaotically arranged shelf boxes
and to stow twelve items from an unordered pile in
a tote into the shelf. Amazon provided a set of objects from
39 categories, representing a large variety of challenging
properties, including transparency (e.g. water bottle), shiny
surfaces (e.g. metal or shrink wrap), deformable materials
(e.g. textiles), black surfaces (difficult to measure depth),
white textureless surfaces, heavy objects, and non-solid ob-
jects with many holes (not easy to grasp with a suction
cup). The shiny metal floors of the shelf boxes also posed
a considerable challenge to the perception systems, as all
objects are also visible through their mirrored image. Before
the run, the system was supplied with a task file, which
specified the desired objects and the object location (in terms
of shelf boxes or the tote). After the run, the system was
expected to output the new locations of the items.
Our team developed a robotic system for the APC with
some unique properties, which will be presented in this work.
University of Bonn,
Fig. 1. Picking objects from the APC shelf.
Our main contributions include:
1) Development of two deep-learning based object per-
ception methods that employ transfer learning to learn
from few annotated examples (Section V),
2) integration of said deep-learning techniques into a
robotic system,
3) and a parametrized-primitive-based motion generator
which renders motion planning unnecessary (Sec-
tion VI).
Bin picking is one of the classical problems in robotics
and has been investigated by many research groups in
the last three decades, e.g. [1]–[9]. In these works, often
simplifying conditions are exploited, e.g. known parts of
one type being in the bin, parts with holes that are easy
to grasp by sticking fingers inside, flat parts, parts composed
of geometric primitives, well textured parts, or ferrous parts
that can be grasped with a magnetic gripper.
During the APC 2015, various approaches to a more
general shelf-picking problem have been proposed and eval-
uated. Correll et al. [10] aggregate lessons learned during
the APC 2015 and show a general overview and statistics of
the approaches. For example, 36 % of all teams (seven of the
top ten teams) used suction for manipulating the objects.
Eppner et al. [11] describe their winning system for APC
2015. Mechanically, the robot consists of a mobile base and
a 7-DOF arm to reach all shelf bins comfortably. In contrast,
our system uses a larger arm and can thus operate without
IEEE International Conference on Robotics and Automation (ICRA), Singapore, May 2017.
a mobile base (see Section III). The endeffector of Eppner
et al. [11] is designed as a fixed suction gripper, which
can execute top and side picks; front picks are, however,
not possible. For object perception, a single RGB-D camera
captures the scene. Six hand-crafted features are extracted for
each pixel, including color and geometry-based features. The
features are then used in a histogram backprojection scheme
to estimate the posterior probability for a particular object
class. The target segment is found by searching for the pixel
with the maximum probability. After fitting a 3D bounding
box, top or side grasps are selected heuristically. Similar
to our system, motion generation is based on parametrized
motion primitives and feedback is used to react safely
to collisions with the environment, rather than performing
complex motion planning beforehand. The system could not
manipulate the pencil cup object, which our system can pick
with a specialized motion primitive. The team performed
very well at APC 2015 and reached 148 out of 190 points.
Yu et al. [12] reached second place with Team MIT in
the APC 2015. Their system uses a stationary industrial arm
and a hybrid suction/gripping endeffector. The industrial arm
provides high accuracy and also high speed. Similar to our
approach, an Intel RealSense sensor mounted on the wrist is
used for capturing views of the bin scenes (together with
two base-mounted Kinect2 sensors). A depth-only GPU-
based instance registration approach is used to determine
object poses. Again, motion primitives were chosen in favor
of motion planning. Specialized motion primitives can be
triggered to change the configuration inside the bin when no
picking action can be performed (such as tipping an object
over). Team MIT achieved 88 points in the competition.
In contrast to the APC 2015, the 2016 challenge intro-
duced more difficult objects (e.g. the heavy 3 lb dumbbell),
increased the difficulty in the arrangements, and introduced
the new stowing task.
Our robot consists of a 6-DOF arm, a 2-DOF endeffector,
a camera module, and a suction system.
To limit system complexity, we chose to use a stationary
manipulator. This means the manipulation workspace has
to cover the entire shelf, which places constraints on the
possible robotic arm solutions. In our case, we chose the
UR10 arm from Universal Robots, because it covers the
workspace nicely, is cost-effective, lightweight, and offers
safety features such as an automatic (and reversible) stop
upon contact with the environment.
Attached to the arm is a custom-built endeffector (see
Fig. 2). For reaching into the deep and narrow APC shelf
bins, we use a linear actuator capable of 37 cm extension.
On the tip of the linear extension, we mounted a rotary joint
to be able to carry out both front and top grasps. The rotary
joint is actuated by a pulley mechanism, with the servo motor
residing on the other end of the extension (and thus outside
of the shelf during picking). This means that the cross section
that needs to be considered during motion generation is only
3 cm × 3 cm.
Fig. 2. Endeffector with suction finger and dual camera setup. Labeled components: linear joint, rotary joint, dual RGB-D camera.
Fig. 3. Bleed actuator for suction regulation. Left: CAD model. Right:
Final installation.
For grasping the items, we decided to employ a suction
mechanism. This choice was motivated by the large success
of suction methods during the last APC [10], and also due
to the presented set of objects for the APC 2016, most of
which could be manipulated easily using suction. Our suction
system is designed to generate both high vacuum and high
air flow. The former is needed to lift heavy objects, the latter
for objects on which the suction cup cannot make a perfect
vacuum seal.
Air flow is guided from a suction cup on the tip of the
endeffector through the hollow linear extension, and then
through a flexible hose into the robot base. The vacuum
itself is generated by a 3100 W vacuum cleaner meant for
central installation. For binary on/off control, it offers a
12 V control input. Since it overheats quite easily if the air
flow is completely blocked, we added a “bleed” actuator
(see Fig. 3), which can regulate the amount of air sucked
into an additional intake. By closing the intake, we achieve
maximum suction strength, while complete opening reduces
suction to zero. Air flow is measured by a pitot tube inside
the linear extension. This is used to detect whether an object
was successfully grasped or lost during arm motion.
In summary, our kinematic design allows us to apply
suction on all points of the object hemisphere facing the
robot, control suction power quickly and precisely, and
monitor air flow to recognize success or failure.
For control and computations, two computers are con-
nected to the system. The first one, tasked with high- and
low-level control of the robot, is equipped with an Intel
Core i7-4790K CPU (4 GHz). The second one is used for
vision processing, and contains two Intel Xeon E5-2670 v2
CPUs (2.5 GHz) and four NVIDIA Titan X GPUs. For training, all
four GPUs can be used to reduce training time. At test
time, two GPUs are used in parallel for the two deep learning
approaches (see Section V).
Fig. 4. RGB-D fusion from two sensors: (a) RGB frame, (b) upper depth, (c) lower depth, (d) stereo depth, (e) fused result. Note the corruption in the left wall in the lower depth frame, which is corrected in the fused result.
Fig. 5. Architecture of the object detection pipeline. Adapted from Johnson et al. [13].
After testing multiple sensors in the APC setting, we
settled on the Intel RealSense SR300 RGB-D sensor due
to its low weight, high resolution, and short-range capabilities.
However, we noticed that the depth sensor produced
systematic artifacts on the walls of the shelf. The artifacts
seem to depend on the viewing angle, i.e. they were present
only on the right side of the image. To rectify this situation,
we designed a dual sensor setup, with one of the sensors
rotated by 180° (see Fig. 2).
Using two separate sensors also makes a second RGB
stream available. To exploit this, we also calculate dense
stereo disparity between the two RGB cameras using LIB-
ELAS [14]. The three depth sources are then projected into
a common frame and fused using a majority voting scheme.
Figure 4 shows an exemplary scene with the fused depth
map. The final map is then filled and regularized using a
guided TGV regularizer [15] implemented in CUDA.
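The majority voting step can be illustrated with a minimal sketch. The paper does not give the exact voting rule, so the agreement tolerance, the strict-majority criterion, and the function names `fuse_depth` and `fuse_maps` are assumptions made for illustration only; depth maps are plain nested lists with `None` marking missing measurements.

```python
def fuse_depth(samples, tolerance=0.02):
    """Fuse the depth readings (meters) of one pixel from several sources.

    None marks a missing measurement. The largest group of readings that
    agree within `tolerance` wins if it forms a strict majority of the
    valid readings; its mean is returned, otherwise None.
    """
    valid = [d for d in samples if d is not None]
    if not valid:
        return None
    best = []
    for candidate in valid:
        group = [d for d in valid if abs(d - candidate) <= tolerance]
        if len(group) > len(best):
            best = group
    if 2 * len(best) <= len(valid):
        return None  # no strict majority among the valid readings
    return sum(best) / len(best)


def fuse_maps(upper, lower, stereo, tolerance=0.02):
    """Fuse three depth maps that are already projected into a common frame."""
    return [[fuse_depth(pixel, tolerance) for pixel in zip(*rows)]
            for rows in zip(upper, lower, stereo)]


if __name__ == "__main__":
    upper = [[1.00, None]]
    lower = [[1.01, 2.00]]
    stereo = [[1.50, 2.01]]
    print(fuse_maps(upper, lower, stereo))
```

With three sources, a single corrupted reading (such as the wall artifacts described above) is outvoted by the two remaining sources, while two contradicting readings without a third opinion are discarded.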
For perceiving objects in the shelf or tote, we developed
two independent methods. The first one solves the object
detection problem, i.e. outputs bounding boxes and object
classes for each detection. The second one performs semantic
segmentation, which provides a pixel-wise object classification.
Since training data and time are limited, it is crucial
not to train from scratch. Instead, both methods leverage
convolutional neural networks (CNNs) pre-trained on large
image classification datasets and merely adapt the network
to work in the domain of the APC.
A. Object Detection
We extend an object detection approach based on the
DenseCap network [13]. DenseCap approaches the prob-
lem of dense captioning, i.e. providing detailed textual
descriptions of interesting regions (bounding boxes) in the
input image. Figure 5 shows the general architecture of the
DenseCap network. A large number of proposals from an
integrated region proposal network are sampled to a fixed
number (1000 in our case) using an objectness score network.
Intermediate CNN feature maps are interpolated to fixed size
for each proposal. The proposals are then classified using a
recognition CNN. The underlying CNN was pretrained on
the ImageNet [16] dataset. Afterwards, the entire pipeline
was trained end-to-end on the Visual Genome dataset [17]. In
order to make use of this pretraining, we use a trained model
distributed by the DenseCap authors and either train a custom
classifier or finetune the entire pipeline (see Sections V-A.1
and V-A.2).
Since the region proposals do not make use of depth,
we augment the network-generated proposals with proposals
from a connected components algorithm running on the
RGB and depth frames (see Fig. 6). Two pixels are deemed
connected if they do not differ more than a threshold in terms
of 3D position, normal angle, saturation and color. Final
bounding boxes are extracted from regions which exceed an
area threshold.
Fig. 6. RGB-D based additional region proposals. Left: RGB frame. Center: Regions labeled using the connected components algorithm. Right: Extracted bounding box proposals.
Fig. 7. Our network architecture for semantic object segmentation.
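The connected components proposals can be sketched as a flood fill followed by an area filter. This simplified version compares neighbouring pixels on depth only, whereas the system described here also uses normal angle, saturation and color; the 4-neighbourhood, the thresholds, and the name `region_proposals` are assumptions for illustration.

```python
from collections import deque


def region_proposals(depth, max_step=0.01, min_area=4):
    """Return boxes (x_min, y_min, x_max, y_max) of smooth depth regions.

    Two 4-neighbours are connected if their depth differs by at most
    `max_step`. Regions smaller than `min_area` pixels are dropped.
    """
    height, width = len(depth), len(depth[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            area, x_min, y_min, x_max, y_max = 0, x0, y0, x0, y0
            while queue:
                x, y = queue.popleft()
                area += 1
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx]
                            and abs(depth[ny][nx] - depth[y][x]) <= max_step):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if area >= min_area:
                boxes.append((x_min, y_min, x_max, y_max))
    return boxes


if __name__ == "__main__":
    scene = [
        [1.0, 1.0, 1.0, 1.0, 1.0],
        [1.0, 0.5, 0.5, 1.0, 1.0],
        [1.0, 0.5, 0.5, 1.0, 1.0],
        [1.0, 1.0, 1.0, 1.0, 1.0],
    ]
    print(region_proposals(scene))
```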
While the textual descriptions are not interesting for bin-
picking scenarios, the descriptions are generated from an
intermediate feature vector representation, which is highly
descriptive. To exploit the power of this feature represen-
tation, we use the network without the language model for
feature extraction, and do classification using a linear SVM.
As an alternative, we investigate a soft-max classifier layer,
which allows us to fine-tune the network during training.
1) Linear SVM: In the first case, we remove the language
generation model and replace it with a linear SVM for
classification. We also introduce two primitive features based
on depth: The predicted bounding box is projected into 3D
using the center depth value. The metric area and size are
then concatenated to the CNN features. Since linear SVMs
can be trained very efficiently, the training can happen just-
in-time before actual perception, exploiting the fact that the
set of possible objects in the bin is known. Restricting the
set of classes also has the side-effect that training time and
memory usage are constant with respect to the set of all
objects present in the warehouse.
The SVM is used to classify each predicted bounding
box. To identify a single output, the bounding box with the
maximum SVM response is selected. This ignores duplicate
objects, but since the goal is to retrieve only one object, this
reduction is permissible.
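The two depth-based features and the final selection step can be sketched as follows. The pinhole projection with focal lengths `fx` and `fy` is an assumption (the text only states that the box is projected into 3D using the center depth value), and the function names are hypothetical.

```python
def metric_box_features(box, center_depth, fx, fy):
    """Approximate metric size of an image-space bounding box.

    box is (x_min, y_min, x_max, y_max) in pixels, center_depth is the
    depth (meters) at the box center, fx and fy are the focal lengths in
    pixels. Returns (width_m, height_m, area_m2) under a pinhole model.
    """
    x_min, y_min, x_max, y_max = box
    width_m = (x_max - x_min) * center_depth / fx
    height_m = (y_max - y_min) * center_depth / fy
    return width_m, height_m, width_m * height_m


def select_detection(boxes, scores):
    """Return the box with the maximum classifier response."""
    best = max(range(len(boxes)), key=lambda i: scores[i])
    return boxes[best]


if __name__ == "__main__":
    print(metric_box_features((100, 100, 300, 200), 0.5, 500.0, 500.0))
    print(select_detection([(0, 0, 5, 5), (2, 2, 9, 9)], [0.1, 0.9]))
```

In the pipeline described above, the three metric values would be appended to the CNN feature vector before SVM classification.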
2) Finetuning: For finetuning the network, we use a soft-
max classification layer instead of the SVM. All layers
except the initial CNN layers (see Fig. 5) are optimized. In
contrast to SVM training, the training is performed offline
on all object classes. At test time, all predicted boxes are
classified and the bounding box with the correct class and
highest objectness score is produced as the final output.
B. Semantic Segmentation
Manipulation of real-world objects requires a more precise
localization that goes beyond a bounding box prediction.
Therefore, we also investigated pixel-level segmentation ap-
proaches. To that end, we adapt our previous work [18] to
the scenario at hand. The method employs a 6-layer fully
convolutional neural network (CNN) similar to OverFeat
[19]. The full network architecture is illustrated in Fig. 7.
As a first step, low-level features are extracted from the
captured RGB-D images using a set of filters that was
pretrained on ImageNet [20]. We then finetune the network
to the APC domain by training the last three layers of the network.
C. Combination
During APC, we used a combination of the SVM ob-
ject detection approach and the semantic segmentation. The
bounding boxes predicted by the object detection were
rendered with a logistic estimate of their probability and
averaged. This process produced a “probability map” that
behaved like a pixel-wise posterior. In the end, we simply
multiplied this probability map with the class probabilities
determined in semantic segmentation. A pixel-wise max-
probability decision then resulted in the final segmentation
mask used in the rest of the pipeline.
After APC, we replaced the hard bounding box rendering
with a soft Gaussian, which yielded better results.
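A minimal sketch of the hard-box combination is given below. The exact averaging rule is not specified in the text; here the rendered boxes of each class are averaged over that class's boxes, which is an assumption, as are the data layout and the function names.

```python
import math


def render_detection_map(boxes, height, width):
    """Average the rendered boxes of one class into a probability map.

    boxes is a list of (x_min, y_min, x_max, y_max, svm_score) with
    inclusive pixel bounds. Each box is filled with the logistic
    estimate of its score.
    """
    grid = [[0.0] * width for _ in range(height)]
    for x_min, y_min, x_max, y_max, score in boxes:
        prob = 1.0 / (1.0 + math.exp(-score))  # logistic estimate
        for y in range(max(0, y_min), min(height, y_max + 1)):
            for x in range(max(0, x_min), min(width, x_max + 1)):
                grid[y][x] += prob / len(boxes)
    return grid


def combine(detections, seg_probs, height, width):
    """Multiply detection maps with segmentation probabilities per class
    and take the pixel-wise maximum.

    detections maps class name -> boxes, seg_probs maps class name ->
    2D list of per-pixel probabilities. Returns a 2D list of class names.
    """
    maps = {c: render_detection_map(detections.get(c, []), height, width)
            for c in seg_probs}
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            row.append(max(seg_probs,
                           key=lambda c: maps[c][y][x] * seg_probs[c][y][x]))
        labels.append(row)
    return labels


if __name__ == "__main__":
    seg = {"cup": [[0.6, 0.6, 0.2]], "book": [[0.4, 0.4, 0.8]]}
    det = {"cup": [(0, 0, 1, 0, 2.0)], "book": [(1, 0, 2, 0, 0.0)]}
    print(combine(det, seg, 1, 3))
```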
D. Item Pose Estimation
For certain objects, manipulation in the constrained space
of the shelf is only possible if the 6D object pose is known.
For example, large objects such as the pack of socks can
only be grasped near the center of mass. Other grasps will
result in tilting the object, making it impossible to remove it
in a controlled manner and without collisions.
To that end, we modeled a dense representation of such
objects using the method and implementation by Prankl et al.
[21], where we capture a 360° turntable sequence of point
clouds of the object with the robot’s sensors. We select a
subset of frames that fully captures the object from various
angles. Subsequently, the extrinsic camera parameters are
estimated given the correspondences between frames, which
are finally refined using bundle adjustment. Based on a
manually positioned bounding box, the scene is filtered
to exclude background and false measurements. From the
camera poses and masks of the selected frames, a 3D model
of the object is reconstructed by optimizing correspondences
between frames. Once the object geometry is captured,
we manually attach the desired grasping poses for each
object in turn in order to guarantee a stable grasp. We
experimented with multiple 6D pose estimation methods, and
finally adopted an ICP-based approach, which gave the fastest
and most accurate results in our setting.
Fig. 8. Pose registration. Left column: Turntable capture, resulting model. Center column: Scene, scene with initializations. Right column: Final registered model.
Fig. 9. Heuristic grasp selection. Left: Top grasp on an extension cord. Right: Front grasp on the kleenex tissue box.
From the segmentation mask for the particular object, we
can extract object points from the scene point cloud. As we
know that particular models are likely to be positioned in
few orientations (e.g. standing or lying on the ground), we
can define a set of predefined orientations to initialize the
registration. At the moment of performing registration, we
position the model at the center of mass of the filtered point
cloud and perform Generalized ICP [22] on the predefined
set of orientations and choose the 6D pose with the shortest
Euclidean registration distance between scene and model.
Figure 8 shows a tote scene with the tube socks object.
Note that 6D pose registration was only required for three
objects: the duct tape (little supporting surface for suction), the pack of
tube socks (large), and the paper towel roll (large). All other
APC objects can be grasped using generic grasp positions
directly computed from the segmentation mask, which is
described in Section VI-A.
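The multi-start registration strategy (initialize at the center of mass, run the registration from each predefined orientation, keep the result with the smallest residual) can be illustrated with a deliberately simplified sketch. It is not the Generalized ICP [22] used in the system: it is a planar, point-to-point ICP with a closed-form 2D rotation update and brute-force nearest neighbours, and all names are hypothetical.

```python
import math


def transform(points, angle, tx, ty):
    """Rotate 2D points by angle (radians), then translate by (tx, ty)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]


def centroid(points):
    n = len(points)
    return sum(p[0] for p in points) / n, sum(p[1] for p in points) / n


def icp_2d(model, scene, angle, iterations=20):
    """Planar point-to-point ICP. Returns (angle, tx, ty, rms_error)."""
    mcx, mcy = centroid(model)
    scx, scy = centroid(scene)
    # Start with the rotated model placed on the scene's center of mass.
    c, s = math.cos(angle), math.sin(angle)
    tx = scx - (c * mcx - s * mcy)
    ty = scy - (s * mcx + c * mcy)
    for _ in range(iterations):
        moved = transform(model, angle, tx, ty)
        # Brute-force nearest scene point for every model point.
        matches = [min(scene, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)
                   for p in moved]
        qcx, qcy = centroid(matches)
        # Closed-form rotation aligning the centered model to the matches.
        num = den = 0.0
        for (x, y), (u, v) in zip(model, matches):
            x, y, u, v = x - mcx, y - mcy, u - qcx, v - qcy
            num += x * v - y * u
            den += x * u + y * v
        angle = math.atan2(num, den)
        c, s = math.cos(angle), math.sin(angle)
        tx = qcx - (c * mcx - s * mcy)
        ty = qcy - (s * mcx + c * mcy)
    moved = transform(model, angle, tx, ty)
    mean_sq = sum(min((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2 for q in scene)
                  for p in moved) / len(moved)
    return angle, tx, ty, math.sqrt(mean_sq)


def register(model, scene, initial_angles):
    """Run ICP from every predefined orientation and keep the best fit."""
    return min((icp_2d(model, scene, a) for a in initial_angles),
               key=lambda result: result[3])


if __name__ == "__main__":
    model = [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (0, 2)]
    scene = transform(model, math.radians(170), 5.0, -2.0)
    starts = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]
    print(register(model, scene, starts))
```

The example shows why a small set of predefined orientations helps: a start close to the true orientation yields correct correspondences immediately, whereas a single arbitrary start may converge to a poor local minimum.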
At first glance, the kinematic constraints imposed on
motions in the shelf appear quite severe: The available
space is very narrow, and objects can be partially occluded,
meaning that the robot has to reach around other objects.
A. Heuristic Grasp Selection
For objects which are not registered (see Section V-D),
we select grasps heuristically. Our system supports two basic
grasps: Top grasp and center grasp.
The top grasp is determined using the 3D bounding box of
the object. We select the point belonging to the segmentation
mask, whose projection onto the ground plane is closest to
the projection of the 3D bounding box center. The grasp
height is chosen as the maximum height of object points in
a cylinder around the chosen position. The grasp position is
then refined to the next object point.
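The top grasp heuristic can be sketched in a few lines. The sketch assumes an axis-aligned bounding box, a z axis pointing up, and a hypothetical cylinder radius; "refined to the next object point" is interpreted as snapping to the nearest object point.

```python
def top_grasp(points, radius=0.02):
    """Select a top grasp point from object points (x, y, z), z pointing up."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Ground-plane projection of the (axis-aligned) 3D bounding box center.
    cx, cy = (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2

    def ground_dist2(p, x, y):
        return (p[0] - x) ** 2 + (p[1] - y) ** 2

    # Object point whose ground projection is closest to the box center.
    anchor = min(points, key=lambda p: ground_dist2(p, cx, cy))
    # Grasp height: maximum height inside a cylinder around that position.
    in_cylinder = [p for p in points
                   if ground_dist2(p, anchor[0], anchor[1]) <= radius ** 2]
    height = max(p[2] for p in in_cylinder)
    target = (anchor[0], anchor[1], height)
    # Refine to the nearest actual object point.
    return min(points,
               key=lambda p: sum((a - b) ** 2 for a, b in zip(p, target)))


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 0.10), (0.1, 0.1, 0.10),
             (0.05, 0.05, 0.10), (0.06, 0.05, 0.14)]
    print(top_grasp(cloud))
```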
Center grasps are defined as grasps close to the 2D image-
space bounding box center. Again, the closest object point to
the center is chosen, this time in image space. The surface
normal is estimated using the local neighborhood and used
as grasp direction.
Figure 9 shows center and top grasps on exemplary scenes.
B. Inverse Kinematics
In order to simplify the problem, we focused on an
intelligent inverse kinematics solver first. The solver is driven
by two ideas: First, the suction pose itself is invariant to
rotations around the suction axis, and second, the solver
should resolve the inherent redundancy in the kinematic
chain so as to minimize the chance of collisions with the
environment.
As a basis, we use a selectively damped least squares
(SDLS) solver [23]. We augment it with a null-space optimization
step, which projects the gradient of a secondary
objective $f$ to the null space of the SDLS Jacobian $J$.
We first define a joint-level null space objective $g$:
$$g_i(q) = w_l \max\{0,\; q - (q_i^+ - q_\delta)\}^2 + w_l \min\{0,\; q - (q_i^- + q_\delta)\}^2 + w_c \left(q - q_i^{(c)}\right)^2, \tag{1}$$
where $i$ is the joint index, $q$ is the joint position, $q_i^+$ and
$q_i^-$ are the upper and lower joint limits, $q_\delta$ is a joint limit
threshold, $q_i^{(c)}$ is the “convenient” configuration for this joint,
and $w$ is used to form a linear combination of the costs. As
can be seen, this objective prefers a convenient configuration
and avoids joint limits.
More interestingly in this application, we also specify
Cartesian-space costs using a plane-violation model:
$$h_{\vec n, d}(\vec x) = \left(\max\{0,\; -\vec n \vec x^T + d\}\right)^2, \tag{2}$$
where $\vec n$ and $d$ specify an oriented plane $\vec n \vec x^T - d = 0$, and $\vec x$
is some Cartesian point. This model is used to avoid specified
half-spaces with parts of the robot.
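The plane-violation model of Eq. (2) amounts to a squared hinge on the signed distance to the plane. The sketch below assumes the sign convention that points with $\vec n \vec x^T < d$ lie in the penalized half-space and that $\vec n$ has unit length; the function name is hypothetical.

```python
def plane_violation(normal, d, point):
    """Squared penetration depth of `point` into the half-space n.x < d.

    normal is a unit vector (nx, ny, nz), d the plane offset along it,
    point a Cartesian position (x, y, z). Points on the allowed side of
    the plane cost nothing.
    """
    signed_distance = sum(n * x for n, x in zip(normal, point)) - d
    return max(0.0, -signed_distance) ** 2


if __name__ == "__main__":
    up = (0.0, 0.0, 1.0)
    print(plane_violation(up, 0.1, (0.0, 0.0, 0.3)))  # above the plane
    print(plane_violation(up, 0.1, (0.0, 0.0, 0.0)))  # 0.1 m below the plane
```

Because the cost is zero on the allowed side and grows quadratically with penetration, its gradient is continuous at the plane, which suits the gradient-based null space optimization described here.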
Finally, we obtain the combined costs $f$:
$$f(\vec q, \vec x_l, \vec x_w) = \sum_i g_i(\vec q_i) + h_{\vec n_s, d_s}(\vec x_l) + h_{\vec n_t, d_t}(\vec x_l) + h_{\vec n_b, d_b}(\vec x_w),$$
where $\vec q$ is the vector of joint positions, $\vec x_l$ and $\vec x_w$ are
Cartesian positions of the linear extension and the camera
module, and $\vec n_i, d_i$ describe three half spaces which are
avoided (see Fig. 10). This half space penalization ensures
that we do not enter the shelf with the cameras, that the linear
extension is horizontal during manipulation in the shelf², and
that collisions with the robot base are avoided.
Fig. 10. Penalizing planes in IK solver. The red/blue planes penalize violation by the red/blue spheres, respectively.
Fig. 11. Nullspace-optimizing IK. Left: Front grasp. Right: Side grasp.
One iteration of the solver calculates the update $\delta q$ as
$$\bar{J} = \ldots\, J,$$
$$\delta q = \bar{J}^{+} \Delta x - \alpha N \nabla f(\vec q, \vec x_l, \vec x_w), \tag{6}$$
where $R$ is the target orientation of the endeffector, $P$ is
a projector zeroing the roll component (allowing rotation
around the suction axis), $J$ is the $6 \times n$ kinematic Jacobian
matrix, $N$ is the null space projector of $\bar{J}$, $\Delta x$ is the
remaining 6D pose difference, and $\alpha$ is the step size for
null space optimization.
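The structure of the update in Eq. (6) can be illustrated on a toy problem. The sketch below uses a planar 3-link arm with a 2D position task and an undamped pseudo-inverse in place of the SDLS solver and the 6D task of the real system; link lengths, function names, and the example gradient are assumptions.

```python
import math

LINKS = (0.4, 0.3, 0.2)  # hypothetical link lengths in meters


def jacobian(q):
    """2x3 position Jacobian of a planar 3-link arm."""
    thetas = [sum(q[:k + 1]) for k in range(3)]  # absolute link angles
    J = [[0.0] * 3, [0.0] * 3]
    for j in range(3):
        J[0][j] = -sum(LINKS[k] * math.sin(thetas[k]) for k in range(j, 3))
        J[1][j] = sum(LINKS[k] * math.cos(thetas[k]) for k in range(j, 3))
    return J


def pseudo_inverse(J, damping=0.0):
    """J^T (J J^T + damping^2 I)^-1 for a 2x3 matrix, returned as 3x2."""
    a = sum(v * v for v in J[0]) + damping ** 2
    b = sum(u * v for u, v in zip(J[0], J[1]))
    d = sum(v * v for v in J[1]) + damping ** 2
    det = a * d - b * b
    inv = [[d / det, -b / det], [-b / det, a / det]]
    return [[J[0][i] * inv[0][c] + J[1][i] * inv[1][c] for c in range(2)]
            for i in range(3)]


def ik_step(q, dx, grad_f, alpha=0.1):
    """One update dq = J^+ dx - alpha * N * grad_f with N = I - J^+ J."""
    J = jacobian(q)
    J_pinv = pseudo_inverse(J)
    N = [[(1.0 if i == j else 0.0)
          - sum(J_pinv[i][r] * J[r][j] for r in range(2))
          for j in range(3)] for i in range(3)]
    primary = [sum(J_pinv[i][r] * dx[r] for r in range(2)) for i in range(3)]
    secondary = [sum(N[i][j] * grad_f[j] for j in range(3)) for i in range(3)]
    return [p - alpha * s for p, s in zip(primary, secondary)]


if __name__ == "__main__":
    q = [0.3, 0.5, -0.4]
    print(ik_step(q, [0.01, 0.0], [1.0, -2.0, 0.5]))
```

The projected gradient term changes the joint configuration without moving the tip to first order, which is what allows secondary costs such as the half-space penalties to be optimized while the primary task is still tracked.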
Using this custom IK solver, it is possible to reach difficult
target poses in the shelf and tote without collisions (see
Fig. 11). Note that we used a null-space optimizing solver
before [24], but limited the cost function to joint-space
posture costs. In contrast, the null-space costs are now used
to avoid collisions in task space.
C. Retract planner
For approaching an object, we can follow the camera ray
to the object to get a collision-free trajectory. Retracting with
the object, however, can be more difficult, especially in the
shelf, since other objects might be in front of our target
object.
² This also uses the penalization of linear extension in Eq. (1).
Fig. 12. Retract planning. Left: RGB image of the scene. Center: Front-projected collision mask for retrieval of the black pencil cup. The sippy cup is not fully masked because of missing depth values on the transparent surface. Right: Distance transform.
As gravity keeps the objects on the floor of the shelf bin,
we can always lift the object as high as possible to increase
the chance of collision-free retraction. For simplicity, we
decided to restrict further retract planning to find an optimal
Y coordinate (with the Y axis pointing sideways).
To do this, we first calculate a 2D “skyline” view of the
potential colliding objects (see Fig. 12). After performing
a distance transform, we can easily identify an ideal Y
coordinate with the maximum distance to colliders.
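The Y coordinate selection can be sketched with a brute-force distance transform on a small front-projected collision mask. The mask layout (rows from top to bottom, columns along Y), the choice of the retract row, and the function names are assumptions for illustration; a real implementation would use an efficient distance transform.

```python
import math


def distance_transform(mask):
    """Euclidean distance from each cell to the nearest occupied cell."""
    occupied = [(r, c) for r, row in enumerate(mask)
                for c, value in enumerate(row) if value]
    return [[min((math.hypot(r - o_r, c - o_c) for o_r, o_c in occupied),
                 default=math.inf)
             for c in range(len(mask[0]))]
            for r in range(len(mask))]


def best_retract_column(mask, row):
    """Column (Y index) in `row` with the maximum distance to colliders."""
    dist = distance_transform(mask)
    return max(range(len(mask[0])), key=lambda c: dist[row][c])


if __name__ == "__main__":
    skyline = [
        [0, 0, 0, 0, 0, 0, 0],  # top of the bin, where the lifted object travels
        [1, 0, 0, 0, 0, 0, 1],
        [1, 1, 0, 0, 0, 1, 1],  # floor level with colliding objects
    ]
    print(best_retract_column(skyline, row=0))
```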
D. Parametrized Motion Primitives
For actual motion generation we use our keyframe-based
interpolation system [24]. Each keyframe specifies either
joint- or Cartesian space configurations for parts of the robot.
It also specifies joint and/or Cartesian velocity and acceler-
ation constraints which limit the motion to this keyframe.
Keyframes can be edited in a dedicated 3D GUI for pre-
defined motions such as dropping items into the tote, or
adapted live to perception results, such as grasp motions.
Finally, motions are smoothly interpolated in joint space and
executed on the robot.
Since our motion generation—while very robust—makes
several strong assumptions, it is still possible that unwanted
collisions with the shelf or other objects are generated. In
particular, there is no mechanism that detects whether it is
actually possible to retrieve the target object. For example,
it may be necessary to move other occluding objects before
attempting actual retrieval. In our experience, however, the
combination of the inverse kinematics solver and the retract
planner is sufficient to solve most situations. As a final pre-emptive
measure, we configure the UR10 to stop and notify
the control software whenever the exerted force exceeds a
threshold. The software then releases the stop, executes a
retract primitive, and continues with the next object. Failed
objects are retried at the end of the picking sequence.
A. Amazon Picking Challenge 2016
The system proposed in this work attempted both the
picking and stowing task successfully during the APC 2016.
For stowing, our system stowed eleven out of twelve items
into the shelf.³ However, one of the successfully stowed
items was misrecognized, which meant that the system could
not recognize the final item (a toothbrush). Even though a
fallback mechanism was built in, which would attempt to
recognize all known objects, this method failed due to an
object size threshold. Because of this misrecognition, our
system reached only second place in the stow task.
³ Video at
Table I. Picking run at the APC 2016.
Bin  Item           Pick  Drop  Report
A    duct tape      ×     ×     ×
B    bunny book     ✓     ✓     ×²
C    squeaky eggs   ✓     ×     ✓
D    crayons¹       ✓     ×     ✓
E    coffee         ✓     ✓     ×²
F    hooks          ✓     ×     ✓
G    scissors       ×     ×     ×
H    plush bear     ✓     ×     ✓
I    curtain        ✓     ×     ✓
J    tissue box     ✓     ×     ✓
K    sippy cup      ✓     ×     ✓
L    pencil cup     ✓     ✓     ×²
Sum                 10    3     7
¹ Misrecognized, corrected on second attempt.
² Incorrect report, resulting in penalty.
In the picking task, our system picked ten out of twelve
items.⁴ Despite this high success rate (the winning Team
Delft picked only nine items), we reached only third place
because three items were dropped during picking. While the
drops were detected using the air velocity sensor, the system
incorrectly deduced that the items were still in the shelf,
although they had actually dropped over the ledge and into
the tote. Since the system was required to deliver a report on
the final object locations, the resulting penalties reduced our
score from 152 points to 97 points, just behind the first and
second place with 105 points each.
On the final day of the competition, the teams had the
chance to showcase their system in an open demonstration.
We chose to retry the picking task in a slightly different
configuration, which allowed us to show our ability to handle
the most difficult objects: The pencil cup, which can only be
suctioned on the bottom side, and the dumbbell, which is
quite heavy (3 lb) for suction-based systems. For the former,
we first push it over on the side. The latter is possible using
our powerful vacuum system.
B. Object Detection
Apart from the system-level evaluation at the APC, we
evaluated our perception approaches on our own annotated
dataset, which was also used for training during APC. The
dataset contains 190 shelf frames, and 117 tote frames. The
frames vary in the number of objects and location in the
shelf. As far as we are aware, this number of frames is
quite low in comparison to other teams, which highlights
the effectiveness of our transfer learning approach. Figure 13
shows an exemplary scene from the dataset with object
detection and segmentation results.
⁴ Video at
Fig. 13. Object perception example. Upper row: Input RGB and depth
frames. Lower row: Object detection and semantic segmentation results
(colors are not correlated).
Table II. F1 scores for object perception.
                 Shelf                    Tote
Method           Uninformed  Informed    Uninformed  Informed
SVM (plain)      -           0.654       -           0.623
SVM (tailor)     -           0.661       -           0.617
Finetuned CNN    0.361       0.783       0.469       0.775
Segmentation     0.757       0.787       0.789       0.816
Combination      0.787       0.805       0.813       0.829
SVM (plain): trained on all object classes.
SVM (tailor): trained just-in-time for the objects present in the image.
Combination: Finetuned CNN + Segmentation.
For evaluation, we define a five-fold cross validation split
on the shelf dataset. To see the effect of each design choice,
we evaluate each approach in an informed case (the set of
objects in the bin is known) and in an uninformed case.
For object detection, we calculate area-based precision and
recall from the bounding boxes. For segmentation, pixel-
level precision and recall are calculated. Resulting F1 scores
are shown in Table II. As expected, knowledge of the set
of possible objects improves the performance. Finetuning
the network yields a large gain compared to the SVM
approach. As far as the box-level and pixel-level scores can
be compared, the finetuning object detection approach and
the semantic segmentation approach yield similar results.
Finally, the combination of the finetuned object detector
and the semantic segmentation yields a small but consistent
increase in performance.
Fig. 14. F-Score distribution over the objects for object detection. Results are averaged over the cross validation splits using the finetuned model.
Table III. Runtimes of the perception modules.
        Object detection                        Segmentation
Phase   RGB-D proposal  SVM        Finetuned
Train   -               -          45 min      ~5 h
Test    1006 ms         3342 ms¹   340 ms      ~900 ms
¹ Includes just-in-time SVM training.
Figure 14 gives an impression of the distribution of
difficulty across the objects. We also measured the runtime
of the different modules on our setup (see Table III). Note
that the two perception approaches usually run in parallel.
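For reference, an area-based F1 score for a single predicted box against a single ground-truth box can be computed as sketched below. The definitions used here (precision as overlap area over predicted area, recall as overlap area over ground-truth area) are an assumption about how "area-based" is meant, and the function names are hypothetical.

```python
def box_area(box):
    """Area of a box (x_min, y_min, x_max, y_max); empty boxes have area 0."""
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def f1_score(predicted, truth):
    """Area-based F1 score of a predicted box against a ground-truth box."""
    intersection = (max(predicted[0], truth[0]), max(predicted[1], truth[1]),
                    min(predicted[2], truth[2]), min(predicted[3], truth[3]))
    overlap = box_area(intersection)
    if overlap == 0.0:
        return 0.0
    precision = overlap / box_area(predicted)
    recall = overlap / box_area(truth)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(f1_score((0, 0, 10, 10), (5, 0, 15, 10)))
```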
In this work, we described our system for the APC 2016,
explained design choices, and evaluated the system in the
competition and on our own dataset. As always, the results of
single-trial competitions are very noisy. Teams may fail due
to technical problems, misunderstandings, and pure chance.
During training, we had better runs than in the competition.
Still, our result shows that the components work in isolation
and together under competition conditions.
As this was the first time in our group that deep-learning
techniques were actually used in a live robotic system, this
was a valuable learning opportunity for us. Indeed, only
the tight integration of perception and action made success
possible—as already noted by Eppner et al. [11]. Maybe
for this reason, there are few ready-to-use deep-learning
implementations for robotics contexts. We hope to reduce
this problem with our source-code release which is planned
together with the publication of this paper. Finally, we will
also release our APC dataset annotated with object polygons
and class labels.
[1] D. Buchholz, D. Kubus, I. Weidauer, A. Scholz, and F. M. Wahl,
“Combining visual and inertial features for efficient grasping and
bin-picking,” 2014, pp. 875–882.
[2] M. Nieuwenhuisen, D. Droeschel, D. Holz, J. Stückler, A. Berner,
J. Li, R. Klein, and S. Behnke, “Mobile bin picking with an
anthropomorphic service robot,” in Robotics and Automation (ICRA),
IEEE International Conference on, 2013, pp. 2327–2334.
[3] A. Pretto, S. Tonello, and E. Menegatti, “Flexible 3D localization
of planar objects for industrial bin-picking with monocamera vision
system,” in IEEE International Conference on Automation Science
and Engineering (CASE), 2013, pp. 168–175.
[4] Y. Domae, H. Okuda, Y. Taguchi, K. Sumi, and T. Hirai, “Fast
graspability evaluation on single depth maps for bin picking with
general grippers,” 2014, pp. 1997–2004.
[5] B. Drost, M. Ulrich, N. Navab, and S. Ilic, “Model globally, match
locally: efficient and robust 3d object recognition,” 2010, pp. 998–
[6] A. Berner, J. Li, D. Holz, J. Stückler, S. Behnke, and R. Klein,
“Combining contour and shape primitives for object detection and
pose estimation of prefabricated parts,” in Proceedings of IEEE
International Conference on Image Processing (ICIP), 2013.
[7] C. Martinez, R. Boca, B. Zhang, H. Chen, and S. Nidamarthi,
“Automated bin picking system for randomly located industrial
parts,” in 2015 IEEE International Conference on Technologies for
Practical Robot Applications (TePRA), 2015, pp. 1–6.
[8] K. N. Kaipa, A. S. Kankanhalli-Nagendra, N. B. Kumbla, S.
Shriyam, S. S. Thevendria-Karthic, J. A. Marvel, and S. K. Gupta,
“Addressing perception uncertainty induced failure modes in robotic
bin-picking,” Robotics and Computer-Integrated Manufacturing, vol.
42, pp. 17–38, 2016.
[9] K. Harada, W. Wan, T. Tsuji, K. Kikuchi, K. Nagata, and H. Onda,
“Iterative visual recognition for learning based randomized bin-
picking,” arXiv preprint arXiv:1608.00334, 2016.
[10] N. Correll, K. E. Bekris, D. Berenson, O. Brock, A. Causo, K.
Hauser, K. Okada, A. Rodriguez, J. M. Romano, and P. R. Wur-
man, “Lessons from the amazon picking challenge,” arXiv preprint
arXiv:1601.05484, 2016.
[11] C. Eppner, S. Höfer, R. Jonschkowski, R. Martín, A. Sieverling,
V. Wall, and O. Brock, “Lessons from the amazon picking
challenge: four aspects of building robotic systems,” in Proceedings
of Robotics: Science and Systems, Ann Arbor, Michigan, Jun. 2016.
[Online]. Available: http://www.redaktion.tu-berlin.de/fileadmin/fg170/Publikationen_pdf/apc_rbo_
[12] K.-T. Yu, N. Fazeli, N. Chavan-Dafle, O. Taylor, E. Donlon, G.
D. Lankenau, and A. Rodriguez, “A summary of team MIT’s
approach to the amazon picking challenge 2015,” arXiv preprint
arXiv:1604.03639, 2016.
[13] J. Johnson, A. Karpathy, and L. Fei-Fei, “DenseCap: Fully convolu-
tional localization networks for dense captioning,” in Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition,
[14] A. Geiger, M. Roser, and R. Urtasun, “Efficient large-scale stereo
matching,” in Asian Conference on Computer Vision (ACCV), 2010.
[15] D. Ferstl, C. Reinbacher, R. Ranftl, M. Rüther, and H. Bischof,
“Image guided depth upsampling using anisotropic total generalized
variation,” in Proceedings of the IEEE International Conference on
Computer Vision, 2013, pp. 993–1000.
[16] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., “Imagenet
large scale visual recognition challenge,” International Journal of
Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[17] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S.
Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al., “Visual genome:
connecting language and vision using crowdsourced dense image
annotations,” arXiv preprint arXiv:1602.07332, 2016.
[18] F. Husain, H. Schulz, B. Dellen, C. Torras, and S. Behnke, “Com-
bining semantic and geometric features for object class segmentation
of indoor scenes,” IEEE Robotics and Automation Letters, vol. 2, no.
1, pp. 49–55, May 2016, ISSN: 2377-3766.
[19] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y.
LeCun, “OverFeat: Integrated recognition, localization and detection
using convolutional networks,” CoRR, vol. abs/1312.6229, 2013.
[Online]. Available:
[20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L.
Fei-Fei, ImageNet Large Scale Visual Recognition Challenge. 2014.
[21] J. Prankl, A. Aldoma, A. Svejda, and M. Vincze, “RGB-D object
modelling for object recognition and tracking,” in Intelligent Robots
and Systems (IROS), 2015 IEEE/RSJ International Conference on,
IEEE, 2015, pp. 96–103.
[22] A. Segal, D. Haehnel, and S. Thrun, “Generalized-ICP,” in Robotics:
Science and Systems, vol. 2, 2009.
[23] S. R. Buss and J.-S. Kim, “Selectively damped least squares for
inverse kinematics,” Graphics, GPU, and Game Tools, vol. 10, no.
3, pp. 37–49, 2005.
[24] M. Schwarz, T. Rodehutskors, D. Droeschel, M. Beul, M. Schreiber,
N. Araslanov, I. Ivanov, C. Lenz, J. Razlaw, S. Schüller, D. Schwarz,
A. Topalidou-Kyniazopoulou, and S. Behnke, “NimbRo Rescue:
solving disaster-response tasks through mobile manipulation robot
Momaro,” Journal of Field Robotics (JFR), vol. 34, no. 2, pp. 400–
425, 2017.