Conference Paper

Humanoid Learns to Detect Its Own Hands

  • LYRO Robotics
  • Machine Intelligence Ltd.

Abstract and Figures

Robust object manipulation is still a hard problem in robotics, even more so in high degree-of-freedom (DOF) humanoid robots. To improve performance, a closer integration of visual and motor systems is needed. We herein present a novel method for a robot to learn robust detection of its own hands and fingers, enabling sensorimotor coordination. It does so solely using its own camera images and does not require any external systems or markers. Our system, based on Cartesian Genetic Programming (CGP), evolves programs that perform this image segmentation task in real time on the real hardware. We show results for a Nao and an iCub humanoid, each detecting its own hands and fingers.
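To make the idea of evolving programs with CGP more concrete, here is a minimal sketch of how a CGP genotype can be decoded and mutated. It is not the authors' implementation: the function set, the tuple encoding `(function_id, src_a, src_b)` and the names `evaluate` and `mutate` are assumptions made for illustration, and scalar arithmetic stands in for the image operations a segmentation system would use.

```python
import random

# Hypothetical function set; an image-processing system would use image filters here.
FUNCTIONS = [
    lambda a, b: a + b,      # 0: add
    lambda a, b: a - b,      # 1: subtract
    lambda a, b: a * b,      # 2: multiply
    lambda a, b: max(a, b),  # 3: maximum
]

def evaluate(genotype, output_index, inputs):
    """Evaluate a feed-forward CGP graph.

    Each node is (function_id, src_a, src_b). Source indices address the
    program inputs first, then the outputs of earlier nodes.
    """
    values = list(inputs)
    for func_id, src_a, src_b in genotype:
        values.append(FUNCTIONS[func_id](values[src_a], values[src_b]))
    return values[output_index]

def mutate(genotype, n_inputs, rng, rate=0.2):
    """Point mutation that keeps the graph feed-forward."""
    mutated = []
    for position, (func_id, src_a, src_b) in enumerate(genotype):
        limit = n_inputs + position  # a node may only read inputs or earlier nodes
        if rng.random() < rate:
            func_id = rng.randrange(len(FUNCTIONS))
        if rng.random() < rate:
            src_a = rng.randrange(limit)
        if rng.random() < rate:
            src_b = rng.randrange(limit)
        mutated.append((func_id, src_a, src_b))
    return mutated

# Two nodes: node 2 = in0 + in1, node 3 = node 2 * in0.
program = [(0, 0, 1), (2, 2, 0)]
print(evaluate(program, 3, [3, 4]))  # 21
```

In an evolutionary loop, mutated copies of the best genotype would be scored against hand-labelled training masks and the best-scoring one kept for the next generation.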
... However, for a complete emulation of human numerical cognition, artificial models need to be physically embodied, i.e., instantiated into realistic simulations of the human body that can gesture and interact with the surrounding environment, such as humanoid robots (Lungarella et al., 2003). Some development in this field has been achieved with robots being able to detect their own hands, solely using the embedded cameras (Leitner et al., 2013). ...
Numerical cognition is a fundamental component of human intelligence that is not yet fully understood. Indeed, it is a subject of research in many disciplines, e.g., neuroscience, education, cognitive and developmental psychology, philosophy of mathematics, and linguistics. In Artificial Intelligence, aspects of numerical cognition have been modelled through neural networks to replicate and analytically study children's behaviours. However, artificial models need to incorporate realistic sensory-motor information from the body to fully mimic children's learning behaviours, e.g., the use of fingers to learn and manipulate numbers. To this end, this article presents a database of images, focused on number representation with fingers using both human and robot hands, which can constitute the basis for building new realistic models of numerical cognition in humanoid robots, enabling a grounded learning approach in developmental autonomous agents. The article provides a benchmark analysis of the datasets in the database, which are used to train, validate, and test five state-of-the-art deep neural networks. The networks are compared for classification accuracy, together with an analysis of the computational requirements of each network. The discussion highlights the trade-off between speed and precision in the detection, which is required for realistic applications in robotics.
... As far as we know, there are not many works performing robotic hand segmentation exploiting vision only. A notable exception is [14], where J. Leitner et al. propose a genetic algorithm to detect the hands of two different humanoid robotic models. They also propose two approaches for detection: (i) detect the fingertips of the robots' hands, and (ii) detect the full hand. ...
The ability to distinguish between the self and the background is of paramount importance for robotic tasks. Hands in particular, as the end effectors of a robotic system that most often come into contact with other elements of the environment, must be perceived and tracked with precision to execute the intended tasks with dexterity and without colliding with obstacles. They are fundamental for several applications, from Human-Robot Interaction tasks to object manipulation. Modern humanoid robots are characterized by a high number of degrees of freedom, which makes their forward kinematics models very sensitive to uncertainty. Thus, resorting to vision sensing can be the only solution to endow these robots with a good perception of the self, so that they can localize their body parts with precision. In this paper, we propose the use of a Convolutional Neural Network (CNN) to segment the robot hand from an image in an egocentric view. It is known that CNNs require a huge amount of data to be trained. To overcome the challenge of labeling real-world images, we propose the use of simulated datasets exploiting domain randomization techniques. We fine-tuned the Mask-RCNN network for the specific task of segmenting the hand of the humanoid robot Vizzy. We focus our attention on developing a methodology that requires low amounts of data to achieve reasonable performance while giving detailed insight on how to properly generate variability in the training dataset. Moreover, we analyze the fine-tuning process within the complex model of Mask-RCNN, understanding which weights should be transferred to the new task of segmenting robot hands. Our final model was trained solely on synthetic images and achieves an average IoU of 82% on synthetic validation data and 56.3% on real test data. These results were achieved with only 1000 training images and 3 hours of training time using a single GPU.
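The IoU (Intersection over Union) figures quoted above measure the overlap between a predicted hand mask and the ground-truth mask. As a rough illustration, the sketch below computes it for two binary masks stored as nested lists. The function name `mask_iou` and the convention of returning 1.0 for two empty masks are assumptions, not details taken from the paper.

```python
def mask_iou(pred, truth):
    """Intersection over Union of two equally sized binary masks (nested lists of 0/1)."""
    if len(pred) != len(truth) or any(len(p) != len(t) for p, t in zip(pred, truth)):
        raise ValueError("masks must have the same shape")
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks agree perfectly.
    return intersection / union if union else 1.0

predicted = [[0, 1, 1],
             [0, 1, 0]]
labelled = [[0, 1, 0],
            [0, 1, 1]]
print(mask_iou(predicted, labelled))  # 0.5 (2 shared pixels, 4 pixels in the union)
```

An average IoU over a test set, as reported above, would simply be the mean of this value across all image and mask pairs.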
... Genetic programming applications include the design of logical circuits [47], categorization of data [48], optimization of digital circuits to reduce space in chips [49], evolution of neural networks in various applications [50] and image processing [51][52][53]. ...
... In contrast, the method presented herein, which uses CGP-IP, relies solely on the robot's own camera images and does not require any external systems or markers. These results were first shown at the Congress on Evolutionary Computation [Leitner et al., 2013c]. The approach was first verified by visually detecting the hands of the Nao robot, which are of a less complex design and also simpler in appearance (more details about the Nao detection can be found in Leitner et al. [2013c]). ...
Although robotics research has seen advances over the last decades, robots are still not in widespread use outside industrial applications. Yet a range of proposed scenarios have robots working together, helping and coexisting with humans in daily life. In all of these, a clear need arises to deal with a more unstructured, changing environment. I herein present a system that aims to overcome the limitations of highly complex robotic systems in terms of autonomy and adaptation. The main focus of research is to investigate the use of visual feedback for improving the reaching and grasping capabilities of complex robots. To facilitate this, a combined integration of computer vision and machine learning techniques is employed. From a robot vision point of view, combining domain knowledge from both image processing and machine learning techniques can expand the capabilities of robots. I present a novel framework called Cartesian Genetic Programming for Image Processing (CGP-IP). CGP-IP can be trained to detect objects in the incoming camera streams and has been successfully demonstrated on many different problem domains. The approach is fast, scalable and robust, and requires only very small training sets (it was tested with 5 to 10 images per experiment). Additionally, it can generate human-readable programs that can be further customized and tuned. While CGP-IP is a supervised-learning technique, I show an integration on the iCub that allows for the autonomous learning of object detection and identification. Finally, this dissertation includes two proofs of concept that integrate the motion and action sides. First, reactive reaching and grasping is shown. It allows the robot to avoid obstacles detected in the visual stream while reaching for the intended target object. Furthermore, the integration enables us to use the robot in non-static environments, i.e. the reaching is adapted on the fly from the visual feedback received, e.g. when an obstacle is moved into the trajectory. The second integration highlights the capabilities of these frameworks by improving the visual detection through object manipulation actions.
... A few interesting works in robotics have dealt with the problem of hand detection by using machine learning techniques. The Cartesian Genetic Programming method was used by [25] to learn from visual examples how to detect the robot hand inside an image. Online Multiple Instance Learning was used in [7] for the same task, exploiting proprioceptive information from the arm joints and visual optic flow to automatically label the training images. ...
Humanoid robots have complex kinematic chains whose modeling is error prone. If the robot model is not well calibrated, its hand pose cannot be determined precisely from the encoder readings, and this affects reaching and grasping accuracy. In our work, we propose a novel method to simultaneously i) estimate the pose of the robot hand, and ii) calibrate the robot kinematic model. This is achieved by combining stereo vision, proprioception, and a 3D computer graphics model of the robot. Notably, the use of GPU programming allows the estimation and calibration to be performed in real time during the execution of arm reaching movements. Proprioceptive information is exploited to generate hypotheses about the visual appearance of the hand in the camera images, using the 3D computer graphics model of the robot that includes both kinematic and texture information. These hypotheses are compared with the actual visual input using particle filtering, to obtain both i) the best estimate of the hand pose and ii) a set of joint offsets to calibrate the kinematics of the robot model. We evaluate two different approaches to estimate the 6D pose of the hand from vision (silhouette segmentation and edge extraction) and show experimentally that the pose estimation error is considerably reduced with respect to the nominal robot model. Moreover, the GPU implementation runs about 3 times faster than the CPU one, allowing real-time operation.
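The hypothesize-and-compare loop described above can be illustrated with a deliberately simplified particle filter. The sketch below estimates a single scalar joint offset from noisy scalar observations. In the paper the hypotheses are rendered hand images compared against camera input, so the Gaussian likelihood, the one-dimensional state, the parameter values and the name `estimate_offset` are all stand-in assumptions.

```python
import math
import random

def estimate_offset(observations, n_particles=500, obs_sigma=0.5, jitter=0.02, seed=0):
    """Estimate a constant scalar offset with a basic particle filter.

    Each particle is one hypothesis about the offset. Hypotheses that agree
    with an observation get a higher weight and are more likely to survive
    resampling.
    """
    rng = random.Random(seed)
    # Assumed prior: offset lies somewhere in [-5, 5].
    particles = [rng.uniform(-5.0, 5.0) for _ in range(n_particles)]
    for z in observations:
        # Gaussian likelihood of the observation under each hypothesis.
        weights = [math.exp(-0.5 * ((z - p) / obs_sigma) ** 2) for p in particles]
        if sum(weights) == 0.0:
            weights = [1.0] * n_particles  # fall back to uniform if all underflow
        # Resample in proportion to the weights, then add small jitter for diversity.
        particles = rng.choices(particles, weights=weights, k=n_particles)
        particles = [p + rng.gauss(0.0, jitter) for p in particles]
    return sum(particles) / n_particles

# Synthetic data: true offset 2.0 observed with Gaussian noise.
noise = random.Random(1)
readings = [2.0 + noise.gauss(0.0, 0.3) for _ in range(50)]
print(estimate_offset(readings))
```

A real system would replace the scalar likelihood with an image-based comparison and the scalar state with the full vector of joint offsets. Evaluating many such hypotheses per frame is the part that benefits from a GPU.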
... A few interesting works in robotics have used machine learning techniques to deal with the problem of robot hand detection. Leitner et al. (2013) used the Cartesian Genetic Programming method to learn how to detect the robot hand inside an image from visual examples. Online Multiple Instance Learning was used by Ciliberto et al. (2011) for the same task (detect the robot hand), through the use of proprioceptive information from the arm joints and visual optic flow to automatically label the training images. ...
In this paper, we describe a novel approach to obtain automatic adaptation of the robot body schema and to improve the robot perceptual and motor skills based on this body knowledge. Predictions obtained through a mental simulation of the body are combined with the real sensory feedback to achieve two objectives simultaneously: body schema adaptation and markerless 6D hand pose estimation. The body schema consists of a computer graphics simulation of the robot, which includes the arm and head kinematics (adapted online during the movements) and an appearance model of the hand shape and texture. The mental simulation process generates predictions on how the hand will appear in the robot camera images, based on the body schema and the proprioceptive information (i.e. motor encoders). These predictions are compared to the actual images using Sequential Monte Carlo techniques to feed a particle-based Bayesian estimation method to estimate the parameters of the body schema. The updated body schema will improve the estimates of the 6D hand pose, which is then used in a closed-loop control scheme (i.e. visual servoing), enabling precise reaching. We report experiments with the iCub humanoid robot that support the validity of our approach. A number of simulations with precise ground-truth were performed to evaluate the estimation capabilities of the proposed framework. Then, we show how the use of high-performance GPU programming and an edge-based algorithm for visual perception allow for real-time implementation in real world scenarios.
This short paper describes an approach for collecting a dataset of hand pictures and training a Deep Learning network that could enable the iCub robot to count on its fingers using solely its own cameras. Such a skill, mimicking children’s habits, can support arithmetic learning in a baby robot, an important step in creating artificial intelligence for robots that could learn like children in the context of cognitive developmental robotics. Preliminary results show the approach is promising in terms of accuracy.
We describe our software system enabling a tight integration between vision and control modules on complex, high-DOF humanoid robots. This is demonstrated with the iCub humanoid robot performing visual object detection, reaching and grasping actions. A key capability of this system is reactive avoidance of obstacle objects detected from the video stream while carrying out reach-and-grasp tasks. The subsystems of our architecture can be improved and updated independently; for example, we show that machine learning techniques can improve visual perception by collecting images during the robot’s interaction with the environment. We describe the task and software design constraints that led to the layered modular system architecture.
The island model paradigm makes it possible to efficiently distribute genetic algorithms over multiple processors while introducing a new genetic operator, the migration operator, which can improve the overall algorithmic performance. In this chapter we introduce the generalized island model, which can be applied to a broad class of optimization algorithms. First, we study the effect of such a generalized distribution model on several well-known global optimization metaheuristics. We consider some variants of Differential Evolution, Genetic Algorithms, Harmony Search, Artificial Bee Colony, Particle Swarm Optimization and Simulated Annealing. Based on a set of 12 benchmark problems, we show that in the majority of cases introducing the migration operator leads to better results than using an equivalent multi-start scheme. We then apply the generalized island model to construct heterogeneous "archipelagos", which employ different optimization algorithms on different islands, and show cases where this leads to further improvements in performance with respect to the homogeneous case.
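As a rough illustration of the migration operator, the sketch below runs several independent populations ("islands") on a sphere benchmark and periodically copies each island's best individual to its neighbour. The ring topology, the replace-worst policy, the mutation-only evolution step and all function names are assumptions made for this example, not the scheme from the chapter.

```python
import random

def sphere(x):
    """Benchmark objective to minimize: sum of squares."""
    return sum(v * v for v in x)

def evolve(population, rng, sigma=0.3):
    """One generation: mutate every individual, keep the better half of parents plus children."""
    children = [[v + rng.gauss(0.0, sigma) for v in ind] for ind in population]
    return sorted(population + children, key=sphere)[:len(population)]

def migrate_ring(islands):
    """Copy each island's best individual over the worst individual of the next island."""
    bests = [min(pop, key=sphere) for pop in islands]
    for i, pop in enumerate(islands):
        incoming = bests[(i - 1) % len(islands)]
        worst = max(range(len(pop)), key=lambda j: sphere(pop[j]))
        pop[worst] = list(incoming)

def run_archipelago(n_islands=4, pop_size=10, dim=3, generations=60, interval=10, seed=0):
    rng = random.Random(seed)
    islands = [
        [[rng.uniform(-5.0, 5.0) for _ in range(dim)] for _ in range(pop_size)]
        for _ in range(n_islands)
    ]
    for generation in range(1, generations + 1):
        islands = [evolve(pop, rng) for pop in islands]
        if generation % interval == 0:
            migrate_ring(islands)  # setting interval > generations gives a multi-start baseline
    return min(sphere(ind) for pop in islands for ind in pop)

print(run_archipelago())
```

A heterogeneous archipelago in the sense described above would use a different `evolve` function on each island while keeping the same migration step.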
Combining domain knowledge about both image processing and machine learning techniques can expand the abilities of Genetic Programming when used for image processing. We successfully demonstrate our new approach on several different problem domains. We show that the approach is fast, scalable and robust. In addition, by virtue of using off-the-shelf image processing libraries, we can generate human-readable programs that incorporate sophisticated domain knowledge.
Conference Paper
We present an easy-to-use, modular framework for performing computer vision related tasks in support of cognitive robotics research on the iCub humanoid robot. The aim of this biologically inspired, bottom-up architecture is to facilitate research towards visual perception and cognition processes, especially their influence on robotic object manipulation and environment interaction. The icVision framework described provides capabilities for detecting objects in the 2D image plane and locating those objects in 3D space to facilitate the creation of a world model.
Conference Paper
Understanding the mechanism mediating the change from inaccurate pre-reaching to accurate reaching in infants may confer advantage from both a robotic and biological research perspective. In this work, we present a biologically meaningful learning scheme applied to the coordination between reach and gaze within a robotic structure. The system is model-free and does not utilize a global reference system. The integration of reach and gaze emerges from the learned cross-modal mapping between reach and vision space as it occurs during the robot-environment interaction. The scheme showed high learning speed and plasticity compared with other approaches due to the low level of training data required. We discuss our findings with respect to biological plausibility and from an engineering perspective, with emphasis on autonomous learning as well as strategies for the selection of new training data.
The accuracy and processing time of commercially available 3D camera systems for clinical gait measurement were measured. The tested systems were: Quick MAG, Video Locus, Peak 5, Ariel, Vicon 370, Elite, Kinemetrix 3D, and Optotrack 3020. For the accuracy measurements, the positions of markers on both ends of a rigid bar were measured, and the distance between these markers was calculated and compared to the true value. For the processing time, the time for calculating 3D coordinates from data obtained during normal gait was measured. These values will be useful for intending purchasers of 3D camera systems.
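The rigid-bar accuracy check described above amounts to comparing a measured marker-to-marker distance with the known bar length. A minimal sketch follows, assuming each frame provides two 3D marker positions in the same unit as the true length. The function name `bar_length_errors` and the choice of mean and maximum absolute error as summary statistics are assumptions, not the study's reported metrics.

```python
import math

def bar_length_errors(frames, true_length):
    """Return (mean, max) absolute error of the measured bar length.

    frames: iterable of (marker_a, marker_b) pairs, each marker an (x, y, z) tuple.
    """
    errors = [abs(math.dist(a, b) - true_length) for a, b in frames]
    if not errors:
        raise ValueError("at least one frame is required")
    return sum(errors) / len(errors), max(errors)

# Hypothetical measurements of a bar with a true length of 5.0 units.
frames = [
    ((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)),  # measured length 5.0
    ((0.0, 0.0, 0.0), (0.0, 0.0, 5.2)),  # measured length 5.2
]
mean_error, max_error = bar_length_errors(frames, 5.0)
print(round(mean_error, 3), round(max_error, 3))  # 0.1 0.2
```

Because the bar is rigid, any variation in the measured length across frames reflects measurement error of the camera system rather than motion of the markers relative to each other.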
A new method for automatic construction of image transformations, Genetic Image Network (GIN), is proposed in this paper. We previously proposed the ACTIT system (Automatic Construction of Tree-structural Image Transformation). ACTIT constructs tree-structured image processing filters using Genetic Programming (GP). In general, a network structure theoretically includes tree structures (i.e. a network structure can also represent a tree structure). Thus, the descriptive ability of a network representation is higher than that of a tree structure. In this way, we construct complex image transformations that cannot be constructed with a tree structure. We applied GIN to the automatic construction of image transformations, compared GIN with ACTIT, and show the effectiveness of GIN.