Article

Abstract

This paper presents semantic-based methods for the understanding of human movements in robotic applications. To understand human movements, robots need to first recognize the observed or demonstrated human activities and then learn the parameters needed to execute an action or robot behavior. Achieving this requires addressing several challenges, such as the automatic segmentation of human activities, the identification of the important features of actions, the determination of the correct sequencing between activities, and the correct mapping between continuous data and the symbolic and semantic interpretations of the human movements. This paper aims to present state-of-the-art semantic-based approaches, especially the newly emerging approaches that tackle the challenge of finding generic and compact semantic models for the robotics domain. Finally, we highlight potential breakthroughs and challenges for the coming years, such as achieving scalability, better generalization, compact and flexible models, and higher system accuracy.

... Beyond these flagship domains, semantics have also been investigated in a range of other subdomains. [31] reviewed the use of semantics in the context of understanding human actions and activities, to enable a robot to execute a task. They classified semantics-based methods for recognition into four categories: syntactic methods based on symbols and rules, affordance-based understanding of objects in the environment, graph-based encoding of complex variable relations, and knowledge-based methods. ...
... In the context of robotic vision, Ramirez-Amaro et al. [31] present a recent survey of the most representative approaches using semantic descriptions for recognition of human activities, with the intention of subsequent execution by robots. In such a context it is advantageous to characterize human movement through multiple levels of abstraction (in this particular case four) from high-level processes and tasks to low-level activities and primitive actions. ...
Preprint
Full-text available
For robots to navigate and interact more richly with the world around them, they will likely require a deeper understanding of the world in which they operate. In robotics and related research fields, the study of understanding is often referred to as semantics, which dictates what the world "means" to a robot, and is strongly tied to the question of how to represent that meaning. With humans and robots increasingly operating in the same world, the prospects of human-robot interaction also bring semantics and ontology of natural language into the picture. Driven by need, as well as by enablers like increasing availability of training data and computational resources, semantics is a rapidly growing research area in robotics. The field has received significant attention in the research literature to date, but most reviews and surveys have focused on particular aspects of the topic: the technical research issues regarding its use in specific robotic topics like mapping or segmentation, or its relevance to one particular application domain like autonomous driving. A new treatment is therefore required, and is also timely because so much relevant research has occurred since many of the key surveys were published. This survey therefore provides an overarching snapshot of where semantics in robotics stands today. We establish a taxonomy for semantics research in or relevant to robotics, split into four broad categories of activity in which semantics are extracted, used, or both. Within these broad categories we survey dozens of major topics including fundamentals from the computer vision field and key robotics research areas utilizing semantics, including mapping, navigation and interaction with the world. The survey also covers key practical considerations, including enablers like increased data availability and improved computational hardware, and major application areas where semantics is or is likely to play a key role.
... Beyond these flagship domains, semantics have also been investigated in a range of other subdomains. Ramirez-Amaro et al. [31] reviewed the use of semantics in the context of understanding human actions and activities, to enable a robot to execute a task. They classified semantics-based methods for recognition into four categories: syntactic methods based on symbols and rules, affordance-based understanding of objects in the environment, graph-based encoding of complex variable relations, and knowledge-based methods. ...
... In the context of robotic vision, Ramirez-Amaro et al. [31] present a recent survey of the most representative approaches using semantic descriptions for recognition of human activities, with the intention of subsequent execution by robots. In such a context it is advantageous to characterize human movement through multiple levels of abstraction (in this particular case four) from high-level processes and tasks to low-level activities and primitive actions. ...
Book
For robots to navigate and interact more richly with the world around them, they will likely require a deeper understanding of the world in which they operate. In robotics and related research fields, the study of understanding is often referred to as semantics, which dictates what the world ‘means’ to a robot, and is strongly tied to the question of how to represent that meaning. With humans and robots increasingly operating in the same world, the prospects of human-robot interaction also bring semantics and ontology of natural language into the picture. Driven by need, as well as by enablers like increasing availability of training data and computational resources, semantics is a rapidly growing research area in robotics. The field has received significant attention in the research literature to date, but most reviews and surveys have focused on particular aspects of the topic: the technical research issues regarding its use in specific robotic topics like mapping or segmentation, or its relevance to one particular application domain like autonomous driving. A new treatment is therefore required, and is also timely because so much relevant research has occurred since many of the key surveys were published. This survey provides an overarching snapshot of where semantics in robotics stands today. We establish a taxonomy for semantics research in or relevant to robotics, split into four broad categories of activity in which semantics are extracted, used, or both. Within these broad categories, we survey dozens of major topics including fundamentals from the computer vision field and key robotics research areas utilizing semantics such as mapping, navigation and interaction with the world. The survey also covers key practical considerations, including enablers like increased data availability and improved computational hardware, and major application areas where semantics is or is likely to play a key role. In creating this survey, we hope to provide researchers across academia and industry with a comprehensive reference that helps facilitate future research in this exciting field.
... Beyond these flagship domains, semantics have also been investigated in a range of other subdomains. Ramirez-Amaro et al. [31] reviewed the use of semantics in the context of understanding human actions and activities, to enable a robot to execute a task. They classified semantics-based methods for recognition into four categories: syntactic methods based on symbols and rules, affordance-based understanding of objects in the environment, graph-based encoding of complex variable relations, and knowledge-based methods. ...
... In the context of robotic vision, Ramirez-Amaro et al. [31] present a recent survey of the most representative approaches using semantic descriptions for recognition of human activities, with the intention of subsequent execution by robots. In such a context it is advantageous to characterize human movement through multiple levels of abstraction (in this particular case four) from high-level processes and tasks to low-level activities and primitive actions. ...
... Other approaches interpret natural language through human-robot dialog (Thomason et al., 2015), or utilize additional sensor modalities, such as vision (Sun et al., 2021; Chen et al., 2021). Research has also targeted semantics, both to understand the world and to execute robot actions within it (Ramirez-Amaro et al., 2019). Approaches specific to learning or assigning the semantics of assembly tasks can be found in (Stenmark and Malec, 2014; Savarimuthu et al., 2017). ...
Article
Full-text available
Human-robot collaboration is gaining more and more interest in industrial settings, as collaborative robots are considered safe and robot actions can be programmed easily by, for example, physical interaction. Despite this, robot programming mostly focuses on automated robot motions, and interactive tasks or coordination between human and robot still require additional developments. For example, the selection of which tasks or actions a robot should do next might not be known beforehand or might change at the last moment. Within a human-robot collaborative setting, the coordination of complex shared tasks is therefore better suited to a human, with the robot acting upon requested commands. In this work we explore the utilization of commands to coordinate a shared task between a human and a robot in a shared work space. Based on a known set of higher-level actions (e.g., pick-and-placement, hand-over, kitting) and the commands that trigger them, both a speech-based and a graphical command-based interface are developed to investigate their use. While speech-based interaction might be more intuitive for coordination, background sounds and noise in industrial settings might hinder its capabilities. The graphical command-based interface circumvents this, while still demonstrating the capabilities of coordination. The developed architecture follows a knowledge-based approach, where the actions available to the robot are checked at runtime to determine whether they suit the task and the current state of the world. Experimental results on industrially relevant assembly, kitting and hand-over tasks in a laboratory setting demonstrate that graphical command-based and speech-based coordination with high-level commands is effective for collaboration between a human and a robot.
... Semantic representation of the human action model in a human-robot collaborative environment is essential for agile planning and building digital twins (DT) in a hybrid workplace. Action-level motion description and the recognition of anticipatory actions have drawn on semantic knowledge for robot behavior development [1]. This semantic knowledge may endow collaborative robots with the cognitive capability to assist human workers by learning and adapting to assembly activities. ...
Conference Paper
Semantic representation of motions in a human-robot collaborative environment is essential for agile design and development of digital twins (DT) towards ensuring efficient collaboration between humans and robots in hybrid work systems, e.g., in assembly operations. Dividing activities into actions helps to further conceptualize motion models for predicting what a human intends to do in a hybrid work system. However, it is not straightforward to identify human intentions in collaborative operations for robots to understand and collaborate. This paper presents a concept for semantic representation of human actions and intention prediction using a flexible task ontology interface in the semantic data hub stored in a domain knowledge base. This semantic data hub enables the construction of a DT with corresponding reasoning and simulation algorithms. Furthermore, a knowledge-based DT concept is used to analyze and verify the presented use-case of Human-Robot Collaboration in assembly operations. The preliminary evaluation showed a promising reduction of time for assembly tasks, which identifies the potential to i) improve efficiency reflected by reducing costs and errors and ultimately ii) assist human workers in improving decision making. Thus the contribution of the current work involves a marriage of machine learning, robotics, and ontology engineering into DT to improve human-robot interaction and productivity in a collaborative production environment.
... Kostavelis et al. 20 proposed a method named the Interaction Unit analysis, which modeled human behaviors based on a Dynamic Bayesian Network, and tried to tackle the comprehensive representation of the high-level structure of human behavior for the robot's low-level sensory input. Ramírez-Amaro et al. 21 presented semantic-based methods for understanding human movements in robotic applications. They segmented human activities and tried to identify the important features of actions. ...
Article
Full-text available
Human action segmentation and recognition from a continuous untrimmed sensor data stream is a challenging issue known as temporal action detection. This article provides a two-stream You Only Look Once-based network method, which fuses video and skeleton streams captured by a Kinect sensor, and our data encoding method is used to turn spatiotemporal action detection into a one-dimensional object detection problem in a constantly augmented feature space. The proposed approach extracts spatial–temporal three-dimensional convolutional neural network features from the video stream and view-invariant features from the skeleton stream, respectively. Furthermore, these two streams are encoded into three-dimensional feature spaces, which are represented as red, green, and blue images for subsequent network input. We propose two-stream You Only Look Once-based networks that are capable of fusing video and skeleton information, with a processing pipeline that provides two fusion strategies, boxes-fusion or layers-fusion. We test the temporal action detection performance of the two-stream You Only Look Once network on our data set High-Speed Interplanetary Tug/Cocoon Vehicles-v1, which contains seven activities in the home environment, and achieve a particularly high mean average precision. We also test our model on the public data set PKU-MMD, which contains 51 activities, and our method also performs well on this data set. To prove that our method can work efficiently on robots, we transplanted it to a robotic platform and ran an online fall-down detection experiment.
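As an illustration of the encoding idea described above, the following sketch packs a skeleton stream into an RGB-image-like array so that a 2D detector can treat temporal action detection as object detection. The shapes, normalization, and use of Python/NumPy are assumptions made for illustration, not the authors' exact pipeline.

```python
# Hypothetical sketch: pack a skeleton stream (T frames x J joints x 3 coordinates)
# into an RGB-image-like tensor, so a 2D detector can treat temporal action
# detection as object detection. Shapes and normalization are assumptions.
import numpy as np

def skeleton_to_rgb(skeleton: np.ndarray) -> np.ndarray:
    """skeleton: (T, J, 3) array of x, y, z joint coordinates."""
    lo = skeleton.min(axis=(0, 1), keepdims=True)       # per-channel minimum
    hi = skeleton.max(axis=(0, 1), keepdims=True)       # per-channel maximum
    norm = (skeleton - lo) / np.maximum(hi - lo, 1e-8)  # scale to [0, 1]
    return (norm * 255).astype(np.uint8)                # (T, J, 3): rows=time, cols=joints

demo = np.random.rand(120, 25, 3)   # 120 frames, 25 Kinect-style joints (placeholder data)
image = skeleton_to_rgb(demo)
print(image.shape, image.dtype)     # (120, 25, 3) uint8
```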
... There are few approaches that show promise in this direction. An extensive review of event recognition approaches is given in [6]. Most approaches with task driven goals only consider object localization for manipulation (often with simple structure). ...
Chapter
Full-text available
Assistance systems in production gain increased importance for industry, to tackle the challenges of mass customization and the demographic change. Common to these systems is the need for context awareness and understanding of the state of an assembly process. This paper presents an approach towards Event Recognition in manual assembly settings and introduces concepts to apply this key technology to the application areas of Quality Assurance, Worker Assistance, and Process Teaching.
... As mentioned earlier, to efficiently learn from humans, the robotic system should first understand the intent of the human demonstrations and then map the knowledge to its own embodiment to execute the task. A very comprehensive overview of these two problems is given in [2]. This section focuses on the problem of understanding the human-demonstrated tasks. ...
Article
Full-text available
To maintain productivity and deal with mass customization, manufacturing industries worldwide are increasingly installing collaborative robots. However, the problem of easily interacting with and teaching tasks to a collaborative robot is still unsolved. We present a programming framework that exploits and integrates different programming paradigms, such as Learning-by-demonstration (LbD), Learning-by-interaction (LbI) and Learning-by-programming (LbP), using a semantically enhanced reasoning system. The framework combines their individual advantages and alleviates their drawbacks when learning complex assembly processes (AP). First, the AP are abstracted at task level and the sequence of tasks is learned using the LbD approach. The reasoning engine semantically links the tasks to the AP (allowing knowledge portability). Then, the LbP approach is exploited to program the relevant skills on the robot for task execution. During robot execution, uncertainties are resolved by iterative interaction with the user using a GUI-based LbI approach. The framework is evaluated on a real-world use case and demonstrates promising results.
... There are few approaches that show promise in this direction. An extensive review of event recognition approaches is given in [9]. Most approaches with task driven goals only consider object localization for manipulation (often with simple structure). ...
Chapter
Industrial assistance systems are increasingly used in modern production facilities to support employees, in order to cope with varying assembly processes and increased complexity and to reduce mental and physical stress. However, programming and configuring such systems to provide assistance in a given assembly process context is a challenging task and usually requires extensive programming knowledge. This paper presents an approach that combines human event recognition with a semantic knowledge processing framework in order to enable intuitive programming and configuration. The presented method includes learning of assembly process knowledge based on human demonstration. We demonstrate the applicability of this approach in two use cases: process learning and transfer, and user guidance for manual assembly.
... Approaches that use higher-level features [37,38,39] seem to be less affected by this. In this context, Ramirez-Amaro and coworkers have considered human movement recognition from a semantic point of view [40]. ...
Preprint
Full-text available
Predicting other people's actions is key to successful social interactions, enabling us to adjust our own behavior to the consequences of the others' future actions. Studies on action recognition have focused on the importance of individual visual features of objects involved in an action and its context. Humans, however, recognize actions on unknown objects or even when objects are imagined (pantomime). Other cues must thus compensate for the lack of recognizable visual object features. Here, we focus on the role of inter-object relations that change during an action. We designed a virtual reality setup and tested recognition speed for 10 different manipulation actions on 50 subjects. All objects were abstracted by emulated cubes so the actions could not be inferred using object information. Instead, subjects had to rely only on the information that comes from the changes in the spatial relations that occur between those cubes. In spite of these constraints, our results show the subjects were able to predict actions in, on average, less than 64% of the action's duration. We employed a computational model, an enriched Semantic Event Chain (eSEC), incorporating the information of spatial relations, specifically (a) objects' touching/untouching, (b) static spatial relations between objects and (c) dynamic spatial relations between objects. Trained on the same actions as those observed by subjects, the model successfully predicted actions even better than humans. Information-theoretical analysis shows that eSECs optimally use individual cues, whereas humans presumably mostly rely on a mixed-cue strategy, which takes longer until recognition. Providing a better cognitive basis of action recognition may, on the one hand, improve our understanding of related human pathologies and, on the other hand, help to build robots for conflict-free human-robot cooperation. Our results open new avenues here.
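The following sketch illustrates, under simplifying assumptions, the kind of spatial-relation chain an eSEC builds: touching/not-touching relations between object pairs are derived from object positions, and only the time steps where some relation changes are kept. The distance threshold, object names, and use of Python/NumPy are hypothetical; this is not the authors' implementation.

```python
# Illustrative sketch (not the authors' eSEC code): derive a chain of touching /
# not-touching events between object pairs, keeping only relational changes.
import numpy as np
from itertools import combinations

def relation_chain(positions: dict, touch_dist: float = 0.05):
    """positions: {object_name: (T, 3) array of centroids}; threshold is an assumption."""
    names = sorted(positions)
    T = len(next(iter(positions.values())))
    chain, prev = [], None
    for t in range(T):
        rel = {}
        for a, b in combinations(names, 2):
            d = np.linalg.norm(positions[a][t] - positions[b][t])
            rel[(a, b)] = "T" if d < touch_dist else "N"   # touching / not touching
        if rel != prev:                                    # keep only relational changes
            chain.append((t, dict(rel)))
            prev = rel
    return chain

hand = np.linspace([0.3, 0.0, 0.0], [0.0, 0.0, 0.0], 50)   # hand approaches the cube
cube = np.zeros((50, 3))
for t, rel in relation_chain({"hand": hand, "cube": cube}):
    print(t, rel)
```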
Article
The construction industry is seeking a robotic revolution to meet increasing demands for productivity, quality, and safety. Construction robots are typically pre-programmed for a single task, such as painting, and their behavior is fixed when they leave the factory. However, it is difficult to pre-program all capabilities (referred to as workplace skills) that construction workers may require. Construction robots are expected to have the same ability of skill learning as human apprentices, allowing them to acquire a wide range of workplace skills from experienced workers and eventually complete relevant construction tasks autonomously. However, workplace skill learning of robots has rarely been investigated in the construction industry. This survey reviews state-of-the-art approaches that help robots learn skills from human demonstrations. To begin, the workplace skill is represented in terms of ‘Know That’ and ‘Know How’ problems. ‘Know That’ is a high-level task planning ability aimed at understanding human activities from demonstrations. ‘Know How’ refers to the ability to learn specific actions for completing the construction task. Semantic methods and learning from demonstration (LfD) methods are reviewed to tackle these two problems. Finally, we discuss the open issues of past research, present future directions, and highlight the survey’s knowledge contributions. We believe that this survey will provide a new perspective on robots in the construction industry and inspire more discussions about skill learning of construction robots.
Article
Full-text available
Human-robot collaborative assemblies, where humans and robots work together, are becoming the new frontier in industrial robotics. However, enabling humans to easily and intuitively interact and collaborate with a robot, especially for teaching new tasks, is still an open problem. In this work, we present a semantic knowledge-based reasoning framework that first learns to recognize human activities using perception data (action recognition and object tracking) and semantically links them to an assembly process, thus deriving the intention of the human interaction in the teaching process. The framework then extracts relevant parameters for the robot action execution using an interactive skill-based programming approach. To resolve ambiguities during the teaching process, the reasoning framework initiates an interaction at a (semantically) abstract level with the user by exploiting previous knowledge and the current environmental setup. The reasoning framework is demonstrated in two different application scenarios involving human-robot collaborative teaching and a user guidance system, respectively.
Article
Full-text available
Reasoning about the meanings of human activities is a powerful way for robots to learn from humans.
Conference Paper
Full-text available
We build upon the functional object-oriented network (FOON), a structured knowledge representation which is constructed from observations of human activities and manipulations. A FOON can be used for representing object-motion affordances. Knowledge retrieval through graph search allows us to obtain novel manipulation sequences using knowledge spanning across many video sources, hence the novelty in our approach. However, we are limited to the sources collected. To further improve the performance of knowledge retrieval as a follow-up to our previous work, we discuss generalizing knowledge to be applied to objects which are similar to what we have in FOON without manually annotating new sources of knowledge. We discuss two means of generalization: 1) expanding our network through the use of object similarity to create new functional units from those we already have, and 2) compressing the functional units by object categories rather than specific objects. We discuss experiments which compare the performance of our knowledge retrieval algorithm with both expansion and compression by categories.
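A minimal sketch of the FOON-style retrieval idea: functional units map input objects through a motion to output objects, and a simple search chains units until the goal object can be produced from the objects currently available. The unit definitions below are invented for illustration and are not taken from the FOON dataset.

```python
# Minimal sketch of FOON-style retrieval with invented functional units.
from collections import namedtuple

Unit = namedtuple("Unit", "inputs motion outputs")
units = [
    Unit({"bread", "knife"}, "slice", {"bread_slice"}),
    Unit({"bread_slice", "butter", "knife"}, "spread", {"buttered_slice"}),
]

def retrieve(available: set, goal: str, units):
    """Chain applicable units until the goal object is producible, or give up."""
    state, plan = set(available), []
    changed = True
    while goal not in state and changed:
        changed = False
        for u in units:
            if u.inputs <= state and not u.outputs <= state:   # applicable and useful
                plan.append(u.motion)
                state |= u.outputs
                changed = True
    return plan if goal in state else None

print(retrieve({"bread", "knife", "butter"}, "buttered_slice", units))
# -> ['slice', 'spread']
```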
Article
Full-text available
We present a framework that allows a robot manipulator to learn how to execute structured tasks from human demonstrations. The proposed system combines physical human–robot interaction with attentional supervision in order to support kinesthetic teaching, incremental learning, and cooperative execution of hierarchically structured tasks. In the proposed framework, the human demonstration is automatically segmented into basic movements, which are related to a task structure by an attentional system that supervises the overall interaction. The attentional system makes it possible to track the human demonstration at different levels of abstraction and supports implicit non-verbal communication during both the teaching and the execution phase. Attention manipulation mechanisms (e.g. object and verbal cueing) can be exploited by the teacher to facilitate the learning process. On the other hand, the attentional system enables flexible and cooperative task execution. The paper describes the overall system architecture and details how cooperative tasks are learned and executed. The proposed approach is evaluated in a human–robot co-working scenario, showing that the robot is effectively able to rapidly learn and flexibly execute structured tasks.
Article
Full-text available
We propose an approach for instructing a robot using natural language to solve complex tasks in a dynamic environment. In this study, we elaborate on a framework that allows a humanoid robot to understand natural language, derive symbolic representations of its sensorimotor experience, generate complex plans according to the current world state, and monitor plan execution. The presented development supports replacing missing objects and suggesting possible object locations. It is a realization of the concept of structural bootstrapping developed in the context of the European project Xperience. The framework is implemented within the robot development environment ArmarX. We evaluate the framework on the humanoid robot ARMAR-III in the context of two experiments: a demonstration of the real execution of a complex task in the kitchen environment on ARMAR-III and an experiment with untrained users in a simulation environment.
Article
Full-text available
Data sets are crucial not only for model learning and evaluation but also for advancing knowledge on human behavior, thus fostering mutual inspiration between neuroscience and robotics. However, choosing the right data set to use or creating a new data set is not an easy task, because of the variety of data that can be found in the related literature. The first step to tackle this issue is to collect and organize those that are available. In this work, we take a significant step forward by reviewing data sets that were published in the past 10 years and that are directly related to object manipulation and grasping. We report on modalities, activities, and annotations for each individual data set and we discuss our view on its use for object manipulation. We also compare the data sets and summarize them. Finally, we conclude the survey by providing suggestions and discussing the best practices for the creation of new data sets.
Conference Paper
Full-text available
In this work we present a novel method that generates compact semantic models for inferring human coordinated activities, including tasks that require an understanding of dual-arm sequencing. These models are robust and invariant to observations of different execution styles of the same activity. Additionally, the obtained semantic representations are able to re-use the acquired knowledge to infer different types of activities. Furthermore, our method is capable of inferring dual-arm co-manipulation activities and it considers the correct synchronization between the inferred activities to achieve the desired common goal. We propose a system that, rather than focusing on the different execution styles, extracts the meaning of the observed task by means of semantic representations. The proposed method is a hierarchical approach that first extracts the relevant information from the observations. Then, it infers the observed human activities based on the obtained semantic representations. After that, these inferred activities can be used to trigger motion primitives in a robot to execute the demonstrated task. In order to validate the portability of our system, we have evaluated our semantic-based method on two different humanoid platforms, the iCub robot and the REEM-C robot, demonstrating that our system is capable of correctly segmenting and inferring the observed activities on-line with an average accuracy of 84.8%.
Article
Full-text available
Teaching robots object manipulation skills is a complex task that involves multimodal perception and knowledge about processing the sensor data. In this paper, we show a concept for humanoid robots in household environments with a variety of related objects and actions. Following the paradigms of Programming by Demonstration (PbD), we provide a flexible approach that enables a robot to adaptively reproduce an action sequence demonstrated by a human. The obtained human motion data with involved objects is segmented into semantically conclusive sub-actions by detecting relations between the objects and the human actor. Matching actions are chosen from a library of Object-Action Complexes (OACs) using the preconditions and effects of each sub-action. The resulting sequence of OACs is parameterized for execution on a humanoid robot depending on the observed action sequence and on the state of the environment during execution. The feasibility of this approach is shown in an exemplary kitchen scenario, where the robot has to prepare a dough.
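The following sketch illustrates the precondition/effect matching that OAC-based action selection implies, using a hand-made library and a symbolic state; the OAC names and predicates are hypothetical, not the library of the cited work.

```python
# Hedged sketch of selecting an Object-Action Complex (OAC) by matching symbolic
# preconditions against the observed state and applying its effects.
OACS = [
    {"name": "grasp_bowl", "pre": {"hand_empty", "bowl_on_table"},
     "add": {"bowl_in_hand"}, "del": {"hand_empty", "bowl_on_table"}},
    {"name": "pour_flour", "pre": {"bowl_in_hand", "flour_open"},
     "add": {"flour_in_bowl"}, "del": set()},
]

def next_oac(state: set):
    """Return the first OAC whose preconditions hold, and the resulting state."""
    for oac in OACS:
        if oac["pre"] <= state:
            return oac["name"], (state - oac["del"]) | oac["add"]
    return None, state

state = {"hand_empty", "bowl_on_table", "flour_open"}
name, state = next_oac(state)
print(name, sorted(state))   # grasp_bowl, with the state updated by its effects
```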
Article
Full-text available
In this study, we present a framework that infers human activities from observations using semantic representations. The proposed framework can be utilized to address the difficult and challenging problem of transferring tasks and skills to humanoid robots. We propose a method that allows robots to obtain and determine a higher-level understanding of a demonstrator’s behavior via semantic representations. This abstraction from observations captures the “essence” of the activity, thereby indicating which aspect of the demonstrator’s actions should be executed in order to accomplish the required activity. Thus, a meaningful semantic description is obtained in terms of human motions and object properties. In addition, we validated the semantic rules obtained in different conditions, i.e., three different and complex kitchen activities: 1) making a pancake; 2) making a sandwich; and 3) setting the table. We present quantitative and qualitative results, which demonstrate that without any further training, our system can deal with time restrictions, different execution styles of the same task by several participants, and different labeling strategies. This means that the rules obtained from one scenario are still valid even for new situations, which demonstrates that the inferred representations do not depend on the task performed. The results show that our system correctly recognized human behaviors in real-time in around 87.44% of cases, which was even better than a random participant recognizing the behaviors of another human (about 76.68%). In particular, the semantic rules acquired can be used to effectively improve the dynamic growth of the ontology-based knowledge representation. Hence, this method can be used flexibly across different demonstrations and constraints to infer and achieve a similar goal to that observed. Furthermore, the inference capability introduced in this study was integrated into a joint space control loop for a humanoid robot, an iCub, for achieving similar goals to the human demonstrator online.
Conference Paper
Full-text available
Activity recognition (AR) systems are typically built to recognize a predefined set of common activities. However, these systems need to be able to learn new activities to adapt to a user's needs. Learning new activities is especially challenging in practical scenarios when a user provides only a few annotations for training an AR model. In this work, we study the problem of recognizing new activities with a limited amount of labeled training data. Due to the shortage of labeled data, small variations of the new activity will not be detected resulting in a significant degradation of the system's recall. We propose the FEAT (Feature-based and Attribute-based learning) approach, which leverages the relationship between existing and new activities to compensate for the shortage of the labeled data. We evaluate FEAT on three public datasets and demonstrate that it outperforms traditional AR approaches in recognizing new activities, especially when only a few training instances are available.
Conference Paper
Full-text available
In this paper we present a formal computational framework for modeling manipulation actions. The introduced formalism leads to semantics of manipulation action and has applications to both observing and understanding human manipulation actions as well as executing them with a robotic mechanism (e.g. a humanoid robot). It is based on a Combinatory Categorial Grammar. The goal of the introduced framework is to: (1) represent manipulation actions with both syntax and semantic parts, where the semantic part employs λ-calculus; (2) enable a probabilistic semantic parsing schema to learn the lambda-calculus representation of manipulation action from an annotated action corpus of videos; (3) use (1) and (2) to develop a system that visually observes manipulation actions and understands their meaning while it can reason beyond observations using propositional logic and axiom schemata. The experiments conducted on a publicly available large manipulation action dataset validate the theoretical framework and our implementation.
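A toy sketch of the λ-calculus flavor of such a semantics: the lexicon assigns the verb a curried function that consumes a tool and then an object, and the parse is plain functional application. The lexicon and combination rule are simplified assumptions, not the paper's CCG parser.

```python
# Toy sketch of composing a lambda-calculus-style semantics for a manipulation
# action; the lexicon and combination rule are invented for illustration.
lexicon = {
    "cut":   lambda tool: lambda obj: ("cut", tool, obj),   # verb expects tool, then object
    "knife": "knife",
    "bread": "bread",
}

def parse(tokens):
    """Curried application: the verb meaning consumes the tool, then the object."""
    verb, tool, obj = (lexicon[t] for t in tokens)
    return verb(tool)(obj)

print(parse(["cut", "knife", "bread"]))   # ('cut', 'knife', 'bread')
```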
Conference Paper
Full-text available
We present a large-scale whole-body human motion database consisting of captured raw motion data as well as the corresponding post-processed motions. This database serves as a key element for a wide variety of research questions related, e.g., to human motion analysis, imitation learning, action recognition and motion generation in robotics. In contrast to previous approaches, the motion data in our database considers the motions of the observed human subject as well as the objects with which the subject is interacting. The information about human-object relations is crucial for the proper understanding of human actions and their goal-directed reproduction on a robot. To facilitate the creation and processing of human motion data, we propose procedures and techniques for the capture, labeling and organization of motion capture data based on a Motion Description Tree, as well as for the normalization of human motion to a unified representation based on a reference model of the human body. We provide software tools and interfaces to the database allowing access and efficient search with the proposed motion representation.
Conference Paper
Full-text available
In order to advance action generation and creation in robots beyond simple learned schemas we need computational tools that allow us to automatically interpret and represent human actions. This paper presents a system that learns manipulation action plans by processing unconstrained videos from the World Wide Web. Its goal is to robustly generate the sequence of atomic actions of longer actions seen in video in order to acquire knowledge for robots. The lower level of the system consists of two convolutional neural network (CNN) based recognition modules, one for classifying the hand grasp type and the other for object recognition. The higher level is a probabilistic manipulation action grammar-based parsing module that aims at generating visual sentences for robot manipulation. Experiments conducted on a publicly available unconstrained video dataset show that the system is able to learn manipulation actions by "watching" unconstrained videos with high accuracy.
Conference Paper
Full-text available
Dynamic Movement Primitives (DMPs) are a common method for learning a control policy for a task from demonstration. This control policy consists of differential equations that can create a smooth trajectory to a new goal point. However, DMPs only have a limited ability to generalize the demonstration to new environments and solve problems such as obstacle avoidance. Moreover, standard DMP learning does not cope with the noise inherent to human demonstrations. Here, we propose an approach for robot learning from demonstration that can generalize noisy task demonstrations to a new goal point and to an environment with obstacles. This strategy for robot learning from demonstration results in a control policy that incorporates different types of learning from demonstration, which correspond to different types of observational learning as outlined in developmental psychology.
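For readers unfamiliar with DMPs, the following is a minimal one-dimensional rollout of the standard transformation/canonical system; the gains and forcing-term weights are placeholders (zero weights here, which yields a plain smooth reach), not values learned from a demonstration.

```python
# Minimal discrete DMP rollout (one dimension). Gains and weights are placeholders.
import numpy as np

def dmp_rollout(x0, g, w, K=100.0, D=20.0, alpha=4.0, tau=1.0, dt=0.01, T=1.0):
    n = len(w)
    centers = np.exp(-alpha * np.linspace(0, T, n))          # basis centers in phase space
    x, v, s, traj = x0, 0.0, 1.0, []
    for _ in range(int(T / dt)):
        psi = np.exp(-50.0 * (s - centers) ** 2)              # Gaussian basis activations
        f = (psi @ w) / (psi.sum() + 1e-10) * s * (g - x0)    # forcing term, fades with s
        dv = (K * (g - x) - D * v + f) / tau                  # transformation system
        v, x = v + dv * dt, x + v / tau * dt
        s += -alpha * s / tau * dt                            # canonical (phase) system
        traj.append(x)
    return np.array(traj)

traj = dmp_rollout(x0=0.0, g=1.0, w=np.zeros(10))             # zero weights -> smooth reach
print(round(traj[-1], 3))                                     # converges near the goal 1.0
```

Learning from a noisy demonstration would fit the weights w to the demonstrated forcing term; changing g then reuses the same weights for a new goal, which is the generalization property the abstract refers to.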
Article
Full-text available
Activity recognition has shown impressive progress in recent years. However, the challenges of detecting fine-grained activities and understanding how they are combined into composite activities have been largely overlooked. In this work we approach both tasks and present a dataset which provides detailed annotations to address them. The first challenge is to detect fine-grained activities, which are defined by low inter-class variability and are typically characterized by fine-grained body motions. We explore how human pose and hands can help to approach this challenge by comparing two pose-based and two hand-centric features with state-of-the-art holistic features. To attack the second challenge, recognizing composite activities, we leverage the fact that these activities are compositional and that the essential components of the activities can be obtained from textual descriptions or scripts. We show the benefits of our hand-centric approach for fine-grained activity classification and detection. For composite activity recognition we find that decomposition into attributes allows sharing information across composites and is essential to attack this hard task. Using script data we can recognize novel composites without having training data for them.
Article
Full-text available
This paper describes the architecture of a cognitive system that interprets human manipulation actions from perceptual information (image and depth data) and that includes interacting modules for perception and reasoning. Our work contributes to two core problems at the heart of action understanding: (a) the grounding of relevant information about actions in perception (the perception-action integration problem), and (b) the organization of perceptual and high-level symbolic information for interpreting the actions (the sequencing problem). At the high level, actions are represented with the Manipulation Action Grammar, a context-free grammar that organizes actions as a sequence of sub events. Each sub event is described by the hand, movements, objects and tools involved, and the relevant information about these factors is obtained from biologically-inspired perception modules. These modules track the hands and objects, and they recognize the hand grasp, objects and actions using attention, segmentation, and feature description. Experiments on a new data set of manipulation actions show that our system extracts the relevant visual information and semantic representation. This representation could further be used by the cognitive agent for reasoning, prediction, and planning.
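A small, invented context-free grammar in the spirit of the Manipulation Action Grammar, expanding an action phrase into hand, movement, and object constituents; the rules and vocabulary are illustrative only, not the grammar used in the paper.

```python
# Illustrative context-free grammar for "visual sentences"; rules are invented.
import random

GRAMMAR = {
    "AP": [["HP", "A", "OP"]],             # action phrase -> hand, action, object phrase
    "HP": [["hand"]],
    "A":  [["grasp"], ["cut"], ["pour"]],
    "OP": [["O"], ["O", "with", "O"]],     # object, optionally with a tool
    "O":  [["knife"], ["bread"], ["cup"]],
}

def expand(symbol):
    if symbol not in GRAMMAR:              # terminal symbol
        return [symbol]
    production = random.choice(GRAMMAR[symbol])
    return [tok for s in production for tok in expand(s)]

print(" ".join(expand("AP")))              # prints one randomly sampled action phrase
```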
Conference Paper
Full-text available
Advancements in Virtual Reality have enabled well-defined and consistent virtual environments that can capture complex scenarios, such as human everyday activities. Additionally, virtual simulators (such as SIGVerse) are designed to be user-friendly mechanisms between virtual robots/agents and real users, allowing better interaction. We envision that such rich scenarios can be used to train robots to learn new behaviors, especially in human everyday activities where a diverse variability can be found. In this paper, we present a multi-level framework that is capable of using different input sources, such as cameras and virtual environments, to understand and execute the demonstrated activities. Our framework first obtains the semantic models of human activities from cameras, which are later tested using the SIGVerse virtual simulator to show new complex activities (such as cleaning the table) using a virtual robot. Our framework is integrated on a real robot, an iCub, which is capable of processing the signals from the virtual environment to then understand the activities performed by the observed robot. This was realized through the use of previous knowledge and experiences that the robot has learned from observing human activities. Our results show that our framework was able to extract the meaning of the observed motions with 80% recognition accuracy, by obtaining the object relationships given the current context via semantic representations, and to extract a high-level understanding of those complex activities even when they represent different behaviors.
Conference Paper
Full-text available
Automatically segmenting and recognizing human activities from observations typically requires a very complex and sophisticated perception algorithm. Such systems are unlikely to be implemented on-line in a physical system, such as a robot, due to the pre-processing step(s) that those vision systems usually demand. In this work, we demonstrate that, with an appropriate semantic representation of the activity, and without such complex perception systems, it is possible to infer human activities from videos. First, we present a method to extract the semantic rules based on three simple hand motions, i.e. move, not move and tool use. Additionally, the information on the object properties, either ObjectActedOn or ObjectInHand, is used. Such properties encapsulate the information of the current context. The above data is used to train a decision tree to obtain the semantic rules employed by a reasoning engine. This means we extract lower-level information from videos and reason about the intended human behaviors (high-level). The advantage of the abstract representation is that it allows more generic models of human behaviors to be obtained, even when the information comes from different scenarios. The results show that our system correctly segments and recognizes human behaviors with an accuracy of 85%. Another important aspect of our system is its scalability and adaptability toward new activities, which can be learned on-demand. Our system has been fully implemented on a humanoid robot, the iCub, to experimentally validate its performance and robustness during on-line execution.
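A hedged sketch of the rule-learning step described above: a decision tree is trained on the hand motion (move / not move / tool use) plus the ObjectActedOn / ObjectInHand properties to predict an activity label. The feature encoding, labels, and use of scikit-learn are assumptions for illustration, not the authors' trained tree.

```python
# Hypothetical rule-learning sketch: hand motion + object properties -> activity label.
from sklearn.tree import DecisionTreeClassifier, export_text

# columns: [motion (0=not_move, 1=move, 2=tool_use), ObjectActedOn, ObjectInHand]
X = [[1, 1, 0], [1, 0, 1], [2, 0, 1], [0, 0, 0], [1, 1, 1]]
y = ["reach", "take", "cut", "idle", "put"]          # illustrative labels

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=["motion", "acted_on", "in_hand"]))  # readable rules
print(tree.predict([[2, 0, 1]]))                     # tool use + object in hand -> ['cut']
```

The printed tree is exactly the kind of human-readable if-then rule set that a reasoning engine can then apply on-line without the heavy perception pipeline.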
Article
This paper presents the results of the Artificial Intelligence (AI) method developed during the European project “Factory-in-a-day”. Advanced AI solutions, such as the one proposed, allow natural human-robot collaboration, which is an important capability of robots in industrial warehouses. This new generation of robots is expected to work in heterogeneous production lines by efficiently interacting and collaborating with human co-workers in open and unstructured dynamic environments. For this, robots need to understand and recognize the demonstrations from different operators. Therefore, a flexible and modular process to program industrial robots has been developed based on semantic representations. This novel learning-by-demonstration method enables non-expert operators to program new tasks on industrial robots.
Article
The automation of production lines in industrial scenarios implies solving different problems such as the flexibility to deploy robotic solutions to different production lines; usability to allow non-robotics expert users to teach robots different tasks; and safety to enable operators to physically interact with robots without the need of fences. In this paper, we present a system that integrates three novel technologies to address the above mentioned problems. We use an auto-calibrated multi-modal robot skin, a general robot control framework to generate dynamic behaviors fusing multiple sensor signals, and an intuitive and fast teaching by demonstration method based on semantic reasoning. We validate the proposed technologies with a wheeled humanoid robot in an industrial set-up. The benefits of our system are the transferability of the learned tasks to different robots, the re-usability of the models when new objects are introduced in the production line, the capability of detecting and recovering from errors, and the reliable detection of collisions and pre-collisions to provide a fast reactive robot which improves the physical human-robot interaction.
Article
Imitation learning is a powerful paradigm for robot skill acquisition. However, obtaining demonstrations suitable for learning a policy that maps from raw pixels to actions can be challenging. In this paper we describe how consumer-grade Virtual Reality headsets and hand tracking hardware can be used to naturally teleoperate robots to perform complex tasks. We also describe how imitation learning can learn deep neural network policies (mapping from pixels to actions) that can acquire the demonstrated skills. Our experiments showcase the effectiveness of our approach for learning visuomotor skills.
Article
Neuroscience studies have shown that incorporating the gaze view with a third-person perspective greatly helps to correctly infer human behaviors. Given the importance of both first- and third-person observations for the recognition of human behaviors, we propose a method that incorporates these observations in a technical system to enhance the recognition of human behaviors, thus improving on third-person-only observations for a more robust human activity recognition system. First, we present the extension of our proposed semantic reasoning method by including gaze data and external observations as inputs to segment and infer human behaviors in complex real-world scenarios. Then, from the obtained results we demonstrate that the combination of gaze and external input sources greatly enhances the recognition of human behaviors. Our findings have been applied to a humanoid robot to segment and recognize the observed human activities online with better accuracy when using both input sources; for example, the activity recognition increases from 77% to 82% on our proposed pancake-making dataset. For completeness, we have evaluated our approach on another dataset with a similar setup to the one proposed in this work, the CMU-MMAC dataset. In this case, we improved the recognition of the activities in the egg-scrambling scenario from 54% to 86% by combining the external views with the gaze information, thus showing the benefit of incorporating gaze information to infer human behaviors across different datasets.
Article
We present a three-level cognitive system in a learning by demonstration context. The system allows for learning and transfer on the sensorimotor level as well as the planning level. The fundamentally different data structures associated with these two levels are connected by an efficient mid-level representation based on so-called "semantic event chains." We describe details of the representations and quantify the effect of the associated learning procedures for each level under different amounts of noise. Moreover, we demonstrate the performance of the overall system by three demonstrations that have been performed at a project review. The described system has a technical readiness level (TRL) of 4, which in an ongoing follow-up project will be raised to TRL 6.
Article
For collaborative robots to become useful, end users who are not robotics experts must be able to instruct them to perform a variety of tasks. With this goal in mind, we developed a system for end-user creation of robust task plans with a broad range of capabilities. CoSTAR: the Collaborative System for Task Automation and Recognition is our winning entry in the 2016 KUKA Innovation Award competition at the Hannover Messe trade show, which this year focused on Flexible Manufacturing. CoSTAR is unique in how it creates natural abstractions that use perception to represent the world in a way users can both understand and utilize to author capable and robust task plans. Our Behavior Tree-based task editor integrates high-level information from known object segmentation and pose estimation with spatial reasoning and robot actions to create robust task plans. We describe the cross-platform design and implementation of this system on multiple industrial robots and evaluate its suitability for a wide variety of use cases.
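The following is a tiny behavior-tree sketch (Sequence/Fallback/Condition/Action nodes) of the general kind used by such task editors; the node classes and the example task are invented for illustration and are unrelated to the CoSTAR code base.

```python
# Minimal behavior-tree sketch; node names and the task are hypothetical.
class Sequence:
    def __init__(self, *children): self.children = children
    def tick(self, world):
        return all(child.tick(world) for child in self.children)   # stops at first failure

class Fallback:
    def __init__(self, *children): self.children = children
    def tick(self, world):
        return any(child.tick(world) for child in self.children)   # stops at first success

class Condition:
    def __init__(self, key): self.key = key
    def tick(self, world): return bool(world.get(self.key))

class Action:
    def __init__(self, name, effect): self.name, self.effect = name, effect
    def tick(self, world):
        print("executing", self.name)
        world[self.effect] = True          # pretend execution achieves the effect
        return True

tree = Sequence(
    Fallback(Condition("object_detected"), Action("detect_object", "object_detected")),
    Action("pick", "object_in_gripper"),
    Action("place", "object_placed"),
)
print("task succeeded:", tree.tick({}))
```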
Conference Paper
The development of breakthrough technologies helps the deployment of robotic systems in industry. The implementation and integration of such technologies will improve productivity, flexibility and competitiveness in diverse industrial settings, especially for small and medium enterprises. In this paper we present a framework that integrates three novel technologies, namely safe robot arms with multi-modal and auto-calibrated sensing skin, a robot control framework to generate dynamic behaviors fusing multiple sensor signals, and an intuitive and fast teaching-by-demonstration method that segments and recognizes the robot activities on-line based on re-usable semantic descriptions. In order to validate our framework, these technologies are integrated in an industrial setting to sort and pack fruits. We demonstrate that our presented framework enables a standard industrial robotic system to be flexible, modular and adaptable to different production requirements.
Article
A dataset is crucial for model learning and evaluation. Choosing the right dataset to use or making a new dataset requires knowledge of those that are available. In this work, we provide that knowledge by reviewing twenty datasets that were published in the past six years and that are directly related to object manipulation. We report on modalities, activities, and annotations for each individual dataset and give our view on its use for object manipulation. We also compare the datasets and summarize them. We conclude with our suggestions for future datasets.
Article
Given semantic descriptions of object classes, zero-shot learning aims to accurately recognize objects of the unseen classes, from which no examples are available at the training stage, by associating them to the seen classes, from which labeled examples are provided. We propose to tackle this problem from the perspective of manifold learning. Our main idea is to align the semantic space that is derived from external information to the model space that concerns itself with recognizing visual features. To this end, we introduce a set of "phantom" object classes whose coordinates live in both the semantic space and the model space. Serving as bases in a dictionary, they can be optimized from labeled data such that the synthesized real object classifiers achieve optimal discriminative performance. We demonstrate superior accuracy of our approach over the state of the art on four benchmark datasets for zero-shot learning, including the full ImageNet Fall 2011 dataset with more than 20,000 unseen classes.
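A NumPy sketch, under stated assumptions, of the synthesized-classifier idea: unseen-class classifiers are formed as similarity-weighted combinations of "phantom" base classifiers that have coordinates in both the semantic space and the model space. All dimensions and data here are random placeholders, not the paper's learned model.

```python
# Hedged sketch of synthesizing classifiers for unseen classes from phantom bases.
import numpy as np

rng = np.random.default_rng(0)
d, r, n_unseen = 64, 5, 3
V = rng.normal(size=(r, d))                  # phantom base classifiers (model space)
B = rng.normal(size=(r, 20))                 # phantom coordinates in semantic space
A_unseen = rng.normal(size=(n_unseen, 20))   # semantic descriptions of unseen classes

def synthesize(a, B, V, sigma=1.0):
    """Combine phantom bases with softmax weights from semantic similarity."""
    sims = -np.linalg.norm(B - a, axis=1) ** 2 / (2 * sigma ** 2)
    w = np.exp(sims - sims.max()); w /= w.sum()
    return w @ V                             # synthesized classifier for description a

W_unseen = np.stack([synthesize(a, B, V) for a in A_unseen])
x = rng.normal(size=d)                       # a test visual feature (placeholder)
print("predicted unseen class:", int(np.argmax(W_unseen @ x)))
```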
Conference Paper
One of the main challenges in learning fine-grained visual categories is gathering training images. Recent work in Zero-Shot Learning (ZSL) circumvents this challenge by describing categories via attributes or text. However, not all visual concepts, e.g., two people dancing, are easily amenable to such descriptions. In this paper, we propose a new modality for ZSL using visual abstraction to learn difficult-to-describe concepts. Specifically, we explore concepts related to people and their interactions with others. Our proposed modality allows one to provide training data by manipulating abstract visualizations, e.g., one can illustrate interactions between two clipart people by manipulating each person’s pose, expression, gaze, and gender. The feasibility of our approach is shown on a human pose dataset and a new dataset containing complex interactions between two people, where we outperform several baselines. To better match across the two domains, we learn an explicit mapping between the abstract and real worlds.
Conference Paper
Reasoning about objects and their affordances is a fundamental problem for visual intelligence. Most of the previous work casts this problem as a classification task where separate classifiers are trained to label objects, recognize attributes, or assign affordances. In this work, we consider the problem of object affordance reasoning using a knowledge base representation. Diverse information of objects are first harvested from images and other meta-data sources. We then learn a knowledge base (KB) using a Markov Logic Network (MLN). Given the learned KB, we show that a diverse set of visual inference tasks can be done in this unified framework without training separate classifiers, including zero-shot affordance prediction and object recognition given human poses.
Article
Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources - such as text data - both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches state-of-the-art performance on the 1000-class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zero-shot predictions achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model.
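A minimal sketch of the visual-semantic embedding idea: a visual feature is projected into a label-embedding space and labels, including ones never seen during training, are ranked by similarity. The projection matrix and embeddings below are random placeholders, not a trained model.

```python
# Illustrative sketch of nearest-label-embedding prediction; data are placeholders.
import numpy as np

rng = np.random.default_rng(1)
labels = ["cat", "dog", "zebra"]              # candidate labels, possibly unseen in training
E = rng.normal(size=(3, 50))                  # label word embeddings (placeholder)
E /= np.linalg.norm(E, axis=1, keepdims=True)
M = rng.normal(size=(50, 2048))               # learned visual-to-semantic projection (placeholder)

def predict(feature):
    z = M @ feature                           # map visual feature into the semantic space
    z /= np.linalg.norm(z)
    return labels[int(np.argmax(E @ z))]      # nearest label embedding wins

print(predict(rng.normal(size=2048)))
```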
Article
We are exploring how primitives, small units of behavior, can speed up robot learning and enable robots to learn difficult dynamic tasks in reasonable amounts of time. In this chapter we describe work on learning from observation and learning from practice on air hockey and marble maze tasks. We discuss our research strategy, results, and open issues and challenges. Primitives are units of behavior above the level of motor or muscle commands. There have been many proposals for such units of behavior in neuroscience, psychology, robotics, artificial intelligence and machine learning (Arkin, 1998; Schmidt, 1988; Schmidt, 1975; Russell and Norvig, 1995; Barto and Mahadevan, 2003). There is a great deal of evidence that biological systems have units of behavior above the level of activating individual motor neurons, and that the organization of the brain reflects those units of behavior (Loeb, 1989). We know that in human eye movement, for example, there are only a few types of movements including saccades, smooth pursuit, vestibular ocular reflex (VOR), optokinetic nystagmus (OKN) and vergence, that general eye movements are generated as sequences of these behavioral units, and that there are distinct brain regions dedicated to generating and controlling each type of eye movement (Carpenter, 1988). We know that there are discrete locomotion patterns, or gaits, for animals with legs (McMahon, 1984). Whether there are corresponding units of behavior for upper limb movement in humans and other primates is not yet clear.
Article
We consider the problem of detecting past activities as well as anticipating which activity will happen in the future and how. We start by modeling the rich spatio-temporal relations between human poses and objects (called affordances) using a conditional random field (CRF). However, because of the ambiguity in the temporal segmentation of the sub-activities that constitute an activity, in the past as well as in the future, multiple graph structures are possible. In this paper, we reason about these alternate possibilities by reasoning over multiple possible graph structures. We obtain them by approximating the graph with only additive features, which lends itself to efficient dynamic programming. Starting with this proposal graph structure, we then design moves to obtain several other likely graph structures. We then show that our approach improves the state-of-the-art significantly for detecting past activities as well as for anticipating future activities, on a dataset of 120 activity videos collected from four subjects.
Article
Being aware of the context is one of the important requirements of Cyber-Physical Systems (CPS). Context-aware systems have the capability to sense what is happening or changing in their environment and take appropriate actions to adapt to the changes. In this chapter, we present a technique for identifying the focus of attention in a context-aware cyber-physical system. We propose to use first-person vision, obtained through a wearable gaze-directed camera that can capture the scene from the wearer's point of view. We use the fact that human cognition is linked to gaze and that the object or person of interest typically holds our gaze. We argue that our technique is robust and works well in the presence of noise and other distracting signals, where the conventional techniques of IR sensors and tagging fail. Moreover, the technique is unobtrusive and does not pollute the environment with unnecessary signals. Our approach is general in that it may be applied to a generic CPS, such as healthcare, office and industrial scenarios, and also in intelligent homes.
Conference Paper
The need for combined task and motion planning in robotics is well understood. Solutions to this problem have typically relied on special purpose, integrated implementations of task planning and motion planning algorithms. We propose a new approach that uses off-the-shelf task planners and motion planners and makes no assumptions about their implementation. Doing so enables our approach to directly build on, and benefit from, the vast literature and latest advances in task planning and motion planning. It uses a novel representational abstraction and requires only that failures in computing a motion plan for a high-level action be identifiable and expressible in the form of logical predicates at the task level. We evaluate the approach and illustrate its robustness through a number of experiments using a state-of-the-art robotics simulator and a PR2 robot. These experiments show the system accomplishing a diverse set of challenging tasks such as taking advantage of a tray when laying out a table for dinner and picking objects from cluttered environments where other objects need to be re-arranged before the target object can be reached.
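A schematic sketch of the described interface between the two planners: when motion planning for a task-level action fails, the failure is returned as a logical predicate and the task planner replans with that fact added. Both planners are stubbed out here; the predicates and actions are hypothetical, not the paper's implementation.

```python
# Schematic task-and-motion planning loop with failure feedback as a predicate.
def task_planner(goal, facts):
    # stand-in for an off-the-shelf symbolic planner
    if "blocked(target)" in facts:
        return ["move(obstacle)", "pick(target)"]
    return ["pick(target)"]

def motion_planner(action, world):
    # stand-in geometric reasoning: picking fails while the obstacle is in the way
    if action == "pick(target)" and world["obstacle_present"]:
        return None, "blocked(target)"          # failure plus an explanatory predicate
    if action == "move(obstacle)":
        world["obstacle_present"] = False       # pretend execution clears the obstacle
    return f"trajectory_for[{action}]", None

world, facts, executed = {"obstacle_present": True}, set(), []
while True:
    plan = task_planner("holding(target)", facts)
    for action in plan:
        trajectory, failure = motion_planner(action, world)
        if failure:
            facts.add(failure)                  # report the failure at the task level
            break
        executed.append(trajectory)
    else:
        break                                   # every task-level action was refined
print(executed)
```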
Article
In this paper, we propose an approach for learning task specifications automatically, by observing human demonstrations. Using this approach allows a robot to combine representations of individual actions to achieve a high-level goal. We hypothesize that task specifications consist of variables that present a pattern of change that is invariant across demonstrations. We identify these specifications at different stages of task completion. Changes in task constraints allow us to identify transitions in the task description and to segment them into subtasks. We extract the following task-space constraints: 1) the reference frame in which to express the task variables; 2) the variable of interest at each time step, position, or force at the end effector; and 3) a factor that can modulate the contribution of force and position in a hybrid impedance controller. The approach was validated on a seven-degree-of-freedom Kuka arm, performing two different tasks: grating vegetables and extracting a battery from a charging stand.
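One ingredient of such an approach can be sketched as follows: variables whose variance across aligned demonstrations stays below a threshold are treated as invariant, and therefore task-relevant, at that stage of the task. The data, the alignment assumption, and the threshold are placeholders, not the authors' estimator.

```python
# Hedged sketch: detect task variables that are invariant across demonstrations.
import numpy as np

rng = np.random.default_rng(2)
# 5 demonstrations x 100 time steps x 3 variables (e.g. x, y, z of the end effector)
demos = rng.normal(scale=0.05, size=(5, 100, 3))
demos[:, 50:, 2] += np.linspace(0.0, 1.0, 50)            # variable 2 drifts in the second half
demos[:, 50:, 2] *= rng.uniform(0.5, 1.5, size=(5, 1))   # ...differently in each demo

def invariant_variables(demos, threshold=0.01):
    """Return, per time step, which variables have low variance across demonstrations."""
    var = demos.var(axis=0)                              # (T, D) variance across demos
    return var < threshold

mask = invariant_variables(demos)
print("variables invariant at t=10:", np.where(mask[10])[0])
print("variables invariant at t=90:", np.where(mask[90])[0])
```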
Article
Making future autonomous robots capable of accomplishing human-scale manipulation tasks requires us to equip them with knowledge and reasoning mechanisms. We propose Open-EASE, a remote knowledge representation and processing service that aims at facilitating these capabilities. Open-EASE gives its users unprecedented access to the knowledge of leading-edge autonomous robotic agents. It also provides the representational infrastructure to make inhomogeneous experience data from robots and human manipulation episodes semantically accessible, and is complemented by a suite of software tools that enable researchers and robots to interpret, analyze, visualize, and learn from the experience data. Using Open-EASE, users can retrieve the memorized experiences of manipulation episodes and ask queries regarding what the robot saw, reasoned, and did, as well as how the robot did it, why, and what effects it caused.
Article
An important aspect of human perception is anticipation, which we use extensively in our day-to-day activities when interacting with other humans as well as with our surroundings. Anticipating which activities a human will do next (and how) can enable an assistive robot to plan ahead for reactive responses. Furthermore, anticipation can even improve the detection accuracy of past activities. The challenge, however, is two-fold: We need to capture the rich context for modeling the activities and object affordances, and we need to anticipate the distribution over a large space of future human activities. In this work, we represent each possible future using an anticipatory temporal conditional random field (ATCRF) that models the rich spatial-temporal relations through object affordances. We then consider each ATCRF as a particle and represent the distribution over the potential futures using a set of particles. In extensive evaluation on the CAD-120 human activity RGB-D dataset, we first show that anticipation improves the state-of-the-art detection results. We then show that for new subjects (not seen in the training set), we obtain an activity anticipation accuracy (defined as whether one of the top three predictions actually happened) of 84.1%, 74.4% and 62.2% for an anticipation time of 1, 3 and 10 seconds respectively. Finally, we also show a robot using our algorithm for performing a few reactive responses.
Article
This paper develops a general policy for learning the relevant features of an imitation task. We restrict our study to imitation of manipulative tasks or gestures. The imitation process is modeled as a hierarchical optimization system, which minimizes the discrepancy between two multi-dimensional datasets. To classify across manipulation strategies, we apply a probabilistic analysis to data in Cartesian and joint spaces. We determine a general metric that optimizes the policy of task reproduction, following strategy determination. The model successfully discovers strategies in six different imitative tasks and controls task reproduction by a full body humanoid robot.