Conference Paper

RobotVQA - A Scene-Graph- and Deep-Learning-based Visual Question Answering System for Robot Manipulation *



Visual perception remains a key challenge for successful robot manipulation in noisy, cluttered, and dynamic environments. While some perception systems fail to provide adequate semantics of the scene, others lack appropriate learning models and training data. Another major issue in some robot perception systems is their inability to respond promptly to robot control programs for which real-time operation is crucial. This paper proposes an architecture for robot vision in manipulation tasks that addresses the three issues mentioned above. The architecture encompasses a generator of training datasets and a learnable scene describer, coined RobotVQA for Robot Visual Question Answering. It leverages deep learning for prediction and photo-realistic virtual worlds for training. RobotVQA takes as input an RGB or RGBD image of the robot's scene, detects all relevant objects in it, and then describes each object in real time in terms of category, color, material, shape, openability, 6D pose, and segmentation mask. Moreover, RobotVQA computes the qualitative spatial relations among those objects. We refer to such a scene description in this paper as the scene graph or semantic graph of the scene. In RobotVQA, prediction and training take place in a unified manner. Finally, we demonstrate how RobotVQA suits robot control systems that interpret perception as a question answering process.
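The per-object attributes and pairwise relations that make up such a scene graph can be pictured with a minimal data structure. The sketch below is purely illustrative: the class and field names are assumptions for exposition, not RobotVQA's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    # Per-object attributes of the kind RobotVQA predicts (names are illustrative)
    category: str
    color: str
    material: str
    shape: str
    openable: bool
    pose_6d: tuple          # (x, y, z, roll, pitch, yaw)
    mask: object = None     # segmentation mask, e.g. a binary array

@dataclass
class SceneGraph:
    objects: list = field(default_factory=list)
    # Qualitative spatial relations as (subject_idx, relation, object_idx)
    relations: list = field(default_factory=list)

    def query(self, relation):
        """Answer a simple 'which pairs satisfy this relation?' question."""
        return [(self.objects[s].category, self.objects[o].category)
                for s, r, o in self.relations if r == relation]

cup = SceneObject("cup", "red", "ceramic", "cylindrical", False,
                  (0.4, 0.1, 0.8, 0.0, 0.0, 0.0))
table = SceneObject("table", "brown", "wood", "cuboid", False,
                    (0.0, 0.0, 0.7, 0.0, 0.0, 0.0))
g = SceneGraph(objects=[cup, table], relations=[(0, "on", 1)])
print(g.query("on"))  # [('cup', 'table')]
```

A control program can then treat perception queries ("what is on the table?") as lookups against this structure, which is the question-answering view of perception the abstract describes.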


... Kenfack et al. intersect Visual Question Answering (VQA) and robotics and develop the RobotVQA architecture. RobotVQA provides semantically grounded answers to questions about a scene, with the motivation to elicit more meaningful robot object manipulations in the future [15]. Scene graphs have also been utilized to mitigate safety risks in human-robot collaboration scenarios [19], [12]. ...
... Depending on the predicted label, the rank of one or both relationships is incremented (lines 5-10). The resulting list of relationships L_r, sorted by rank, is returned by the algorithm (lines 14-15). Note that the annotation of a training label 0, 1, or 2 is determined via domain knowledge of the failure scenario. ...
When interacting in unstructured human environments, occasional robot failures are inevitable. When such failures occur, everyday people, rather than trained technicians, will be the first to respond. Existing natural language explanations hand-annotate contextual information from an environment to help everyday people understand robot failures. However, this methodology lacks generalizability and scalability. In our work, we introduce a more generalizable semantic explanation framework. Our framework autonomously captures the semantic information in a scene to produce semantically descriptive explanations for everyday users. To generate failure-focused explanations that are semantically grounded, we leverage both semantic scene graphs, to extract spatial relations and object attributes from an environment, and pairwise ranking. Our results show that these semantically descriptive explanations significantly improve everyday users' ability both to identify failures and to provide assistance for recovery, compared to the existing state-of-the-art context-based explanations.
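The rank-incrementing scheme this line of work describes (increment the rank of one or both relationships depending on a predicted label, then return the list sorted by rank) can be sketched as follows. The 0/1/2 label convention and all names here are assumptions for illustration, not the authors' code.

```python
def rank_relationships(relationship_pairs, predict_label):
    """Rank scene-graph relationships by failure relevance via pairwise labels.

    relationship_pairs: iterable of (rel_a, rel_b) pairs to compare.
    predict_label: model returning 0 (neither relationship more relevant),
                   1 (the first one), or 2 (both) -- an assumed convention.
    """
    rank = {}
    for rel_a, rel_b in relationship_pairs:
        rank.setdefault(rel_a, 0)
        rank.setdefault(rel_b, 0)
        label = predict_label(rel_a, rel_b)
        if label == 1:          # first relationship judged more relevant
            rank[rel_a] += 1
        elif label == 2:        # both judged relevant
            rank[rel_a] += 1
            rank[rel_b] += 1
    # Return relationships sorted by rank, most failure-relevant first
    return sorted(rank, key=rank.get, reverse=True)

# Toy usage: a stand-in predictor that always favors the first relationship
pairs = [(("cup", "on", "table"), ("plate", "near", "cup"))]
print(rank_relationships(pairs, lambda a, b: 1))
```

The top-ranked relationships would then be the ones surfaced in a failure explanation.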
... To construct the relation expert layers that expand the learning capabilities of the original built-in scene graph generation methods, we utilize Mixture-of-Experts to build up the relation prediction module, including sets of relation encoder/decoder experts, such as "Encoder A/Decoder A", and the final relation classifier experts, such as "Classifier A" [24]. The advances in SGG have opened up potential downstream applications such as image captioning [25], [26], [27], robotic applications [28], [29], [30], [31], [32], image manipulation [33], and cross-media retrieval [34]. With the message passing mechanism, the neural motif was proposed, showing the advantages of an LSTM-based network [35]. ...
Full-text available
Scene graph generation has made tremendous progress in recent years. However, the intrinsic long-tailed distribution of predicate classes remains a challenging problem. Almost all existing scene graph generation (SGG) methods follow the same framework, using a similar backbone network for object detection and a customized network for scene graph generation. These methods often design a sophisticated context encoder to extract the inherent relevance of scene context w.r.t. the intrinsic predicates, and complicated networks to improve the learning capabilities of the model for highly imbalanced data distributions. To address the unbiased SGG problem, we present a simple yet effective method called Context-Aware Mixture-of-Experts (CAME) to improve model diversity and alleviate biased SGG without a sophisticated design. Specifically, we propose to use a mixture of experts to remedy the heavily long-tailed distribution of predicate classes, which is suitable for most unbiased scene graph generators. With a mixture of relation experts, the long-tailed distribution of predicates is addressed in a divide-and-ensemble manner. As a result, biased SGG is mitigated and the model tends to make more balanced predicate predictions. However, experts with equal weights are not sufficiently diverse to discriminate the different levels of predicate distributions. Hence, we simply use the built-in context-aware encoder to help the network dynamically leverage the rich scene characteristics and further increase the diversity of the model. By utilizing the context information of the image, the importance of each expert w.r.t. the scene context is dynamically assigned. We have conducted extensive experiments on three tasks on the Visual Genome dataset to show that CAME achieves superior performance over previous methods.
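A context-gated mixture of relation experts of the kind the abstract describes can be sketched generically. The snippet below is a plain softmax-gated MoE in NumPy under assumed shapes, not the paper's implementation.

```python
import numpy as np

def context_gated_moe(context, features, experts, gate_weights):
    """Context-gated mixture of relation experts (generic sketch).

    context:      context vector for the image, shape (d,)
    features:     relation features fed to each expert, shape (d,)
    experts:      list of callables, each mapping features -> predicate logits
    gate_weights: matrix mapping context -> one score per expert, shape (k, d)
    """
    scores = gate_weights @ context          # one gating score per expert
    gate = np.exp(scores - scores.max())
    gate = gate / gate.sum()                 # softmax over the k experts
    # Each expert's prediction is weighted by its context-dependent gate value
    outputs = np.stack([e(features) for e in experts])   # shape (k, num_preds)
    return gate @ outputs                    # combined predicate logits

# Toy usage: two experts with fixed outputs; the context strongly favors expert 0
experts = [lambda f: np.array([1.0, 0.0]), lambda f: np.array([0.0, 1.0])]
gate_weights = np.eye(2)
combined = context_gated_moe(np.array([10.0, 0.0]), np.zeros(2),
                             experts, gate_weights)
```

Because the gate depends on the context vector, different scenes emphasize different experts, which is how the model diversifies predictions across head and tail predicates.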
... Kumar et al. propose a learning framework with GNN-based scene generation to teach a robotic agent to interactively explore cluttered scenes. Kenfack et al. propose RobotVQA, which can generate the scene graph to construct a complete and structured description of the cluttered scene for the manipulation task [8]. ...
Full-text available
The ability to handle objects in cluttered environments has long been anticipated by the robotics community. However, most works merely focus on manipulation instead of rendering the hidden semantic information in cluttered objects. In this work, we introduce the scene graph for embodied exploration in cluttered scenarios to solve this problem. To validate our method in cluttered scenarios, we adopt Manipulation Question Answering (MQA) tasks as our test benchmark, which require an embodied robot to have active exploration ability and semantic understanding of vision and language. As a general solution framework for the task, we propose an imitation learning method to generate manipulations for exploration. Meanwhile, a VQA model based on a dynamic scene graph comprehends a series of RGB frames from the wrist camera of the manipulator as each manipulation step is conducted, in order to answer questions in our framework. Experiments on the MQA dataset with different interaction requirements demonstrate that our proposed framework is effective for the MQA task, a representative of tasks in cluttered scenarios.
... Kenfack et al. propose the RobotVQA system, which can generate a semantic scene graph of the observed scenario. The robot can then resort to the scene graph to manipulate objects more efficiently [18]. In a canonical task, rearrangement, the robot is required to change a given environment to a specified state by manipulation, given different sources of information such as object poses, images, and language descriptions to understand the environment [5]. ...
... • Since there is no single algorithm that can solve all perception tasks, RoboSherlock uses an ensemble of experts approach, where each expert solves a specific problem, e.g., color segmentation, 3D registration, transparent object perception, etc. One of the crucial experts is RobotVQA [16], which is a scene-graph and Deep-Learning-based visual question answering system for robot manipulation. At the heart of RobotVQA lies a multi-task Deep-Learning model that extends Mask-RCNN, and infers formal semantic scene graphs from RGB(D) images of the scene at a rate of ≈ 5 fps with a space requirement of 5.5GB. ...
In this paper, we present an experiment, designed to investigate and evaluate the scalability and the robustness aspects of mobile manipulation. The experiment involves performing variations of mobile pick and place actions and opening/closing environment containers in a human household. The robot is expected to act completely autonomously for extended periods of time. We discuss the scientific challenges raised by the experiment as well as present our robotic system that can address these challenges and successfully perform all the tasks of the experiment. We present empirical results and the lessons learned as well as discuss where we hit limitations.
Full-text available
Over the last years deep learning methods have been shown to outperform previous state-of-the-art machine learning techniques in several fields, with computer vision being one of the most prominent cases. This review paper provides a brief overview of some of the most significant deep learning schemes used in computer vision problems, that is, Convolutional Neural Networks, Deep Boltzmann Machines and Deep Belief Networks, and Stacked Denoising Autoencoders. A brief account of their history, structure, advantages, and limitations is given, followed by a description of their applications in various computer vision tasks, such as object detection, face recognition, action and activity recognition, and human pose estimation. Finally, a brief overview is given of future directions in designing deep learning schemes for computer vision problems and the challenges involved therein.
Full-text available
Relational reasoning is a central component of generally intelligent behavior, but has proven difficult for neural networks to learn. In this paper we describe how to use Relation Networks (RNs) as a simple plug-and-play module to solve problems that fundamentally hinge on relational reasoning. We tested RN-augmented networks on three tasks: visual question answering using a challenging dataset called CLEVR, on which we achieve state-of-the-art, super-human performance; text-based question answering using the bAbI suite of tasks; and complex reasoning about dynamic physical systems. Then, using a curated dataset called Sort-of-CLEVR we show that powerful convolutional networks do not have a general capacity to solve relational questions, but can gain this capacity when augmented with RNs. Our work shows how a deep learning architecture equipped with an RN module can implicitly discover and learn to reason about entities and their relations.
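The RN module's core computation is a function over all object pairs, RN(O) = f_phi(sum over pairs (i, j) of g_theta(o_i, o_j)). A minimal sketch, with toy callables standing in for the paper's MLPs g_theta and f_phi:

```python
import itertools
import numpy as np

def relation_network(objects, g, f):
    """RN(O) = f( sum over all ordered pairs (o_i, o_j) of g(o_i, o_j) ).

    objects: list of object feature vectors
    g: pairwise relation function (an MLP in the paper; any callable here)
    f: aggregation function applied to the summed pair features
    """
    pair_sum = sum(g(oi, oj)
                   for oi, oj in itertools.product(objects, repeat=2))
    return f(pair_sum)

# Toy example: g measures squared distance between objects, f averages over
# the len(objects)**2 ordered pairs
objs = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
out = relation_network(objs, g=lambda a, b: np.sum((a - b) ** 2),
                       f=lambda s: s / len(objs) ** 2)
print(out)  # 0.5
```

The key design point is that g is applied to every pair and the results are summed, so the module is invariant to object ordering and forces the network to consider all potential relations.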
Full-text available
Today, computer vision systems are tested by their accuracy in detecting and localizing instances of objects. As an alternative, and motivated by the ability of humans to provide far richer descriptions and even tell a story about an image, we construct a "visual Turing test": an operator-assisted device that produces a stochastic sequence of binary questions from a given test image. The query engine proposes a question; the operator either provides the correct answer or rejects the question as ambiguous; the engine proposes the next question ("just-in-time truthing"). The test is then administered to the computer-vision system, one question at a time. After the system's answer is recorded, the system is provided the correct answer and the next question. Parsing is trivial and deterministic; the system being tested requires no natural language processing. The query engine employs statistical constraints, learned from a training set, to produce questions with essentially unpredictable answers: the answer to a question, given the history of questions and their correct answers, is nearly equally likely to be positive or negative. In this sense, the test is only about vision. The system is designed to produce streams of questions that follow natural story lines, from the instantiation of a unique object, through an exploration of its properties, and on to its relationships with other uniquely instantiated objects.
Full-text available
For mobile robot manipulation, autonomous object detection and localization is at present still an open issue. This paper presents a method for detecting and localizing simple colored geometric objects, such as cubes, prisms, and cylinders, located on a table. The proposed method uses a passive stereovision system and consists of two steps. The first is colored object detection, which combines a color segmentation procedure with an edge detection method to restrict colored regions. The second step is pose recovery, where the mask from colored object detection is merged with the disparity map coming from the stereo camera. This latter step is very important to avoid noise inherent in the stereo correlation process. The filtered 3D data is then used to determine the main plane on which the objects rest, the table; the footprint is then used to localize them in the stereo camera reference frame and finally in the world reference frame.
To truly understand the visual world our models should be able not only to recognize images but also generate them. To this end, there has been exciting recent progress on generating images from natural language descriptions. These methods give stunning results on limited domains such as descriptions of birds or flowers, but struggle to faithfully reproduce complex sentences with many objects and relationships. To overcome this limitation we propose a method for generating images from scene graphs, enabling explicit reasoning about objects and their relationships. Our model uses graph convolution to process input graphs, computes a scene layout by predicting bounding boxes and segmentation masks for objects, and converts the layout to an image with a cascaded refinement network. The network is trained adversarially against a pair of discriminators to ensure realistic outputs. We validate our approach on Visual Genome and COCO-Stuff, where qualitative results, ablations, and user studies demonstrate our method's ability to generate complex images with multiple objects.
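The graph-convolution step (updating each object embedding from its neighbors along scene-graph edges) can be sketched generically. This simple message-passing layer is an illustration of the idea under assumed shapes, not the paper's architecture.

```python
import numpy as np

def graph_conv_layer(node_feats, edges, W_self, W_nbr):
    """One generic graph-convolution step over a scene graph.

    node_feats: (n, d) array of object embeddings
    edges:      list of (src, dst) index pairs from the scene graph
    W_self, W_nbr: (d, d) weight matrices (self vs. neighbor transform)
    """
    n, d = node_feats.shape
    agg = np.zeros_like(node_feats)
    count = np.ones(n)                       # start at 1 so isolated nodes divide by 1
    for s, t in edges:
        agg[t] += node_feats[s] @ W_nbr      # message from subject to object
        count[t] += 1
    out = node_feats @ W_self + agg / count[:, None]
    return np.maximum(out, 0.0)              # ReLU nonlinearity

# Toy usage: two objects, one "cup -> table" edge, identity weights
feats = np.array([[1.0, 0.0], [0.0, 1.0]])
updated = graph_conv_layer(feats, [(0, 1)], np.eye(2), np.eye(2))
```

After a few such layers, each object embedding reflects its relational context, which the model then decodes into a spatial layout of boxes and masks.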
Conference Paper
We present ROBOSHERLOCK, an open source software framework for implementing perception systems for robots performing human-scale everyday manipulation tasks. In ROBOSHERLOCK, perception and interpretation of realistic scenes is formulated as an unstructured information management (UIM) problem. The application of the UIM principle supports the implementation of perception systems that can answer task-relevant queries about objects in a scene, boost object recognition performance by combining the strengths of multiple perception algorithms, support knowledge-enabled reasoning about objects and enable automatic and knowledge-driven generation of processing pipelines. We demonstrate the potential of the proposed framework by three feasibility studies of systems for real-world scene perception that have been built on top of ROBOSHERLOCK.
Conference Paper
Description logics (DLs) are a family of logical formalisms that have initially been designed for the representation of conceptual knowledge in artificial intelligence and are closely related to modal logics. In the last two decades, DLs have been successfully applied in a wide range of interesting application areas. In most of these applications, it is important to equip DLs with expressive means that allow one to describe “concrete qualities” of real-world objects such as their weight, temperature, and spatial extension. The standard approach is to augment description logics with so-called concrete domains, which consist of a set (say, the rational numbers), and a set of n-ary predicates with a fixed extension over this set. The “interface” between the DL and the concrete domain is then provided by a new logical constructor that has, to the best of our knowledge, no counterpart in modal logics. In this paper, we give an overview over description logics with concrete domains and summarize decidability and complexity results from the literature.
Learning to recognize new objects using deep learning and contextual information
  • Barkmeyer
Niklas Barkmeyer. "Learning to recognize new objects using deep learning and contextual information". MA thesis. Technical University of Munich, Sept. 2016.
RobotVQA: Scene-graph-oriented Visual Scene Understanding for Complex Robot Manipulation Tasks based on Deep Learning Architectures and Virtual Reality
  • Kenghagho Kenfack
Franklin Kenghagho Kenfack. RobotVQA: Scene-graph-oriented Visual Scene Understanding for Complex Robot Manipulation Tasks based on Deep Learning Architectures and Virtual Reality. Accessed: 2019-02-23. URL: https://
Description Logics with Concrete Domains - A Survey
  • C Lutz
C. Lutz. "Description Logics with Concrete Domains - A Survey". In: Advances in Modal Logic 2002 (AiML 2002). Toulouse, France, 2002. Final version appeared in Advances in Modal Logic, Volume 4, 2003.
Visual Data Combination for Object Detection and Localization for Autonomous Robot Manipulation Tasks
  • Luis A. Morgado-Ramirez
Luis A. Morgado-Ramirez et al. "Visual Data Combination for Object Detection and Localization for Autonomous Robot Manipulation Tasks". In: Research in Computing Science (2011).
A simple neural network module for relational reasoning
  • Adam Santoro
Adam Santoro et al. "A simple neural network module for relational reasoning". In: CoRR abs/1706.01427 (2017). arXiv: 1706.01427.