Conference Paper

3D AffordanceNet: A Benchmark for Visual Object Affordance Understanding

... Recently, most existing methods [2,6,18] establish explicit mappings between semantic affordance categories and geometric structures; they are restricted to predefined seen categories and fail to ground object affordance outside the training categories. Thus, some studies [27,42,50,56,58] explore grounding object affordance through additional instructions, combining images or language that depict interactions with 3D geometries to introduce external interaction priors and to mitigate the generalization gap caused by affordance diversity. ...
... However, robot manipulation usually requires 3D information about objects, and the 2D affordance grounding obtained by the above works makes it difficult to infer interaction positions on 3D objects. With the release of several 3D object datasets [5,39,46], some works focus on 3D object affordance grounding [6,39,40]; these map semantic affordance onto 3D object structure and fail to handle the open-vocabulary scenario. ...
... We construct the Point Image Affordance Dataset v2 (PIADv2), which comprises paired 2D interaction images and 3D object point clouds. Point clouds are mainly collected from 3DIR [57], 3D-AffordanceNet [6], objaverse [5], etc. Images are mainly collected from AGD20k [34], OpenImage [19], and websites with free licenses. In total, PIADv2 contains 15,213 images and 38,889 point clouds covering 43 object and 24 affordance categories, making it the largest-scale 3D object affordance dataset to date. ...
Preprint
Open-Vocabulary 3D object affordance grounding aims to anticipate "action possibilities" regions on 3D objects given arbitrary instructions, which is crucial for robots to generically perceive real scenarios and respond to operational changes. Existing methods focus on combining images or languages that depict interactions with 3D geometries to introduce external interaction priors. However, they remain vulnerable to a limited semantic space because they fail to leverage implied invariant geometries and potential interaction intentions. Normally, humans address complex tasks through multi-step reasoning and respond to diverse situations by leveraging associative and analogical thinking. In light of this, we propose GREAT (GeometRy-intEntion collAboraTive inference) for Open-Vocabulary 3D Object Affordance Grounding, a novel framework that mines the invariant geometric attributes of objects and reasons analogically about potential interaction scenarios to form affordance knowledge, fully combining this knowledge with both geometries and visual contents to ground 3D object affordance. Besides, we introduce the Point Image Affordance Dataset v2 (PIADv2), the largest 3D object affordance dataset at present, to support the task. Extensive experiments demonstrate the effectiveness and superiority of GREAT. Code and dataset are available on the project page.
... In this process, we freeze most of the Llama-2 weight parameters and apply the fine-tuning strategy of the LLaMA-Adapter [33] to predict a special <mask label>. Based on the 3D AffordanceNet benchmark [43], we define 18 categories of affordance mask labels. To encode the output labels, we employ a one-hot encoding scheme, transforming the hidden state of the <mask label> into a query embedding q. ...
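A minimal sketch of how a special-token hidden state could be projected into such a query embedding; the module name, dimensions, and linear projection below are illustrative assumptions, not details taken from the paper:

```python
import torch
import torch.nn as nn

# Hypothetical helper: read out the LLM hidden state at the special
# <mask label> token and project it into a query embedding q.
class MaskLabelQuery(nn.Module):
    def __init__(self, hidden_dim=4096, query_dim=256):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, query_dim)

    def forward(self, hidden_states, token_ids, mask_token_id):
        # hidden_states: (B, T, hidden_dim) from the (mostly frozen) LLM
        # token_ids:     (B, T) generated token ids
        # Assumes exactly one <mask label> token per sequence.
        pos = (token_ids == mask_token_id).float().argmax(dim=1)
        h = hidden_states[torch.arange(hidden_states.size(0)), pos]
        return self.proj(h)  # query embedding q: (B, query_dim)
```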
... For the comprehensive training of both the geometric-guided encoder and the 3D affordance decoder, we adopt a multi-objective loss function that integrates several components to ensure effective learning. The core affordance loss L_aff is composed of a cross-entropy loss L_ce and a Dice loss L_dice, following the practice established in prior work [43]. This combination helps to optimize both the pixel-wise classification and the spatial consistency of the affordance predictions. ...
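For reference, a minimal sketch of such a combined cross-entropy-plus-Dice objective over per-point affordance predictions; the binary per-point formulation and the equal weighting of the two terms are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def dice_loss(probs, target, eps=1e-6):
    # probs, target: (B, N) per-point affordance probability / binary label
    inter = (probs * target).sum(dim=1)
    denom = probs.sum(dim=1) + target.sum(dim=1)
    return 1.0 - ((2 * inter + eps) / (denom + eps)).mean()

def affordance_loss(logits, target):
    # L_aff = L_ce + L_dice over raw per-point scores and {0,1} labels
    l_ce = F.binary_cross_entropy_with_logits(logits, target.float())
    l_dice = dice_loss(torch.sigmoid(logits), target.float())
    return l_ce + l_dice
```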
... Dataset. We primarily evaluate our methods using the 3D AffordanceNet benchmark [43], which is specifically designed for visual affordance understanding in 3D. The dataset contains 22,949 3D object models across 23 semantic categories and covers 18 affordance categories. ...
Preprint
Full-text available
Affordance understanding, the task of identifying actionable regions on 3D objects, plays a vital role in allowing robotic systems to engage with and operate within the physical world. Although Visual Language Models (VLMs) have excelled in high-level reasoning and long-horizon planning for robotic manipulation, they still fall short in grasping the nuanced physical properties required for effective human-robot interaction. In this paper, we introduce PAVLM (Point cloud Affordance Vision-Language Model), an innovative framework that utilizes the extensive multimodal knowledge embedded in pre-trained language models to enhance the 3D affordance understanding of point clouds. PAVLM integrates a geometric-guided propagation module with hidden embeddings from large language models (LLMs) to enrich visual semantics. On the language side, we prompt Llama-3.1 models to generate refined context-aware text, augmenting the instructional input with deeper semantic cues. Experimental results on the 3D-AffordanceNet benchmark demonstrate that PAVLM outperforms baseline methods for both full and partial point clouds, particularly excelling in its generalization to novel open-world affordance tasks of 3D objects. For more information, visit our project site: pavlm-source.github.io.
... Grasp affordances. It is important to consider the task or ability afforded by parts of objects [19,55] when interacting with them. Several methods have extended object affordances to the affordance provided by the grasp to enable task-oriented manipulation [31]. ...
... Using the local rendered features, we train another decoder network d^a_γ with parameters γ to predict the grasp affordance â(g). The affordances belong to a set of K object affordance classes from 3D AffordanceNet [19]. Given that a grasp can have more than one affordance, this becomes a multi-label classification problem for which we can use separate heads to predict the probability of each affordance class {a_k}_{k=1}^K. ...
... For this, we use another binary cross-entropy loss between the predicted affordance â_k(g) = d^{a_k}_γ({p^g_i, ψ(x, p^g_i)}_{i=1}^N) and the ground-truth label a_k(g) ∈ {0, 1} for every class k. Moreover, since the multi-label dataset is imbalanced and can contain many more negative labels than positive ones, we add a dice loss as proposed in [19]. This loss ensures that the total error in the prediction of positive and negative labels in a batch is weighted equally. ...
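A hedged sketch of this kind of multi-label objective: per-class sigmoid predictions with binary cross-entropy, plus a batch-wise Dice term computed per class to balance positive and negative errors (the exact term weighting in the cited work is an assumption here):

```python
import torch
import torch.nn.functional as F

def multilabel_affordance_loss(logits, labels, eps=1e-6):
    # logits, labels: (B, K) — K affordance classes per grasp, labels in {0, 1}
    probs = torch.sigmoid(logits)
    l_bce = F.binary_cross_entropy_with_logits(logits, labels.float())
    # Dice across the batch for each class: errors on the (rare) positive
    # labels and the (many) negative labels contribute on an equal footing.
    inter = (probs * labels).sum(dim=0)
    denom = probs.sum(dim=0) + labels.sum(dim=0)
    l_dice = (1.0 - (2 * inter + eps) / (denom + eps)).mean()
    return l_bce + l_dice
```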
... Human-object interaction (HOI) understanding aims to excavate co-occurrence relations and interaction attributes between humans and objects [91,103]. For egocentric interactions, in addition to capturing interaction semantics like what the subject is doing or what the interacting object is [5,25], knowing where the interaction specifically manifests in space e.g., human contact [9,80] and object affordance [16,24] is also crucial. The precise delineation of spatial regions constitutes a pivotal component in numerous applications, like interacting with the scene in embodied AI [17,71], interaction modeling in graphics [28,95], robotics manipulation [52,113], and AR/VR [11]. ...
... Perceiving Interaction Regions in 3D Space. For 3D interaction regions, dense human contact [80] and 3D object affordance [16,24] have recently received much attention in the field. Methods that estimate them typically follow two paradigms, one of which is to directly establish a mapping between geometries and semantics [28,55,56,89,96,109], e.g., "sit" links the seat of chairs, as well as the buttocks and thighs of humans. ...
... The calculated contacts in GIMO are also manually refined. Furthermore, we refer to the annotation pipeline [16,102] to annotate the 3D object affordance. Please refer to the appendix for details of the dataset annotation and partition. ...
Preprint
Full-text available
Understanding egocentric human-object interaction (HOI) is a fundamental aspect of human-centric perception, facilitating applications like AR/VR and embodied AI. For egocentric HOI, in addition to perceiving semantics, e.g., "what" interaction is occurring, capturing "where" the interaction specifically manifests in 3D space is also crucial, which links perception and operation. Existing methods primarily leverage observations of HOI to capture interaction regions from an exocentric view. However, incomplete observations of interacting parties in the egocentric view introduce ambiguity between visual observations and interaction contents, impairing their efficacy. From the egocentric view, humans integrate the visual cortex, cerebellum, and brain to internalize their intentions and interaction concepts of objects, allowing them to pre-formulate interactions and to behave even when interaction regions are out of sight. In light of this, we propose harmonizing the visual appearance, head motion, and 3D object to excavate the object interaction concept and subject intention, jointly inferring 3D human contact and object affordance from egocentric videos. To achieve this, we present EgoChoir, which links object structures with the interaction contexts inherent in appearance and head motion to reveal object affordance, further utilizing it to model human contact. Additionally, a gradient modulation is employed to adopt appropriate clues for capturing interaction regions across various egocentric scenarios. Moreover, 3D contact and affordance are annotated for egocentric videos collected from Ego-Exo4D and GIMO to support the task. Extensive experiments on them demonstrate the effectiveness and superiority of EgoChoir. Code and data will be released.
... Subsequent works like Mask3D [16] and OneFormer3D [17] have pushed the boundaries in semantic and instance segmentation with transformer models, enhancing performance. In affordance prediction on 3D point clouds [18]- [22], 3D AffordanceNet [18] provides a benchmark across 18 affordance categories, with 3DAPNet [19] simultaneously predicting affordance regions and generating corresponding 6DoF poses for action affordances. ...
Preprint
Solving the storage problem, where objects must be accurately placed into containers with precise orientations and positions, presents a distinct challenge that extends beyond traditional rearrangement tasks. These challenges are primarily due to the need for fine-grained 6D manipulation and the inherent multi-modality of solution spaces, where multiple viable goal configurations exist for the same storage container. We present a novel Diffusion-based Affordance Prediction (DAP) pipeline for the multi-modal object storage problem. DAP leverages a two-step approach, initially identifying a placeable region on the container and then precisely computing the relative pose between the object and that region. Existing methods either struggle with multi-modality issues or require computation-intensive training. Our experiments demonstrate DAP's superior performance and training efficiency over the current state-of-the-art RPDiff, achieving remarkable results on the RPDiff benchmark. Additionally, our experiments showcase DAP's data efficiency in real-world applications, an advancement over existing simulation-driven approaches. Our contribution fills a gap in robotic manipulation research by offering a solution that is both computationally efficient and capable of handling real-world variability. Code and supplementary material can be found at: https://github.com/changhaonan/DPS.git.
... A broad spectrum of work [10,15,18,45,46,48,49] focuses on learning from hand-crafted affordance datasets, which require extensive data collection and costly annotation. In contrast, humans typically acquire knowledge of affordances in a more efficient manner, either through trial-and-error interactions or by observing others. ...
... Initial successes in this field were achieved through fully supervised methods [9,18,22,34,46] that employ convolutional neural networks (CNNs). These methods, however, heavily depend on large-scale, annotated datasets [10,15,18,45,48,49], which are costly and time-consuming to produce. To reduce annotation costs, recent interest has shifted to weakly-supervised approaches, utilizing keypoints [62,63] and image-level labels [35,43,47]. ...
Preprint
Affordance, defined as the potential actions that an object offers, is crucial for robotic manipulation tasks. A deep understanding of affordance can lead to more intelligent AI systems. For example, such knowledge directs an agent to grasp a knife by the handle for cutting and by the blade when passing it to someone. In this paper, we present a streamlined affordance learning system that encompasses data collection, effective model training, and robot deployment. First, we collect training data from egocentric videos in an automatic manner. Different from previous methods that focus only on the object graspable affordance and represent it as coarse heatmaps, we cover both graspable (e.g., object handles) and functional affordances (e.g., knife blades, hammer heads) and extract data with precise segmentation masks. We then propose an effective model, termed Geometry-guided Affordance Transformer (GKT), to train on the collected data. GKT integrates an innovative Depth Feature Injector (DFI) to incorporate 3D shape and geometric priors, enhancing the model's understanding of affordances. To enable affordance-oriented manipulation, we further introduce Aff-Grasp, a framework that combines GKT with a grasp generation model. For comprehensive evaluation, we create an affordance evaluation dataset with pixel-wise annotations and design real-world tasks for robot experiments. The results show that GKT surpasses the state-of-the-art by 15.9% in mIoU, and Aff-Grasp achieves high success rates of 95.5% in affordance prediction and 77.1% in successful grasping among 179 trials, including evaluations with seen and unseen objects and cluttered scenes.
... The exploration of directly inferring and detecting the desired target based on human intention instructions in the 3D domain remains an uncharted area. Affordance Understanding. Derived from the concept in [14], affordance detection aims to localize an object or a part of an object in 2D [8,36,37,45] or 3D [11,54], focusing on feasible/affordable actions that can be executed while engaging with the object of interest. However, the affordable action still cannot convey human intention; for example, the action "grasp a bottle" does not necessarily equate to the underlying intention "I want something to drink". ...
... However, the affordable action still cannot convey human intention; for example, the action "grasp a bottle" does not necessarily equate to the underlying intention "I want something to drink". Furthermore, the dependency on phrase expressions [36], closed-set classifications [8,37,45,54], and object-centered approaches [11] restricts its ability to interpret free-form intention and make inferences about 3D scenes. ...
Preprint
In real-life scenarios, humans seek out objects in the 3D world to fulfill their daily needs or intentions. This inspires us to introduce 3D intention grounding, a new task in 3D object detection employing RGB-D, based on human intention, such as "I want something to support my back". Closely related, 3D visual grounding focuses on understanding human reference. To achieve detection based on human intention, it relies on humans to observe the scene, reason out the target that aligns with their intention ("pillow" in this case), and finally provide a reference to the AI system, such as "A pillow on the couch". Instead, 3D intention grounding challenges AI agents to automatically observe, reason and detect the desired target solely based on human intention. To tackle this challenge, we introduce the new Intent3D dataset, consisting of 44,990 intention texts associated with 209 fine-grained classes from 1,042 scenes of the ScanNet dataset. We also establish several baselines based on different language-based 3D object detection models on our benchmark. Finally, we propose IntentNet, our unique approach, designed to tackle this intention-based detection problem. It focuses on three key aspects: intention understanding, reasoning to identify object candidates, and cascaded adaptive learning that leverages the intrinsic priority logic of different losses for multiple objective optimization.
... Humans interact with objects in various ways, e.g., lifting, dragging, sitting, sleeping upon. Human object interaction [19,20,74,101,107] focuses on understanding or generating affordances and contact points for a given object. Likewise, hand-object interaction [36,46,73,82,109] predicts the hand pose when performing actions like grasping or holding for a given object. ...
Preprint
Anticipating how a person will interact with objects in an environment is essential for activity understanding, but existing methods are limited to the 2D space of video frames, capturing physically ungrounded predictions of 'what' and ignoring the 'where' and 'how'. We introduce 4D future interaction prediction from videos. Given an input video of a human activity, the goal is to predict what objects at what 3D locations the person will interact with in the next time period (e.g., cabinet, fridge), and how they will execute that interaction (e.g., poses for bending, reaching, pulling). We propose a novel model, FIction, that fuses the past video observation of the person's actions and their environment to predict both the 'where' and 'how' of future interactions. Through comprehensive experiments on a variety of activities and real-world environments in Ego-Exo4D, we show that our proposed approach substantially outperforms prior autoregressive and (lifted) 2D video models, with more than 30% relative gains.
... In recent years, the rapid growth of robotics, autonomous driving, and embodied intelligence has thrust affordance learning into prominence [10]. Early works [11,12,14,26,27] mainly adopt the label assignment paradigm, positing an intrinsic correlation between functional regions and their corresponding affordance labels. However, it encounters challenges in addressing the multiplicity of affordance interpretations, which arise from variations in environmental contexts and operator behavior. ...
Preprint
Full-text available
Perceiving potential "action possibilities" (i.e., affordance) regions of images and learning interactive functionalities of objects from human demonstration is a challenging task due to the diversity of human-object interactions. Prevailing affordance learning algorithms often adopt the label assignment paradigm and presume that there is a unique relationship between functional region and affordance label, yielding poor performance when adapting to unseen environments with large appearance variations. In this paper, we propose to leverage interactive affinity for affordance learning, i.e., extracting interactive affinity from human-object interaction and transferring it to non-interactive objects. Interactive affinity, which represents the contacts between different parts of the human body and local regions of the target object, can provide inherent cues of interconnectivity between humans and objects, thereby reducing the ambiguity of the perceived action possibilities. To this end, we propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues to excavate interactive affinity from human-object interactions jointly. Besides, a contact-driven affordance learning (CAL) dataset is constructed by collecting and labeling over 55,047 images. Experimental results demonstrate that our method outperforms the representative models regarding objective metrics and visual quality. Project: github.com/lhc1224/VCR-Net.
... Robert et al. proposed a multi-view feature aggregation method based on an end-to-end implicit model for large-scale 3D scene segmentation problems, fusing 2D and 3D information for large-scale scene semantic segmentation [30]. These methods use 3D ground truth labels in the learning process and have a profound impact on derivative applications of 3D scene understanding, including 3D object classification [31], 3D object detection and localization [32]-[34], 3D semantic and instance segmentation [35]-[37], 3D feasibility prediction [38]-[40], and so on. ...
Preprint
Accurate 3D scene representation and panoptic understanding are essential for applications such as virtual reality, robotics, and autonomous driving. However, challenges persist with existing methods, including precise 2D-to-3D mapping, handling complex scene characteristics like boundary ambiguity and varying scales, and mitigating noise in panoptic pseudo-labels. This paper introduces a novel perceptual-prior-guided 3D scene representation and panoptic understanding method, which reformulates panoptic understanding within neural radiance fields as a linear assignment problem involving 2D semantics and instance recognition. Perceptual information from pre-trained 2D panoptic segmentation models is incorporated as prior guidance, thereby synchronizing the learning processes of appearance, geometry, and panoptic understanding within neural radiance fields. An implicit scene representation and understanding model is developed to enhance generalization across indoor and outdoor scenes by extending the scale-encoded cascaded grids within a reparameterized domain distillation framework. This model effectively manages complex scene attributes and generates 3D-consistent scene representations and panoptic understanding outcomes for various scenes. Experiments and ablation studies under challenging conditions, including synthetic and real-world scenes, demonstrate the proposed method's effectiveness in enhancing 3D scene representation and panoptic segmentation accuracy.
... With the introduction of several 3D object datasets, some researchers explore affordance grounding using 3D data. Certain approaches directly map semantic affordance to specific parts of 3D object regions [29]-[31]. However, the absence of real-world interaction can hinder generalization capability, especially when object structures do not correspond neatly to specific affordances. ...
Preprint
Grounding 3D scene affordance aims to locate interactive regions in 3D environments, which is crucial for embodied agents to interact intelligently with their surroundings. Most existing approaches achieve this by mapping semantics to 3D instances based on static geometric structure and visual appearance. This passive strategy limits the agent's ability to actively perceive and engage with the environment, making it reliant on predefined semantic instructions. In contrast, humans develop complex interaction skills by observing and imitating how others interact with their surroundings. To empower the model with such abilities, we introduce a novel task: grounding 3D scene affordance from egocentric interactions, where the goal is to identify the corresponding affordance regions in a 3D scene based on an egocentric video of an interaction. This task faces the challenges of spatial complexity and alignment complexity across multiple sources. To address these challenges, we propose the Egocentric Interaction-driven 3D Scene Affordance Grounding (Ego-SAG) framework, which utilizes interaction intent to guide the model in focusing on interaction-relevant sub-regions and aligns affordance features from different sources through a bidirectional query decoder mechanism. Furthermore, we introduce the Egocentric Video-3D Scene Affordance Dataset (VSAD), covering a wide range of common interaction types and diverse 3D environments to support this task. Extensive experiments on VSAD validate both the feasibility of the proposed task and the effectiveness of our approach.
... We present a comprehensive dataset designed for evaluating interactive language-guided affordance segmentation in everyday environments, building upon the work of [24], [41]. Our dataset encompasses 10 diverse indoor settings commonly encountered in daily life, with each environment represented by 50 high-quality annotated images. ...
Preprint
Recent advancements in large language and vision-language models have significantly enhanced multimodal understanding, yet translating high-level linguistic instructions into precise robotic actions in 3D space remains challenging. This paper introduces IRIS (Interactive Responsive Intelligent Segmentation), a novel training-free multimodal system for 3D affordance segmentation, alongside a benchmark for evaluating interactive language-guided affordance in everyday environments. IRIS integrates a large multimodal model with a specialized 3D vision network, enabling seamless fusion of 2D and 3D visual understanding with language comprehension. To facilitate evaluation, we present a dataset of 10 typical indoor environments, each with 50 images annotated for object actions and 3D affordance segmentation. Extensive experiments demonstrate IRIS's capability in handling interactive 3D affordance segmentation tasks across diverse settings, showcasing competitive performance across various metrics. Our results highlight IRIS's potential for enhancing human-robot interaction based on affordance understanding in complex indoor environments, advancing the development of more intuitive and efficient robotic systems for real-world applications.
... Therefore, affordance learning has been widely studied in the literature. Deng et al. built a benchmark for visual object affordance understanding by manual annotation [34]. Cui et al. explored learning affordances using point supervision [35]. ...
Preprint
Articulated object manipulation is ubiquitous in daily life. In this paper, we present DexSim2Real², a novel robot learning framework for goal-conditioned articulated object manipulation using both two-finger grippers and multi-finger dexterous hands. The key to our framework is constructing an explicit world model of unseen articulated objects through active one-step interactions. This explicit world model enables sampling-based model predictive control to plan trajectories achieving different manipulation goals without needing human demonstrations or reinforcement learning. It first predicts an interaction motion using an affordance estimation network trained on self-supervised interaction data or videos of human manipulation from the internet. After executing this interaction on the real robot, the framework constructs a digital twin of the articulated object in simulation based on the two point clouds before and after the interaction. For dexterous multi-finger manipulation, we propose to utilize eigengrasps to reduce the high-dimensional action space, enabling more efficient trajectory searching. Extensive experiments validate the framework's effectiveness for precise articulated object manipulation in both simulation and the real world using a two-finger gripper and a 16-DoF dexterous hand. The robust generalizability of the explicit world model also enables advanced manipulation strategies, such as manipulating with different tools.
... Objects can be decomposed into parts based on diverse properties (e.g., geometry or affordance [9,24]). However, these different decompositions do not always align with one another; the same object can be segmented into parts differently depending on the specific use case. ...
Preprint
Full-text available
3D object part segmentation is essential in computer vision applications. While substantial progress has been made in 2D object part segmentation, the 3D counterpart has received less attention, in part due to the scarcity of annotated 3D datasets, which are expensive to collect. In this work, we propose to leverage a few annotated 3D shapes or richly annotated 2D datasets to perform 3D object part segmentation. We present our novel approach, termed 3-By-2, which achieves SOTA performance on different benchmarks at various granularity levels. By using features from pretrained foundation models and exploiting semantic and geometric correspondences, we are able to overcome the challenges of limited 3D annotations. Our approach leverages available 2D labels, enabling effective 3D object part segmentation. Our method 3-By-2 can accommodate various part taxonomies and granularities, demonstrating interesting part label transfer ability across different object categories. Project website: https://ngailapdi.github.io/projects/3by2/.
... The employment of attention mechanisms has emerged as a promising strategy for enhancing deep convolutional neural networks and has seen widespread application across a spectrum of research fields, including semantic segmentation [8,9], object detection [10,11], and object recognition [12,13]. Specifically, in object recognition tasks, the incorporation of attention mechanisms serves to bolster the expressivity of salient features, thereby improving recognition precision. ...
Article
Full-text available
Pooling operations, essential for neural networks, reduce feature map dimensions while preserving key features and enhancing spatial invariance. Traditional pooling methods often miss the feature maps' alternating-current (non-DC frequency) components. This study introduces a novel global pooling technique utilizing spectral self-attention, leveraging the discrete cosine transform for spectral analysis and a self-attention mechanism for assessing the significance of frequency components. This approach allows for efficient feature synthesis through weighted averaging, significantly boosting TOP-1 accuracy with a minimal parameter increase, outperforming existing models.
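One plausible reading of this design, sketched below with NumPy/SciPy: take a per-channel 2D DCT of the feature map, treat a low-frequency block of coefficients as candidate components, score them with a learned vector (a stand-in for the paper's unspecified self-attention module), and pool by the resulting weighted average. Everything here, including the scoring vector w_q and the block size k, is an illustrative assumption:

```python
import numpy as np
from scipy.fft import dctn

def spectral_attention_pool(fmap, w_q, k=4, temperature=1.0):
    # fmap: (C, H, W) feature map; w_q: (C,) hypothetical learned scores.
    C, H, W = fmap.shape
    spec = dctn(fmap, axes=(1, 2), norm="ortho")   # per-channel 2D DCT
    comps = spec[:, :k, :k].reshape(C, -1)         # low-frequency block, (C, k*k)
    # Softmax attention weights over the frequency components.
    scores = w_q[:, None] * comps / temperature
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return (attn * comps).sum(axis=1)              # pooled descriptor, (C,)
```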
... Affordance Grounding. Visual affordance grounding has been intensively explored in the fields of robotics and computer vision [9,15,25,31,34-36,46,47,51-53]. Traditional approaches [6,10,11,29,39] mainly learn the affordance through fully supervised learning. ...
Preprint
Affordance grounding aims to localize the interaction regions of manipulated objects in a scene image according to given instructions. A critical challenge in affordance grounding is that the embodied agent should understand human instructions and analyze which tools in the environment can be used, as well as how to use these tools to accomplish the instructions. Most recent works primarily support simple action labels as input instructions for localizing affordance regions, failing to capture complex human objectives. Moreover, these approaches typically identify affordance regions of only a single object in object-centric images, ignoring the object context and struggling to localize affordance regions of multiple objects in complex scenes for practical applications. To address this concern, for the first time, we introduce a new task of affordance grounding based on natural language instructions, extending it from the previous use of simple labels to complex human instructions. For this new task, we propose a new framework, WorldAfford. We design a novel Affordance Reasoning Chain-of-Thought Prompting to reason about affordance knowledge from LLMs more precisely and logically. Subsequently, we use SAM and CLIP to localize the objects related to the affordance knowledge in the image. We identify the affordance regions of the objects through an affordance region localization module. To benchmark this new task and validate our framework, an affordance grounding dataset, LLMaFF, is constructed. We conduct extensive experiments to verify that WorldAfford achieves state-of-the-art performance on both the previous AGD20K and the new LLMaFF dataset. In particular, WorldAfford can localize the affordance regions of multiple objects and provide an alternative when objects in the environment cannot fully match the given instruction.
... achieve weakly supervised affordance detection using only a few key points. Deng et al. (2021) expand affordance detection to 3D scenes. Luo et al. (2021b) and Zhai et al. (2022b) explore human purpose-driven object affordance detection in unseen scenarios. ...
Article
Full-text available
Affordance grounding aims to locate objects' "action possibilities" regions, an essential step toward embodied intelligence. Due to the diversity of interactive affordance, i.e., the uniqueness of different individual habits leads to diverse interactions, it is difficult to establish an explicit link between object parts and affordance labels. Humans have the ability to transform various exocentric interactions into invariant egocentric affordance to counter the impact of interactive diversity. To empower an agent with such an ability, this paper proposes a task of affordance grounding from the exocentric view, i.e., given exocentric human-object interaction and egocentric object images, learning the affordance knowledge of the object and transferring it to the egocentric image using only the affordance label as supervision. However, there is some "interaction bias" between personas, mainly regarding different regions and views. To this end, we devise a cross-view affordance knowledge transfer framework that extracts affordance-specific features from exocentric interactions and transfers them to the egocentric view to solve the above problems. Furthermore, the perception of affordance regions is enhanced by preserving affordance co-relations. In addition, an affordance grounding dataset named AGD20K is constructed by collecting and labeling over 20K images from 36 affordance categories. Experimental results demonstrate that our method outperforms the representative models regarding objective metrics and visual quality. The code is available via: github.com/lhc1224/Cross-View-AG.
Preprint
Full-text available
Understanding functionalities in 3D scenes involves interpreting natural language descriptions to locate functional interactive objects, such as handles and buttons, in a 3D environment. Functionality understanding is highly challenging, as it requires both world knowledge to interpret language and spatial perception to identify fine-grained objects. For example, given a task like 'turn on the ceiling light', an embodied AI agent must infer that it needs to locate the light switch, even though the switch is not explicitly mentioned in the task description. To date, no dedicated methods have been developed for this problem. In this paper, we introduce Fun3DU, the first approach designed for functionality understanding in 3D scenes. Fun3DU uses a language model to parse the task description through Chain-of-Thought reasoning in order to identify the object of interest. The identified object is segmented across multiple views of the captured scene by using a vision and language model. The segmentation results from each view are lifted in 3D and aggregated into the point cloud using geometric information. Fun3DU is training-free, relying entirely on pre-trained models. We evaluate Fun3DU on SceneFun3D, the most recent and only dataset to benchmark this task, which comprises over 3000 task descriptions on 230 scenes. Our method significantly outperforms state-of-the-art open-vocabulary 3D segmentation approaches. Code will be released publicly.
Chapter
Deep learning-based affordance research is motivated by the need to achieve higher degrees of automation and accuracy. By contrast, the concept of affordance originated to explain the intuitive intelligent responses of humans in their interaction with the environment. With a state-of-the-art supervised deep learning model, achieving high accuracy in an object affordance recognition challenge is commonplace nowadays. But does this ensure that the model understands object affordance correctly and can explain its decisions to users? Deep learning algorithms and architectures for affordance research must possess an understanding of their own. We present a brief study of the existing literature on explainable affordance research and offer a few suggestions to improve the explainability and interpretability of current deep learning methods for affordance detection. Three pretrained vision models are considered for supervised object affordance classification without affordance heatmaps as a teaching signal. The output of these models, obtained after experimentation on a modified CAD-120 dataset, is fed to Smooth Grad-CAM++ for post hoc explainability analysis. These experiments lead to a proposal of a framework for object affordance classification.
Preprint
The household rearrangement task involves spotting misplaced objects in a scene and accommodating them in proper places. It depends both on common-sense knowledge on the objective side and on human user preference on the subjective side. To achieve this task, we propose to mine object functionality with user-preference alignment directly from the scene itself, without relying on human intervention. To do so, we work with a scene graph representation and propose LLM-enhanced scene graph learning, which transforms the input scene graph into an affordance-enhanced graph (AEG) with information-enhanced nodes and newly discovered edges (relations). In the AEG, the nodes corresponding to receptacle objects are augmented with context-induced affordance encoding what kinds of carriable objects can be placed on them. New edges are discovered from newly found non-local relations. With the AEG, we perform task planning for scene rearrangement by detecting misplaced carriables and determining a proper placement for each of them. We test our method by implementing a tidying robot in a simulator and perform evaluation on a new benchmark we built. Extensive evaluations demonstrate that our method achieves state-of-the-art performance on misplacement detection and the subsequent rearrangement planning.
Preprint
The versatility and adaptability of human grasping catalyze advancing dexterous robotic manipulation. While significant strides have been made in dexterous grasp generation, current research endeavors pivot towards optimizing object manipulation while ensuring functional integrity, emphasizing the synthesis of functional grasps following desired affordance instructions. This paper addresses the challenge of synthesizing functional grasps tailored to diverse dexterous robotic hands by proposing DexGrasp-Diffusion, an end-to-end modularized diffusion-based pipeline. DexGrasp-Diffusion integrates MultiHandDiffuser, a novel unified data-driven diffusion model for multi-dexterous hands grasp estimation, with DexDiscriminator, which employs a Physics Discriminator and a Functional Discriminator with open-vocabulary setting to filter physically plausible functional grasps based on object affordances. The experimental evaluation conducted on the MultiDex dataset provides substantiating evidence supporting the superior performance of MultiHandDiffuser over the baseline model in terms of success rate, grasp diversity, and collision depth. Moreover, we demonstrate the capacity of DexGrasp-Diffusion to reliably generate functional grasps for household objects aligned with specific affordance instructions.
Article
In this work, we present a multi-stage pipeline that aims to accurately predict suction grasps for objects with varying properties in cluttered and complex scenes. Existing methods face difficulties in generalizing to unseen objects and effectively handling noisy depth/point cloud data, which often leads to inaccurate grasp predictions. To address these challenges, we utilize the Unseen Object Instance Segmentation (UOIS) technique to segment all objects and extract their instance-level point clouds. Additionally, we introduce the Uncertainty-aware Instance-level Scoring Network (UISN) to generate point-wise suction scores with uncertainties for each object. By calibrating the predicted scores using the estimated uncertainties, we further enhance their reliability for unseen objects and noisy data. Finally, the grasping candidates are ranked based on the calibrated scores, then the most promising grasps can be executed. Our approach achieves state-of-the-art performance on the SuctionNet-1Billion benchmark and demonstrates real robotic grasping, showcasing its accuracy and robustness in cluttered scenes. The code, models, and datasets for training and evaluation are accessible at https://github.com/rcao-hk/UISN .
Article
Traditional affordance learning tasks aim to understand an object's interactive functions in an image, such as affordance recognition and affordance detection. However, these tasks cannot determine whether the object is currently interacting, which is crucial for many follow-up tasks, including robotic manipulation and planning. To fill this gap, this paper proposes a novel object affordance state (OAS) recognition task, i.e., simultaneously recognizing an object's affordances and the partner objects that are interacting with it. Accordingly, to facilitate the application of deep learning technology, an OAS-recognition dataset, OAS10k, is constructed by collecting and labeling over 10k images. In the dataset, a sample is defined as a set of an image and its OAS labels, each label represented as 〈subject, subject's affordance, interacted object〉. These triplet labels have rich relational semantic information, which can improve OAS recognition performance. We hence construct a directed OAS knowledge graph of affordance states and extract an OAS matrix from it to model the semantic relationships of the triplets. Based on the matrix, we propose an OAS recognition network (OASNet), which utilizes a GCN to capture relational semantic embeddings and uses a transformer to fuse them with the visual features of an image to recognize the affordance states of objects in the image. Experimental results on the OAS10k dataset and other triplet-label recognition datasets demonstrate that the proposed OASNet achieves the best performance compared to the state-of-the-art methods. The dataset and code will be released at https://github.com/mxmdpc/OAS.
Article
Full-text available
We address the problem of 3D rotation equivariance in convolutional neural networks. 3D rotations have been a challenging nuisance in 3D classification tasks requiring higher capacity and extended data augmentation in order to tackle it. We model 3D data with multi-valued spherical functions and we propose a novel spherical convolutional network that implements exact convolutions on the sphere by realizing them in the spherical harmonic domain. Resulting filters have local symmetry and are localized by enforcing smooth spectra. We apply a novel pooling on the spectral domain and our operations are independent of the underlying spherical resolution throughout the network. We show that networks with much lower capacity and without requiring data augmentation can exhibit performance comparable to the state of the art in standard 3D shape retrieval and classification benchmarks.
Conference Paper
Full-text available
We address the problem of 3D rotation equivariance in convolutional neural networks. 3D rotations have been a challenging nuisance in 3D classification tasks requiring higher capacity and extended data augmentation in order to tackle it. We model 3D data with multi-valued spherical functions and we propose a novel spherical convolutional network that implements exact convolutions on the sphere by realizing them in the spherical harmonic domain. Resulting filters have local symmetry and are localized by enforcing smooth spectra. We apply a novel pooling on the spectral domain and our operations are independent of the underlying spherical resolution throughout the network. We show that networks with much lower capacity and without requiring data augmentation can exhibit performance comparable to the state of the art in standard retrieval and classification benchmarks.
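The spectral trick both entries above rely on is, in essence, the spherical convolution theorem; the form below (for zonal filters, after Driscoll and Healy) is the standard identity such networks exploit, quoted here for orientation rather than as the paper's exact formulation:

```latex
% Convolution on S^2 becomes pointwise multiplication of
% spherical-harmonic coefficients:
\[
  \widehat{(f \ast k)}^{\,\ell}_{m}
  \;=\; 2\pi \sqrt{\tfrac{4\pi}{2\ell+1}}\;
  \hat{f}^{\,\ell}_{m}\,\hat{k}^{\,\ell}_{0},
  \qquad
  \hat{f}^{\,\ell}_{m} = \int_{S^2} f(x)\,\overline{Y^{\ell}_{m}(x)}\,dx .
\]
```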
Article
Full-text available
Point clouds provide a flexible and scalable geometric representation suitable for countless applications in computer graphics; they also comprise the raw output of most 3D data acquisition devices. Hence, the design of intelligent computational models that act directly on point clouds is critical, especially when efficiency considerations or noise preclude the possibility of expensive denoising and meshing procedures. While hand-designed features on point clouds have long been proposed in graphics and vision, however, the recent overwhelming success of convolutional neural networks (CNNs) for image analysis suggests the value of adapting insight from CNN to the point cloud world. To this end, we propose a new neural network module dubbed EdgeConv suitable for CNN-based high-level tasks on point clouds including classification and segmentation. EdgeConv is differentiable and can be plugged into existing architectures. Compared to existing modules operating largely in extrinsic space or treating each point independently, EdgeConv has several appealing properties: It incorporates local neighborhood information; it can be stacked or recurrently applied to learn global shape properties; and in multi-layer systems affinity in feature space captures semantic characteristics over potentially long distances in the original embedding. Beyond proposing this module, we provide extensive evaluation and analysis revealing that EdgeConv captures and exploits fine-grained geometric properties of point clouds. The proposed approach achieves state-of-the-art performance on standard benchmarks including ModelNet40 and S3DIS.
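As a concrete reference for the module described above, here is a minimal EdgeConv sketch in PyTorch: build a k-NN graph, form edge features (x_i, x_j - x_i), apply a shared MLP, and max-aggregate over each neighborhood. Layer sizes and k are illustrative assumptions:

```python
import torch
import torch.nn as nn

def knn(x, k):
    # x: (B, N, C); indices of the k nearest neighbors per point, (B, N, k)
    d = torch.cdist(x, x)
    return d.topk(k + 1, largest=False).indices[:, :, 1:]  # drop self

class EdgeConv(nn.Module):
    """Edge MLP on (x_i, x_j - x_i), max-aggregated over the k-NN graph."""
    def __init__(self, in_dim, out_dim, k=20):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(
            nn.Linear(2 * in_dim, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim))

    def forward(self, x):
        # x: (B, N, C) point features; recomputing the graph per layer in
        # feature space gives the "dynamic" DGCNN variant.
        B, N, C = x.shape
        idx = knn(x, self.k)                                   # (B, N, k)
        nbrs = torch.gather(
            x.unsqueeze(1).expand(B, N, N, C), 2,
            idx.unsqueeze(-1).expand(B, N, self.k, C))         # (B, N, k, C)
        ctr = x.unsqueeze(2).expand_as(nbrs)
        edge = torch.cat([ctr, nbrs - ctr], dim=-1)            # (B, N, k, 2C)
        return self.mlp(edge).max(dim=2).values                # (B, N, out_dim)
```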
Conference Paper
Full-text available
We present a new method to detect object affordances in real-world scenes using deep Convolutional Neural Networks (CNN), an object detector and dense Conditional Random Fields (CRF). Our system first trains an object detector to generate bounding box candidates from the images. A deep CNN is then used to learn the depth features from these bounding boxes. Finally, these feature maps are post-processed with dense CRF to improve the prediction along class boundaries. The experimental results on our new challenging dataset show that the proposed approach outperforms recent state-of-the-art methods by a substantial margin. Furthermore, from the detected affordances we introduce a grasping method that is robust to noisy data. We demonstrate the effectiveness of our framework on the full-size humanoid robot WALK-MAN using different objects in real-world scenarios.
Article
Full-text available
Few prior works study deep learning on point sets. PointNet by Qi et al. is a pioneer in this direction. However, by design PointNet does not capture local structures induced by the metric space points live in, limiting its ability to recognize fine-grained patterns and generalizability to complex scenes. In this work, we introduce a hierarchical neural network that applies PointNet recursively on a nested partitioning of the input point set. By exploiting metric space distances, our network is able to learn local features with increasing contextual scales. With further observation that point sets are usually sampled with varying densities, which results in greatly decreased performance for networks trained on uniform densities, we propose novel set learning layers to adaptively combine features from multiple scales. Experiments show that our network called PointNet++ is able to learn deep point set features efficiently and robustly. In particular, results significantly better than state-of-the-art have been obtained on challenging benchmarks of 3D point clouds.
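A compact sketch of one such hierarchical level, under the simplifying assumption of k-NN grouping in place of the paper's radius (ball-query) grouping; dimensions and counts are illustrative:

```python
import torch
import torch.nn as nn

def farthest_point_sample(xyz, m):
    # xyz: (B, N, 3); greedily pick m well-spread centroid indices (B, m).
    B, N, _ = xyz.shape
    idx = torch.zeros(B, m, dtype=torch.long, device=xyz.device)
    dist = torch.full((B, N), float("inf"), device=xyz.device)
    cur = torch.zeros(B, dtype=torch.long, device=xyz.device)
    for i in range(m):
        idx[:, i] = cur
        d = ((xyz - xyz[torch.arange(B), cur][:, None]) ** 2).sum(-1)
        dist = torch.minimum(dist, d)
        cur = dist.argmax(-1)  # farthest point from the chosen set
    return idx

class SetAbstraction(nn.Module):
    """One set-abstraction level: FPS sampling, local grouping, then a
    shared PointNet (MLP + max pool) per group."""
    def __init__(self, in_dim, out_dim, m=512, k=32):
        super().__init__()
        self.m, self.k = m, k
        self.mlp = nn.Sequential(
            nn.Linear(in_dim + 3, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim))

    def forward(self, xyz, feats):
        # xyz: (B, N, 3) coordinates, feats: (B, N, C) point features
        B = xyz.size(0)
        ctr_idx = farthest_point_sample(xyz, self.m)
        ctr = xyz[torch.arange(B)[:, None], ctr_idx]            # (B, m, 3)
        nn_idx = torch.cdist(ctr, xyz).topk(self.k, largest=False).indices
        grp_xyz = xyz[torch.arange(B)[:, None, None], nn_idx] - ctr[:, :, None]
        grp_feat = feats[torch.arange(B)[:, None, None], nn_idx]
        local = torch.cat([grp_xyz, grp_feat], dim=-1)          # (B, m, k, C+3)
        return ctr, self.mlp(local).max(dim=2).values           # (B, m, out_dim)
```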
Article
Full-text available
Large repositories of 3D shapes provide valuable input for data-driven analysis and modeling tools. They are especially powerful once annotated with semantic information such as salient regions and functional parts. We propose a novel active learning method capable of enriching massive geometric datasets with accurate semantic region annotations. Given a shape collection and a user-specified region label our goal is to correctly demarcate the corresponding regions with minimal manual work. Our active framework achieves this goal by cycling between manually annotating the regions, automatically propagating these annotations across the rest of the shapes, manually verifying both human and automatic annotations, and learning from the verification results to improve the automatic propagation algorithm. We use a unified utility function that explicitly models the time cost of human input across all steps of our method. This allows us to jointly optimize for the set of models to annotate and for the set of models to verify based on the predicted impact of these actions on the human efficiency. We demonstrate that incorporating verification of all produced labelings within this unified objective improves both accuracy and efficiency of the active learning procedure. We automatically propagate human labels across a dynamic shape network using a conditional random field (CRF) framework, taking advantage of global shape-to-shape similarities, local feature similarities, and point-to-point correspondences. By combining these diverse cues we achieve higher accuracy than existing alternatives. We validate our framework on existing benchmarks demonstrating it to be significantly more efficient at using human input compared to previous techniques. We further validate its efficiency and robustness by annotating a massive shape dataset, labeling over 93,000 shape parts, across multiple model classes, and providing a labeled part collection more than one order of magnitude larger than existing ones.
Article
Full-text available
Scene parsing, or recognizing and segmenting objects and stuff in an image, is one of the key problems in computer vision. Despite the community's efforts in data collection, there are still few image datasets covering a wide range of scenes and object categories with dense and detailed annotations for scene parsing. In this paper, we introduce and analyze the ADE20K dataset, spanning diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. A generic network design called Cascade Segmentation Module is then proposed to enable the segmentation networks to parse a scene into stuff, objects, and object parts in a cascade. We evaluate the proposed module integrated within two existing semantic segmentation networks, yielding significant improvements for scene parsing. We further show that the scene parsing networks trained on ADE20K can be applied to a wide variety of scenes and objects.
Article
Full-text available
Convolutional Neural Networks (CNNs) have been recently employed to solve problems from both the computer vision and medical image analysis fields. Despite their popularity, most approaches are only able to process 2D images, while most medical data used in clinical practice consists of 3D volumes. In this work we propose an approach to 3D image segmentation based on a volumetric, fully convolutional neural network. Our CNN is trained end-to-end on MRI volumes depicting prostate, and learns to predict segmentation for the whole volume at once. We introduce a novel objective function, which we optimise during training, based on the Dice coefficient. In this way we can deal with situations where there is a strong imbalance between the number of foreground and background voxels. To cope with the limited number of annotated volumes available for training, we augment the data by applying random non-linear transformations and histogram matching. We show in our experimental evaluation that our approach achieves good performance on challenging test data while requiring only a fraction of the processing time needed by other previous methods.
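For reference, the Dice-based objective this abstract describes is usually written as below (the squared-denominator form from the V-Net paper), over N voxels with predicted probabilities p_i and binary labels g_i:

```latex
\[
  D \;=\; \frac{2\sum_{i}^{N} p_i\, g_i}
               {\sum_{i}^{N} p_i^{2} + \sum_{i}^{N} g_i^{2}},
\]
% training maximizes D (i.e., minimizes 1 - D), which weights the scarce
% foreground and the abundant background on comparable terms.
```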
Article
Full-text available
We present ShapeNet: a richly-annotated, large-scale repository of shapes represented by 3D CAD models of objects. ShapeNet contains 3D models from a multitude of semantic categories and organizes them under the WordNet taxonomy. It is a collection of datasets providing many semantic annotations for each 3D model such as consistent rigid alignments, parts and bilateral symmetry planes, physical sizes, keywords, as well as other planned annotations. Annotations are made available through a public web-based interface to enable data visualization of object attributes, promote data-driven geometric analysis, and provide a large-scale quantitative benchmark for research in computer graphics and vision. At the time of this technical report, ShapeNet has indexed more than 3,000,000 models, 220,000 models out of which are classified into 3,135 categories (WordNet synsets). In this report we describe the ShapeNet effort as a whole, provide details for all currently available datasets, and summarize future plans.
Article
Full-text available
As robots begin to collaborate with humans in everyday workspaces, they will need to understand the functions of tools and their parts. To cut an apple or hammer a nail, robots need to not just know the tool's name, but they must localize the tool's parts and identify their functions. Intuitively, the geometry of a part is closely related to its possible functions, or its affordances. Therefore, we propose two approaches for learning affordances from local shape and geometry primitives: 1) superpixel based hierarchical matching pursuit (S-HMP); and 2) structured random forests (SRF). Moreover, since a part can be used in many ways, we introduce a large RGB-Depth dataset where tool parts are labeled with multiple affordances and their relative rankings. With ranked affordances, we evaluate the proposed methods on 3 cluttered scenes and over 105 kitchen, workshop and garden tools, using ranked correlation and a weighted F-measure score [26]. Experimental results over sequences containing clutter, occlusions, and viewpoint changes show that the approaches return precise predictions that could be used by a robot. S-HMP achieves high accuracy but at a significant computational cost, while SRF provides slightly less accurate predictions but in real-time. Finally, we validate the effectiveness of our approaches on the Cornell Grasping Dataset [25] for detecting graspable regions, and achieve state-of-the-art performance.
Conference Paper
Full-text available
Human actions naturally co-occur with scenes. In this work we aim to discover action-scene correlation for a large number of scene categories and to use such correlation for action prediction. Towards this goal, we collect a new SUN Action dataset with manual annotations of typical human actions for 397 scenes. We next discover action-scene associations and demonstrate that scene categories can be well identified from their associated actions. Using discovered associations, we address a new task of predicting human actions for images of static scenes. We evaluate prediction of 23 and 38 action classes for images of indoor and outdoor scenes respectively and show promising results. We also propose a new application of geo-localized action prediction and demonstrate ability of our method to automatically answer queries such as “Where is a good place for a picnic?” or “Can I cycle along this path?”.
Article
Full-text available
Understanding human activities and object affordances are two very important skills, especially for personal robots which operate in human environments. In this work, we consider the problem of extracting a descriptive labeling of the sequence of sub-activities being performed by a human, and more importantly, of their interactions with the objects in the form of associated affordances. Given a RGB-D video, we jointly model the human activities and object affordances as a Markov random field where the nodes represent objects and sub-activities, and the edges represent the relationships between object affordances, their relations with sub-activities, and their evolution over time. We formulate the learning problem using a structural support vector machine (SSVM) approach, where labelings over various alternate temporal segmentations are considered as latent variables. We tested our method on a challenging dataset comprising 120 activity videos collected from 4 subjects, and obtained an accuracy of 79.4% for affordance, 63.4% for sub-activity and 75.0% for high-level activity labeling. We then demonstrate the use of such descriptive labeling in performing assistive tasks by a PR2 robot.
Conference Paper
Full-text available
Many object classes are primarily defined by their functions. However, this fact has been left largely unexploited by visual object categorization or detection systems. We propose a method to learn an affordance detector. It identifies locations in 3D space which “support” a particular function. Our novel approach “imagines” an actor performing an action typical for the target object class, instead of relying purely on the visual object appearance. Function is thus handled as a cue complementary to appearance, rather than as a consideration after appearance-based detection. Experimental results are given for the functional category “sitting”. The affordance is tested on a 3D representation of the scene, such as can be realistically obtained through SfM or depth cameras. In contrast to appearance-based object detectors, affordance detection requires only very few training examples and generalizes very well to other sittable objects like benches or sofas when trained on a few chairs.
Conference Paper
Full-text available
This paper proposes a simple and fast operator, the "Hidden" Point Removal operator, which determines the visible points in a point cloud, as viewed from a given viewpoint. Visibility is determined without reconstructing a surface or estimating normals. It is shown that extracting the points that reside on the convex hull of a transformed point cloud amounts to determining the visible points. This operator is general: it can be applied to point clouds in various dimensions, on both sparse and dense point clouds, and to viewpoints internal as well as external to the cloud. It is demonstrated that the operator is useful in visualizing point clouds, in view-dependent reconstruction and in shadow casting.
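The construction is compact enough to sketch directly. Below is a minimal NumPy/SciPy version of the spherical-flipping-plus-convex-hull operator as described; the flipping-sphere radius (`radius_scale` here) is a free parameter of this sketch, not fixed by the summary above:

```python
import numpy as np
from scipy.spatial import ConvexHull

def hidden_point_removal(points, viewpoint, radius_scale=100.0):
    """Return indices of points in `points` (N, 3) visible from `viewpoint`."""
    p = points - np.asarray(viewpoint)                   # move the viewpoint to the origin
    norms = np.linalg.norm(p, axis=1, keepdims=True) + 1e-12
    R = radius_scale * norms.max()                       # flipping-sphere radius (free parameter)
    flipped = p + 2.0 * (R - norms) * (p / norms)        # reflect each point about the sphere
    # A point is visible iff its flipped image lies on the convex hull of the
    # flipped cloud together with the viewpoint (the origin after translation).
    hull = ConvexHull(np.vstack([flipped, np.zeros((1, 3))]))
    return np.sort(hull.vertices[hull.vertices < len(points)])
```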
Article
Robots now play a major role in the manufacturing, entertainment, and healthcare industries. Robot vision aims to equip robots with the capabilities to discover information, understand it, and interact with the environment; these capabilities require an agent to effectively understand object affordances and functions in complex visual domains. This literature survey first focuses on “visual affordances,” summarizing current state-of-the-art approaches to the relevant problems as well as open problems and research gaps. Then, sub-problems such as affordance detection, categorization, segmentation, and high-level affordance reasoning are specifically discussed. Furthermore, functional scene understanding and the descriptors prevalent in the literature are covered. This survey also provides the necessary background to the problem, sheds light on its significance, and highlights the existing challenges for affordance and functionality learning.
Chapter
Arguably one of the top success stories of deep learning is transfer learning. The finding that pre-training a network on a rich source set (e.g., ImageNet) can help boost performance once fine-tuned on a usually much smaller target set, has been instrumental to many applications in language and vision. Yet, very little is known about its usefulness in 3D point cloud understanding. We see this as an opportunity considering the effort required for annotating data in 3D. In this work, we aim at facilitating research on 3D representation learning. Different from previous works, we focus on high-level scene understanding tasks. To this end, we select a suite of diverse datasets and tasks to measure the effect of unsupervised pre-training on a large source set of 3D scenes. Our findings are extremely encouraging: using a unified triplet of architecture, source dataset, and contrastive loss for pre-training, we achieve improvement over recent best results in segmentation and detection across 6 different benchmarks for indoor and outdoor, real and synthetic datasets – demonstrating that the learned representation can generalize across domains. Furthermore, the improvement was similar to supervised pre-training, suggesting that future efforts should favor scaling data collection over more detailed annotation. We hope these findings will encourage more research on unsupervised pretext task design for 3D deep learning.
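The recipe hinges on a contrastive loss over point features. As an illustration only (not the chapter's exact objective), a generic point-level InfoNCE loss in PyTorch, where row i of the two feature matrices comes from the same physical point observed in two augmented views and `temperature` is an assumed hyperparameter:

```python
import torch
import torch.nn.functional as F

def point_infonce(feat_a, feat_b, temperature=0.07):
    """feat_a, feat_b: (N, D) features of N matched points from two views
    of the same scene; every other row in the batch serves as a negative."""
    a = F.normalize(feat_a, dim=1)
    b = F.normalize(feat_b, dim=1)
    logits = a @ b.t() / temperature                 # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)
```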
Article
We address the problem of affordance reasoning in diverse scenes that appear in the real world. Affordances relate the agent's actions to their effects when taken on the surrounding objects. In our work, we take the egocentric view of the scene, and aim to reason about action-object affordances that respect both the physical world as well as the social norms imposed by the society. We also aim to teach artificial agents why some actions should not be taken in certain situations, and what would likely happen if these actions would be taken. We collect a new dataset that builds upon ADE20k, referred to as ADE-Affordance, which contains annotations enabling such rich visual reasoning. We propose a model that exploits Graph Neural Networks to propagate contextual information from the scene in order to perform detailed affordance reasoning about each object. Our model is showcased through various ablation studies, pointing to successes and challenges in this complex task.
Article
This paper presents a novel method to predict future human activities from partially observed RGB-D videos. Human activity prediction is generally difficult due to its non-Markovian property and the rich context between humans and environments. We use a stochastic grammar model to capture the compositional structure of events, integrating human actions, objects, and their affordances. We represent the event by a spatial-temporal And-Or graph (ST-AOG). The ST-AOG is composed of a temporal stochastic grammar defined on sub-activities, and spatial graphs representing sub-activities that consist of human actions, objects, and their affordances. Future sub-activities are predicted using the temporal grammar and the Earley parsing algorithm. The corresponding action, object, and affordance labels are then inferred accordingly. Extensive experiments are conducted to show the effectiveness of our model on both semantic event parsing and future activity prediction.
Article
We propose a new regularization method based on virtual adversarial loss: a new measure of local smoothness of the output distribution. Virtual adversarial loss is defined as the robustness of the model's posterior distribution against local perturbation around each input data point. Our method is similar to adversarial training, but differs from adversarial training in that it determines the adversarial direction based only on the output distribution and that it is applicable to a semi-supervised setting. Because the directions in which we smooth the model are virtually adversarial, we call our method virtual adversarial training (VAT). The computational cost of VAT is relatively low. For neural networks, the approximated gradient of virtual adversarial loss can be computed with no more than two pairs of forward and backpropagations. In our experiments, we applied VAT to supervised and semi-supervised learning on multiple benchmark datasets. With additional improvement based on entropy minimization principle, our VAT achieves the state-of-the-art performance on SVHN and CIFAR-10 for semi-supervised learning tasks.
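The abstract's "no more than two pairs of forward and backpropagations" corresponds to one power-iteration step for the adversarial direction plus one pass for the loss itself. A minimal PyTorch sketch of that procedure, assuming a classifier `model` that returns logits; `xi` and `eps` stand in for the paper's ξ and ε, with placeholder values:

```python
import torch
import torch.nn.functional as F

def _l2_normalize(d):
    # Normalize each sample's perturbation to unit L2 norm.
    flat = d.flatten(1)
    return d / (flat.norm(dim=1).view(d.size(0), *([1] * (d.dim() - 1))) + 1e-12)

def vat_loss(model, x, xi=1e-6, eps=8.0, n_power=1):
    """KL divergence between the posterior at x and at x + r_adv,
    where r_adv is estimated by power iteration."""
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)               # fixed target distribution
    d = _l2_normalize(torch.randn_like(x))
    for _ in range(n_power):                         # one step suffices per the paper
        d.requires_grad_(True)
        kl = F.kl_div(F.log_softmax(model(x + xi * d), dim=1), p,
                      reduction='batchmean')
        grad, = torch.autograd.grad(kl, d)
        d = _l2_normalize(grad.detach())
    r_adv = eps * d                                  # virtual adversarial perturbation
    return F.kl_div(F.log_softmax(model(x + r_adv), dim=1), p,
                    reduction='batchmean')
```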
Article
The recently proposed temporal ensembling has achieved state-of-the-art results in several semi-supervised learning benchmarks. It maintains an exponential moving average of label predictions on each training example, and penalizes predictions that are inconsistent with this target. However, because the targets change only once per epoch, temporal ensembling becomes unwieldy when using large datasets. To overcome this problem, we propose a method that averages model weights instead of label predictions. As an additional benefit, the method improves test accuracy and enables training with fewer labels than earlier methods. We report state-of-the-art results on semi-supervised SVHN, reducing the error rate from 5.12% to 4.41% with 500 labels, and achieving 5.39% error rate with 250 labels. By using extra unlabeled data, we reduce the error rate to 2.76% on 500-label SVHN.
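Averaging weights instead of predictions is only a few lines in practice. A sketch of the update, assuming `teacher` is initialized as a copy of `student` and updated after each optimizer step; the decay `alpha` is illustrative:

```python
import copy
import torch

# teacher = copy.deepcopy(student)  # teacher starts as a frozen copy

@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    """theta_teacher <- alpha * theta_teacher + (1 - alpha) * theta_student."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(alpha).add_(s, alpha=1.0 - alpha)
```

The teacher's predictions then serve as the consistency targets for the student, replacing the once-per-epoch prediction average that makes temporal ensembling unwieldy at scale.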
Article
In this paper, we present a simple and efficient method for training deep neural networks in a semi-supervised setting where only a small portion of training data is labeled. We introduce temporal ensembling, where we form a consensus prediction of the unknown labels under multiple instances of the network-in-training on different epochs, and most importantly, under different regularization and input augmentation conditions. This ensemble prediction can be expected to be a better predictor for the unknown labels than the output of the network at the most recent training epoch, and can thus be used as a target for training. Using our method, we set new records for two standard semi-supervised learning benchmarks, reducing the classification error rate from 18.63% to 12.89% in CIFAR-10 with 4000 labels and from 18.44% to 6.83% in SVHN with 500 labels.
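The per-example ensemble is equally compact. A NumPy sketch of the accumulator with the startup bias correction the paper describes; the momentum `alpha` is illustrative:

```python
import numpy as np

class TemporalEnsemble:
    """EMA over each training example's predictions across epochs."""
    def __init__(self, n_examples, n_classes, alpha=0.6):
        self.alpha = alpha
        self.Z = np.zeros((n_examples, n_classes))   # accumulated predictions
        self.t = 0                                   # epochs seen so far

    def end_epoch(self, epoch_predictions):
        """epoch_predictions: (n_examples, n_classes) outputs gathered over
        this epoch. Returns the bias-corrected consensus targets."""
        self.t += 1
        self.Z = self.alpha * self.Z + (1 - self.alpha) * epoch_predictions
        return self.Z / (1.0 - self.alpha ** self.t)
```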
Conference Paper
Given a single RGB image, our goal is to label every pixel with an affordance type. By affordance, we mean an object’s capability to readily support a certain human action, without requiring precursor actions. We focus on segmenting the following five affordance types in indoor scenes: ‘walkable’, ‘sittable’, ‘lyable’, ‘reachable’, and ‘movable’. Our approach uses a deep architecture, consisting of a number of multi-scale convolutional neural networks, for extracting mid-level visual cues and combining them toward affordance segmentation. The mid-level cues include depth map, surface normals, and segmentation of four types of surfaces – namely, floor, structure, furniture and props. For evaluation, we augmented the NYUv2 dataset with new ground-truth annotations of the five affordance types. We are not aware of prior work which starts from pixels, infers mid-level cues, and combines them in a feed-forward fashion for predicting dense affordance maps of a single RGB image.
Article
We introduce a co-analysis method which learns a functionality model for an object category, e.g., strollers or backpacks. Like previous works on functionality, we analyze object-to-object interactions and intra-object properties and relations. Differently from previous works, our model goes beyond providing a functionality-oriented descriptor for a single object; it prototypes the functionality of a category of 3D objects by co-analyzing typical interactions involving objects from the category. Furthermore, our co-analysis localizes the studied properties to the specific locations, or surface patches, that support specific functionalities, and then integrates the patch-level properties into a category functionality model. Thus our model focuses on the how, via common interactions, and where, via patch localization, of functionality analysis. Given a collection of 3D objects belonging to the same category, with each object provided within a scene context, our co-analysis yields a set of proto-patches, each of which is a patch prototype supporting a specific type of interaction, e.g., stroller handle held by hand. The learned category functionality model is composed of proto-patches, along with their pairwise relations, which together summarize the functional properties of all the patches that appear in the input object category. With the learned functionality models for various object categories serving as a knowledge base, we are able to form a functional understanding of an individual 3D object, without a scene context. With patch localization in the model, functionality-aware modeling, e.g., functional object enhancement and the creation of functional object hybrids, is made possible.
Article
Deep Recurrent Neural Network architectures, though remarkably capable at modeling sequences, lack an intuitive high-level spatio-temporal structure. Yet many problems in computer vision inherently have an underlying high-level structure and can benefit from it. Spatio-temporal graphs are a popular flexible tool for imposing such high-level intuitions in the formulation of real world problems. In this paper, we propose an approach for combining the power of high-level spatio-temporal graphs and the sequence learning success of Recurrent Neural Networks (RNNs). We develop a scalable method for casting an arbitrary spatio-temporal graph as a rich RNN mixture that is feedforward, fully differentiable, and jointly trainable. The proposed method is generic and principled as it can be used for transforming any spatio-temporal graph through employing a certain set of well defined steps. The evaluations of the proposed approach on a diverse set of problems, ranging from modeling human motion to object interactions, show improvement over the state of the art by a large margin. We expect this method to empower a new convenient approach to problem formulation through high-level spatio-temporal graphs and Recurrent Neural Networks, and be of broad interest to the community.
Article
An important aspect of human perception is anticipation, which we use extensively in our day-to-day activities when interacting with other humans as well as with our surroundings. Anticipating which activities a human will do next (and how) can enable an assistive robot to plan ahead for reactive responses. Furthermore, anticipation can even improve the detection accuracy of past activities. The challenge, however, is two-fold: We need to capture the rich context for modeling the activities and object affordances, and we need to anticipate the distribution over a large space of future human activities. In this work, we represent each possible future using an anticipatory temporal conditional random field (ATCRF) that models the rich spatial-temporal relations through object affordances. We then consider each ATCRF as a particle and represent the distribution over the potential futures using a set of particles. In extensive evaluation on the CAD-120 human activity RGB-D dataset, we first show that anticipation improves the state-of-the-art detection results. We then show that for new subjects (not seen in the training set), we obtain an activity anticipation accuracy (defined as whether one of the top three predictions actually happened) of 84.1%, 74.4% and 62.2% for an anticipation time of 1, 3 and 10 seconds respectively. Finally, we also show a robot using our algorithm for performing a few reactive responses.
Article
Appearance-based estimation of grasp affordances is desirable when 3-D scans become unreliable due to clutter or material properties. We develop a general framework for estimating grasp affordances from 2-D sources, including local texture-like measures as well as object-category measures that capture previously learned grasp strategies. Local approaches to estimating grasp positions have been shown to be effective in real-world scenarios, but are unable to impart object-level biases and can be prone to false positives. We describe how global cues can be used to compute continuous pose estimates and corresponding grasp point locations, using a max-margin optimization for category-level continuous pose regression. We provide a novel dataset to evaluate visual grasp affordance estimation; on this dataset we show that a fused method outperforms either local or global methods alone, and that continuous pose estimation improves over discrete output models. Finally, we demonstrate our autonomous object detection and grasping system on the Willow Garage PR2 robot.
Article
A parsing algorithm which seems to be the most efficient general context-free algorithm known is described. It is similar to both Knuth's LR(k) algorithm and the familiar top-down algorithm. It has a time bound proportional to n³ (where n is the length of the string being parsed) in general; it has an n² bound for unambiguous grammars; and it runs in linear time on a large class of grammars, which seems to include most practical context-free programming language grammars. In an empirical comparison it appears to be superior to the top-down and bottom-up algorithms studied by Griffiths and Petrick.
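For reference, the recognizer at the heart of the algorithm fits in a short function. A minimal Python sketch (epsilon productions omitted for brevity; the `grammar` encoding is an assumption of this sketch):

```python
def earley_recognize(grammar, start, tokens):
    """Minimal Earley recognizer. `grammar` maps each nonterminal to a list
    of right-hand sides (tuples of symbols); a symbol is a terminal iff it
    is not a key of `grammar`. A state is (head, body, dot, origin)."""
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]           # chart[i]: states after i tokens
    chart[0] = {(start, body, 0, 0) for body in grammar[start]}
    for i in range(n + 1):
        agenda = list(chart[i])
        while agenda:
            head, body, dot, origin = agenda.pop()
            if dot < len(body):
                sym = body[dot]
                if sym in grammar:                  # predictor: expand a nonterminal
                    for rhs in grammar[sym]:
                        state = (sym, rhs, 0, i)
                        if state not in chart[i]:
                            chart[i].add(state); agenda.append(state)
                elif i < n and tokens[i] == sym:    # scanner: consume a terminal
                    chart[i + 1].add((head, body, dot + 1, origin))
            else:                                   # completer: advance waiting states
                for h2, b2, d2, o2 in list(chart[origin]):
                    if d2 < len(b2) and b2[d2] == head:
                        state = (h2, b2, d2 + 1, o2)
                        if state not in chart[i]:
                            chart[i].add(state); agenda.append(state)
    return any(h == start and d == len(b) and o == 0 for h, b, d, o in chart[n])

# e.g. earley_recognize({'S': [('S', '+', 'n'), ('n',)]}, 'S', ['n', '+', 'n']) -> True
```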
Learning how objects function via co-analysis of interactions
  • Ruizhen Hu
  • Oliver Van Kaick
  • Bojian Wu
  • Hui Huang
  • Ariel Shamir
  • Hao Zhang
PointNet++: Deep hierarchical feature learning on point sets in a metric space
  • Charles Ruizhongtai Qi
  • Li Yi
  • Hao Su
  • Leonidas J. Guibas
A scalable active framework for region annotation in 3D shape collections
  • Li Yi
  • Vladimir G. Kim
  • Duygu Ceylan
  • I-Chao Shen
  • Mengyan Yan
  • Hao Su
PartNet: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding
  • Kaichun Mo
  • Shilin Zhu
  • Angel X. Chang
  • Li Yi
  • Subarna Tripathi
  • Leonidas J. Guibas