Hanbyul Joo’s research while affiliated with Fico and other places


Publications (38)


Ego4D: Around the World in 3,000 Hours of Egocentric Video
  • Article
  • Full-text available

July 2024 · 152 Reads · 179 Citations

IEEE Transactions on Pattern Analysis and Machine Intelligence

Kristen Grauman · Andrew Westbury · Eugene Byrne · [...]

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards, with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/
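The scale figures above come from clip-level metadata distributed with the dataset. As a rough illustration of working with such metadata, the hedged sketch below tallies hours of video per scenario from a hypothetical JSON manifest; the file name and field names ("scenario", "duration_sec") are assumptions made for illustration, not the dataset's actual schema (see https://ego4d-data.org/ for the real format).

```python
# Hypothetical sketch: summarizing an Ego4D-style clip manifest by scenario.
# The manifest layout and field names below are illustrative assumptions,
# not the dataset's published schema.
import json
from collections import defaultdict

def hours_per_scenario(manifest_path: str) -> dict:
    """Group clips by scenario label and report total hours of video."""
    with open(manifest_path) as f:
        clips = json.load(f)  # assumed: a list of per-clip metadata dicts

    totals = defaultdict(float)
    for clip in clips:
        totals[clip["scenario"]] += clip["duration_sec"] / 3600.0
    return dict(totals)

if __name__ == "__main__":
    for scenario, hours in sorted(hours_per_scenario("manifest.json").items()):
        print(f"{scenario:20s} {hours:8.1f} h")
```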





Preview figures for the entry below: Figure 6 (ablation details for Table 2: removing the VQ-VAE, motion-only input, audio-only input, and fusing modalities by concatenation instead of a cross-modal transformer); a table comparing against ground-truth annotations on in-the-wild data (↓ means lower is better; best statistically significant results in bold).
Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion

April 2022 · 104 Reads

We present a framework for modeling interactional communication in dyadic conversations: given multimodal inputs of a speaker, we autoregressively output multiple possibilities of corresponding listener motion. We combine the motion and speech audio of the speaker using a motion-audio cross attention transformer. Furthermore, we enable non-deterministic prediction by learning a discrete latent representation of realistic listener motion with a novel motion-encoding VQ-VAE. Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions. Moreover, it produces realistic 3D listener facial motion synchronous with the speaker (see video). We demonstrate that our method outperforms baselines qualitatively and quantitatively via a rich suite of experiments. To facilitate this line of research, we introduce a novel and large in-the-wild dataset of dyadic conversations. Code, data, and videos available at https://evonneng.github.io/learning2listen/.
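To make the motion-audio cross attention idea concrete, here is a minimal, hedged PyTorch sketch in which listener-motion tokens (queries) attend to the speaker's fused motion and audio features (keys/values). The feature sizes, projection layers, and module layout are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the cross-modal attention idea: listener tokens attend to
# the speaker's concatenated motion and audio features. All dimensions are
# illustrative assumptions, not the paper's actual configuration.
import torch
import torch.nn as nn

class SpeakerCrossAttention(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.audio_proj = nn.Linear(128, d_model)   # assumed audio feature size
        self.motion_proj = nn.Linear(56, d_model)   # assumed face-motion feature size
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, listener_tokens, speaker_motion, speaker_audio):
        # Fuse the speaker's two modalities along the time axis, then let the
        # listener's (discrete-latent) tokens attend to the fused sequence.
        speaker = torch.cat(
            [self.motion_proj(speaker_motion), self.audio_proj(speaker_audio)], dim=1
        )
        out, _ = self.attn(query=listener_tokens, key=speaker, value=speaker)
        return out

# Toy usage: batch of 2, 32 listener tokens, 64 speaker frames.
out = SpeakerCrossAttention()(torch.randn(2, 32, 256),
                              torch.randn(2, 64, 56),
                              torch.randn(2, 64, 128))
print(out.shape)  # torch.Size([2, 32, 256])
```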


BANMo: Building Animatable 3D Neural Models from Many Casual Videos

December 2021 · 67 Reads

Prior work for articulated 3D shape reconstruction often relies on specialized sensors (e.g., synchronized multi-camera systems) or pre-built 3D deformable models (e.g., SMAL or SMPL). Such methods are not able to scale to diverse sets of objects in the wild. We present BANMo, a method that requires neither a specialized sensor nor a pre-defined template shape. BANMo builds high-fidelity, articulated 3D models (including shape and animatable skinning weights) from many monocular casual videos in a differentiable rendering framework. While the use of many videos provides more coverage of camera views and object articulations, it introduces significant challenges in establishing correspondence across scenes with different backgrounds, illumination conditions, etc. Our key insight is to merge three schools of thought: (1) classic deformable shape models that make use of articulated bones and blend skinning, (2) volumetric neural radiance fields (NeRFs) that are amenable to gradient-based optimization, and (3) canonical embeddings that generate correspondences between pixels and an articulated model. We introduce neural blend skinning models that allow for differentiable and invertible articulated deformations. When combined with canonical embeddings, such models allow us to establish dense correspondences across videos that can be self-supervised with cycle consistency. On real and synthetic datasets, BANMo shows higher-fidelity 3D reconstructions than prior works for humans and animals, with the ability to render realistic images from novel viewpoints and poses. Project webpage: banmo-www.github.io.
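The "articulated bones and blend skinning" ingredient can be illustrated with plain linear blend skinning: each canonical 3D point is deformed by a skinning-weighted combination of per-bone rigid transforms. The NumPy sketch below shows that operation; in BANMo the weights and transforms are predicted and optimized differentiably, whereas here they are simply inputs, and all shapes are illustrative assumptions.

```python
# Minimal sketch of the blend-skinning idea BANMo builds on: each canonical 3D
# point is deformed by a skinning-weighted sum of per-bone rigid transforms.
# In BANMo the weights come from a network; here they are just an input.
import numpy as np

def blend_skinning(points, weights, rotations, translations):
    """points: (N, 3) canonical points
    weights: (N, B) skinning weights, each row summing to 1
    rotations: (B, 3, 3) per-bone rotation matrices
    translations: (B, 3) per-bone translations
    returns: (N, 3) deformed points"""
    # Apply every bone's rigid transform to every point: result is (B, N, 3).
    per_bone = np.einsum("bij,nj->bni", rotations, points) + translations[:, None, :]
    # Blend the per-bone results with the skinning weights: result is (N, 3).
    return np.einsum("nb,bni->ni", weights, per_bone)

# Toy usage: 4 points, 2 bones; identity transforms leave the points unchanged.
pts = np.random.rand(4, 3)
w = np.full((4, 2), 0.5)
R = np.stack([np.eye(3)] * 2)
t = np.zeros((2, 3))
assert np.allclose(blend_skinning(pts, w, R, t), pts)
```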



Ego4D: Around the World in 3,000 Hours of Egocentric Video

October 2021 · 1,016 Reads · 3 Citations

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,025 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 855 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/



FrankMocap: A Monocular 3D Whole-Body Pose Estimation System via Regression and Integration

August 2021 · 259 Reads

Most existing monocular 3D pose estimation approaches only focus on a single body part, neglecting the fact that the essential nuance of human motion is conveyed through a concert of subtle movements of face, hands, and body. In this paper, we present FrankMocap, a fast and accurate whole-body 3D pose estimation system that can produce 3D face, hands, and body simultaneously from in-the-wild monocular images. The core idea of FrankMocap is its modular design: We first run 3D pose regression methods for face, hands, and body independently, followed by composing the regression outputs via an integration module. The separate regression modules allow us to take full advantage of their state-of-the-art performances without compromising the original accuracy and reliability in practice. We develop three different integration modules that trade off between latency and accuracy. All of them are capable of providing simple yet effective solutions to unify the separate outputs into seamless whole-body pose estimation results. We quantitatively and qualitatively demonstrate that our modularized system outperforms both the optimization-based and end-to-end methods of estimating whole-body pose.
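The modular "regress then integrate" design can be sketched as follows: independent body and hand regressors each return their own parameters, and an integration step composes them into one whole-body pose vector. The placeholder regressors, parameter names, and simple copy-style merge below are illustrative assumptions rather than FrankMocap's actual integration modules.

```python
# Minimal sketch of the modular regress-then-integrate idea: independent
# body- and hand-pose estimates are merged into one whole-body parameter set.
# The parameter names and the copy-style merge are illustrative assumptions.
import numpy as np

def run_body_regressor(image):   # placeholder for a body-only 3D pose network
    return {"body_pose": np.zeros(63), "wrist_pose": np.zeros(6)}

def run_hand_regressor(image):   # placeholder for a hand-only 3D pose network
    return {"left_hand_pose": np.zeros(45), "right_hand_pose": np.zeros(45)}

def integrate(body_out, hand_out):
    """Compose the separate outputs into a single whole-body pose vector."""
    return np.concatenate([
        body_out["body_pose"],        # body joints from the body module
        body_out["wrist_pose"],       # wrists kept from the body estimate
        hand_out["left_hand_pose"],   # finger articulation from the hand module
        hand_out["right_hand_pose"],
    ])

image = np.zeros((256, 256, 3))
whole_body = integrate(run_body_regressor(image), run_hand_regressor(image))
print(whole_body.shape)  # (159,)
```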


Citations (24)


... Two public data sets were selected for this work: CMU Panoptic (CMU) [43] and Human3.6M (H3.6M) [34]. These data sets were selected because they are public, large, and provide RGB-D data as well as 3D annotations for joint locations. ...

Reference: The effect of depth data and upper limb impairment on lightweight monocular RGB human pose estimation models
Panoptic Studio: A Massively Multiview System for Social Interaction Capture
  • Citing Preprint
  • December 2016

... While recent works [11,27,87] employ raw pixels for prediction, we observe that pixel-space prediction forces models to attend to noisy, task-irrelevant details (e.g., textures, lighting) [30]. This issue is amplified in web-scale and crowd-sourced video datasets [29], where uncontrolled capture conditions introduce further variability. Inspired by joint-embedding predictive architectures (JEPA) [4,5,96], we propose using DINOv2 [62] spatial patch features as semantically rich representations. ...

Ego4D: Around the World in 3,000 Hours of Egocentric Video

IEEE Transactions on Pattern Analysis and Machine Intelligence

... This approach is difficult to scale to more common everyday objects. An alternative approach that emerged recently is to learn a 3D prior from 2D images only [58,67,74,81]. However, learning such priors from in-the-wild images is a major challenge and typically requires significant category-specific design choices, with various solutions having been proposed for different object categories [23,32,68,69,81]. ...

BANMo: Building Animatable 3D Neural Models from Many Casual Videos
  • Citing Conference Paper
  • June 2022

... With advancements in wearable camera technology, Ego4D [8] introduces the Natural Language Query (NLQ) task for egocentric video grounding. NLQ aims to identify the specific video moment that answers a question-type query within an untrimmed egocentric video, as shown in Figure 1(b). ...

Ego4D: Around the World in 3,000 Hours of Egocentric Video
  • Citing Conference Paper
  • June 2022

... However, these works all concentrate on individual speaker modeling, without considering interactive scenarios, especially modeling the listener's response. Recently, several researchers have explored latent space modeling to generate listener motions from speaker information [23,24,27,42,51,59]. Some of them have employed simplistic emotion labels to control the generation of listener motions and achieved good results [23,42]. ...

Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion
  • Citing Conference Paper
  • June 2022

... This paper focuses on estimating 3D human poses from a single image using the SMPL model. State-of-the-art methods are typically optimization-based, refining poses iteratively to minimize the difference between projected and detected points (Joo, Neverova, and Vedaldi 2021), or regression-based, directly inferring pose parameters from images using deep learning (Choutas et al. 2022). However, most works focus solely on single pose estimation without incorporating language interaction, limiting practical usability. ...

Exemplar Fine-Tuning for 3D Human Model Fitting Towards In-the-Wild 3D Human Pose Estimation
  • Citing Conference Paper
  • December 2021

... Optimization-based methods fit a parametric model [55,62,90] to image cues such as keypoints [7,62,90], silhouettes [16,61], or body-part segmentation masks [45]. Learning-based methods directly infer body-model parameters from images [17,40,46,48,68,78] or videos [36,39]. However, some methods infer bodies in model-free fashion as vertices [41,51,52] or via implicit functions [56,70,89]. ...

FrankMocap: A Monocular 3D Whole-Body Pose Estimation System via Regression and Integration
  • Citing Conference Paper
  • October 2021

... Bimanual interactions are fundamental in human activities, such as collaborative tasks, emotion expression, and intention communication. Understanding these interactions is crucial for applications in augmented reality (AR)/virtual reality (VR) [15], [43], human-computer interaction (HCI), and social signal understanding [9], [10], [19], [28]-[31], [36]. Accurate modeling and real-time reconstruction of bimanual interactions not only enhance system responsiveness but also improve the user experience by enabling natural and intuitive interactions. ...

Body2Hands: Learning to Infer 3D Hands from Conversational Gesture Body Dynamics
  • Citing Conference Paper
  • June 2021

... Foundation models and robotics. Large-scale internet pre-training has seen recent success in the domains of vision and natural language processing [9,45,46,47,9,48]. Recent work has investigated if these models can be trained and/or fine-tuned for downstream robotics tasks [32,49,50,11,51,52]. ...

Ego4D: Around the World in 3,000 Hours of Egocentric Video

... However, the optimization-based method is sensitive to the initial value and has iteration time overhead. Xiang et al. [44], Hassan et al. [45], Zhang et al. [46] analyzed the relationship between the image and the objects related to the human body. The regression-based method directly predicts the human body model through deep learning. ...

Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild
  • Citing Chapter
  • October 2020

Lecture Notes in Computer Science