Yoichi Sato

Yoichi Sato
  • Ph.D.
  • Professor (Full) at The University of Tokyo

About

359
Publications
71,426
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
13,725
Citations
Current institution
The University of Tokyo
Current position
  • Professor (Full)

Publications

Publications (359)
Preprint
Full-text available
We present a framework for pre-training of 3D hand pose estimation from in-the-wild hand images sharing with similar hand characteristics, dubbed SimHand. Pre-training with large-scale images achieves promising results in various tasks, but prior methods for 3D hand pose pre-training have not fully utilized the potential of diverse hand images acce...
Preprint
We present a contrastive learning framework based on in-the-wild hand images tailored for pre-training 3D hand pose estimators, dubbed HandCLR. Pre-training on large-scale images achieves promising results in various tasks, but prior 3D hand pose pre-training methods have not fully utilized the potential of diverse hand images accessible from in-th...
Article
Full-text available
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to upho...
Preprint
Full-text available
In this paper, we address the challenge of fine-grained video event understanding in traffic scenarios, vital for autonomous driving and safety. Traditional datasets focus on driver or vehicle behavior, often neglecting pedestrian perspectives. To fill this gap, we introduce the WTS dataset, highlighting detailed behaviors of both vehicles and pede...
Preprint
Delving into the realm of egocentric vision, the advancement of referring video object segmentation (RVOS) stands as pivotal in understanding human activities. However, existing RVOS task primarily relies on static attributes such as object names to segment target objects, posing challenges in distinguishing target objects from background objects a...
Preprint
Full-text available
Compared with visual signals, Inertial Measurement Units (IMUs) placed on human limbs can capture accurate motion signals while being robust to lighting variation and occlusion. While these characteristics are intuitively valuable to help egocentric action recognition, the potential of IMUs remains under-explored. In this work, we present a novel m...
Article
Full-text available
The task of few-shot action recognition aims to recognize novel action classes using only a small number of labeled training samples. How to better describe the action in each video and how to compare the similarity between videos are two of the most critical factors in this task. Directly describing the video globally or by its individual frames c...
Article
Full-text available
In this work, we present a novel method for simultaneously controlling the head pose and the facial expressions of a given input image using a 3D keypoint-based GAN. Existing methods for controlling head pose and expressions simultaneously are not suitable for real images, or they generate unnatural results because it is not trivial to capture head...
Article
Full-text available
With the development of robotics and wearable devices, there is a need for information processing under the assumption that an agent itself is mobile. Especially, understanding an acoustic environment around an agent is an important issue. In this paper, we solve a task in which a moving agent estimates the Direction of Arrival (DoA) of the surroun...
Article
Full-text available
In this survey, we present a systematic review of 3D hand pose estimation from the perspective of efficient annotation and learning. 3D hand pose estimation has been an important research area owing to its potential to enable various applications, such as video understanding, AR/VR, and robotics. However, the performance of models is tied to the qu...
Preprint
Full-text available
The ability to automatically detect and track surgical instruments in endoscopic videos can enable transformational interventions. Assessing surgical performance and efficiency, identifying skilled tool use and choreography, and planning operational and logistical aspects of OR resources are just a few of the applications that could benefit. Unfort...
Preprint
Full-text available
The ability to automatically detect and track surgical instruments in endoscopic videos can enable transformational interventions. Assessing surgical performance and efficiency, identifying skilled tool use and choreography, and planning operational and logistical aspects of OR resources are just a few of the applications that could benefit. Unfort...
Preprint
Full-text available
Object affordance is an important concept in hand-object interaction, providing information on action possibilities based on human motor capacity and objects' physical property thus benefiting tasks such as action anticipation and robot imitation learning. However, the definition of affordance in existing datasets often: 1) mix up affordance with o...
Article
Semantic segmentation achieves significant success through large-scale training data. Meanwhile, few-shot semantic segmentation was proposed to segment image regions of novel classes through few labeled testing data. However, it ignores classes previously learned from training data. This paper proposes a new segmentation framework called segmentati...
Preprint
Full-text available
Image cropping has progressed tremendously under the data-driven paradigm. However, current approaches do not account for the intentions of the user, which is an issue especially when the composition of the input image is complex. Moreover, labeling of cropping data is costly and hence the amount of data is limited, leading to poor generalization p...
Chapter
We aim to improve the performance of regressing hand keypoints and segmenting pixel-level hand masks under new imaging conditions (e.g., outdoors) when we only have labeled images taken under very different conditions (e.g., indoors). In the real world, it is important that the model trained for both tasks works under various imaging conditions. Ho...
Article
People spend an enormous amount of time and effort looking for lost objects. To help remind people of the location of lost objects, various computational systems that provide information on their locations have been developed. However, prior systems for assisting people in finding objects require users to register the target objects in advance. Thi...
Chapter
We introduce a scalable framework for novel view synthesis from RGB-D images with largely incomplete scene coverage. While generative neural approaches have demonstrated spectacular results on 2D images, they have not yet achieved similar photorealistic results in combination with scene completion where a spatial 3D scene understanding is essential...
Chapter
Few-shot action recognition aims to recognize novel action classes using only a small number of labeled training samples. In this work, we propose a novel approach that first summarizes each video into compound prototypes consisting of a group of global prototypes and a group of focused prototypes, and then compares video similarity based on the pr...
Chapter
Automated video-based assessment of surgical skills is a promising task in assisting young surgical trainees, especially in poor-resource areas. Existing works often resort to a CNN-LSTM joint framework that models long-term relationships by LSTMs on spatially pooled short-term CNN features. However, this practice would inevitably neglect the diffe...
Preprint
Full-text available
Automated video-based assessment of surgical skills is a promising task in assisting young surgical trainees, especially in poor-resource areas. Existing works often resort to a CNN-LSTM joint framework that models long-term relationships by LSTMs on spatially pooled short-term CNN features. However, this practice would inevitably neglect the diffe...
Preprint
Full-text available
We introduce a scalable framework for novel view synthesis from RGB-D images with largely incomplete scene coverage. While generative neural approaches have demonstrated spectacular results on 2D images, they have not yet achieved similar photorealistic results in combination with scene completion where a spatial 3D scene understanding is essential...
Preprint
Full-text available
Few-shot action recognition aims to recognize novel action classes using only a small number of labeled training samples. In this work, we propose a novel approach that first summarizes each video into compound prototypes consisting of a group of global prototypes and a group of focused prototypes, and then compares video similarity based on the pr...
Preprint
Full-text available
Object affordance is an important concept in human-object interaction, providing information on action possibilities based on human motor capacity and objects' physical property thus benefiting tasks such as action anticipation and robot imitation learning. However, existing datasets often: 1) mix up affordance with object functionality; 2) confuse...
Preprint
Full-text available
We study the problem of identifying object instances in a dynamic environment where people interact with the objects. In such an environment, objects' appearance changes dynamically by interaction with other entities, occlusion by hands, background change, etc. This leads to a larger intra-instance variation of appearance than in static environment...
Preprint
Full-text available
In this survey, we present comprehensive analysis of 3D hand pose estimation from the perspective of efficient annotation and learning. In particular, we study recent approaches for 3D hand pose annotation and learning methods with limited annotated data. In 3D hand pose estimation, collecting 3D hand pose annotation is a key step in developing han...
Preprint
Full-text available
We aim to improve the performance of regressing hand keypoints and segmenting pixel-level hand masks under new imaging conditions (e.g., outdoors) when we only have labeled images taken under very different conditions (e.g., indoors). In the real world, it is important that the model trained for both tasks works under various imaging conditions. Ho...
Preprint
Full-text available
Detecting the positions of human hands and objects-in-contact (hand-object detection) in each video frame is vital for understanding human activities from videos. For training an object detector, a method called Mixup, which overlays two training images to mitigate data bias, has been empirically shown to be effective for data augmentation. However...
Article
Full-text available
Learning cross-modal features is an essential task for many multimedia applications such as sound localization, audio-visual alignment, and image/audio retrieval. Most existing methods mainly focus on the semantic correspondence between videos and monaural sounds, and spatial information of sound sources has not been considered. However, sound loca...
Preprint
Full-text available
The human gaze is a cost-efficient physiological data that reveals human underlying attentional patterns. The selective attention mechanism helps the cognition system focus on task-relevant visual clues by ignoring the presence of distractors. Thanks to this ability, human beings can efficiently learn from a very limited number of training samples....
Preprint
Full-text available
First-person action recognition is a challenging task in video understanding. Because of strong ego-motion and a limited field of view, many backgrounds or noisy frames in a first-person video can distract an action recognition model during its learning process. To encode more discriminative features, the model needs to have the ability to focus on...
Preprint
Full-text available
Every hand-object interaction begins with contact. Despite predicting the contact state between hands and objects is useful in understanding hand-object interactions, prior methods on hand-object analysis have assumed that the interacting hands and objects are known, and were not studied in detail. In this study, we introduce a video-based method f...
Preprint
Full-text available
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,025 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 855 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to upho...
Preprint
Full-text available
The attribution method provides a direction for interpreting opaque neural networks in a visual way by identifying and visualizing the input regions/pixels that dominate the output of a network. Regarding the attribution method for visually explaining video understanding networks, it is challenging because of the unique spatiotemporal dependencies...
Preprint
Full-text available
Hand segmentation is a crucial task in first-person vision. Since first-person images exhibit strong bias in appearance among different environments, adapting a pre-trained segmentation model to a new domain is required in hand segmentation. Here, we focus on appearance gaps for hand regions and backgrounds separately. We propose (i) foreground-awa...
Article
Full-text available
Hand segmentation is a crucial task in first-person vision. Since first-person images exhibit strong bias in appearance among different environments, adapting a pre-trained segmentation model to a new domain is required in hand segmentation. Here, we focus on appearance gaps for hand regions and backgrounds separately. We propose (i) foreground-awa...
Preprint
Full-text available
In this report, we describe the technical details of our submission to the 2021 EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition. Leveraging multiple modalities has been proved to benefit the Unsupervised Domain Adaptation (UDA) task. In this work, we present Multi-Modal Mutual Enhancement Module (M3EM), a deep modu...
Article
The attribution method provides a direction for interpreting opaque neural networks in a visual way by identifying and visualizing the input regions/pixels that dominate the output of a network. Regarding the attribution method for visually explaining video understanding networks, it is challenging because of the unique spatiotemporal dependencies...
Preprint
Full-text available
People spend an enormous amount of time and effort looking for lost objects. To help remind people of the location of lost objects, various computational systems that provide information on their locations have been developed. However, prior systems for assisting people in finding objects require users to register the target objects in advance. Thi...
Article
Controlling the initiation of cell migration plays a fundamental role in shaping the tissue during embryonic development. During gastrulation in zebrafish, some mesendoderm cells migrate inward to form the endoderm as the innermost germ layer along the yolk syncytial layer. However, how the initiation of inward migration is regulated is poorly unde...
Article
In this work, we address two coupled tasks of gaze prediction and action recognition in egocentric videos by exploring their mutual context: the information from gaze prediction facilitates action recognition and vice versa. Our assumption is that during the procedure of performing a manipulation task, on the one hand, what a person is doing determ...
Article
We propose two online-learning algorithms for modeling the personal preferences of users of interactive systems. The proposed algorithms leverage user feedback to estimate user behavior and provide personalized adaptive recommendation for supporting context-dependent decision-making. We formulate preference modeling as online prediction algorithms...
Preprint
Full-text available
Identifying and visualizing regions that are significant for a given deep neural network model, i.e., attribution methods, is still a vital but challenging task, especially for spatio-temporal networks that process videos as input. Albeit some methods that have been proposed for video attribution, it is yet to be studied what types of network struc...
Article
Joint attention often happens during social interactions, in which individuals share focus on the same object. This article proposes an egocentric vision-based system (ego-vision system) that aims to discover the objects looked at jointly by a group of persons engaged in interactive activities. The proposed system relies on a collection of wearable...
Article
Eye gaze estimation is increasingly demanded by recent intelligent systems to facilitate a range of interactive applications. Unfortunately, learning the highly complicated regression from a single eye image to the gaze direction is not trivial. Thus, the problem is yet to be solved efficiently. Inspired by the two-eye asymmetry as two eyes of the...
Article
In this paper, we propose a restrained random-walk similarity method for detecting the community structures of graphs. The basic premise of our method is that the starting vertices of finite-length random walks are judged to be in the same community if the walkers pass similar sets of vertices. This idea is based on our consideration that a random...
Conference Paper
Full-text available
We present an assistive suitcase system, BBeep, for supporting blind people when walking through crowded environments. BBeep uses pre-emptive sound notifcations to help clear a path by alerting both the user and nearby pedestrians about the potential risk of collision. BBeep triggers notifca-tions by tracking pedestrians, predicting their future po...
Conference Paper
Full-text available
Research in group activity analysis has put attention to monitor the work and evaluate group and individual performance, which can be reflected towards potential improvements in future group interactions. As a new means to examine individual or joint actions in the group activity, our work investigates the potential of detecting and disambiguating...
Conference Paper
Full-text available
This paper presents CoSummary, an adaptive video fast-forwarding technique for browsing surgical videos recorded by wearable cameras. Current wearable technologies allow us to record complex surgical skills, however, an efficient browsing technique for these videos is not well established. In order to assist browsing surgical videos, our study focu...
Preprint
Full-text available
Recent advances in computer vision have made it possible to automatically assess from videos the manipulation skills of humans in performing a task, which has many important applications in domains such as health rehabilitation and manufacturing. However, previous methods used all video appearance as input and did not consider the attention mechani...
Preprint
Full-text available
In this work, we address two coupled tasks of gaze prediction and action recognition in egocentric videos by exploring their mutual context. Our assumption is that in the procedure of performing a manipulation task, what a person is doing determines where the person is looking at, and the gaze point reveals gaze and non-gaze regions which contain i...
Conference Paper
This work presents a novel user interface applying 3D visualization to understand complex group activities from multiple first-person videos. The proposed interface is designed to assist video viewers to easily understand the collaborative relationships of group activity based on where the individual worker is located in a workspace and how multipl...
Article
Full-text available
We present an accurate stereo matching method using local expansion moves based on graph cuts. This new move-making scheme is used to efficiently infer per-pixel 3D plane labels on a pairwise Markov random field (MRF) that effectively combines recently proposed slanted patch matching and curvature regularization terms. The local expansion moves are...
Chapter
We present a new computational model for gaze prediction in egocentric videos by exploring patterns in temporal shift of gaze fixations (attention transition) that are dependent on egocentric manipulation tasks. Our assumption is that the high-level context of how a task is completed in a certain way has a strong influence on attention transition a...
Preprint
Full-text available
This paper proposes a novel method for understanding daily hand-object manipulation by developing computer vision-based techniques. Specifically, we focus on recognizing hand grasp types, object attributes and manipulation actions within an unified framework by exploring their contextual relationships. Our hypothesis is that it is necessary to join...
Article
Recently, many hyperspectral (HS) image superresolution methods that merge a low spatial resolution HS image and a high spatial resolution three-channel RGB image have been proposed in spectral imaging. A largely ignored fact is that most existing commercial RGB cameras capture high resolution images by a single CCD/CMOS sensor equipped with a colo...
Preprint
Full-text available
This work presents stochastic optimization methods targeted at least-squares problems involving Monte Carlo integration. While the most common approach to solving these problems is to apply stochastic gradient descent (SGD) or similar methods such as AdaGrad and Adam, which involve estimating a stochastic gradient from a small number of Monte Carlo...
Conference Paper
Full-text available
Despite the emphasis of involving users with disabilities in the development of accessible interfaces, user trials come with high costs and effort. Particularly considering the diverse range of abilities such as in the case of low vision, simulating the effect of an impairment on interaction with an interface has been approached. As a starting poin...
Conference Paper
This work presents the Dynamic Object Scanning (DO-Scanning), a novel interface that helps users browse long and untrimmed first-person videos quickly. The proposed interface offers users a small set of object cues generated automatically tailored to the context of a given video. Users choose which cue to highlight, and the interface in turn fast-f...
Conference Paper
This work presents the Dynamic Object Scanning (DO-Scanning), a novel interface that helps users browse long and untrimmed first-person videos quickly. The proposed interface offers users a small set of object cues generated automatically tailored to the context of a given video. Users choose which cue to highlight, and the interface in turn adapti...
Article
Full-text available
We present a new computational model for gaze prediction in egocentric videos by exploring patterns in temporal shift of gaze fixations (attention transition) that are dependent on egocentric manipulation tasks. Our assumption is that the high-level context of how a task is completed in a certain way has a strong influence on attention transition a...
Conference Paper
Full-text available
This paper presents a novel interface to support video coding of social attention in the assessment of children with autism spectrum disorder. Video-based evaluations of social attention during therapeutic activities allow observers to find target behaviors while handling the ambiguity of attention. Despite the recent advances in computer vision-ba...
Conference Paper
Understanding the intention of other people is a fundamental social skill in human communication. Eye behavior is an important, yet implicit communication cue. In this work, we focus on enabling people to see the users' gaze associated with objects in the 3D space, namely, we present users the history of gaze linked to real 3D objects. Our 3D gaze...
Article
Full-text available
We present a new task that predicts future locations of people observed in first-person videos. Consider a first-person video stream continuously recorded by a wearable camera. Given a short clip of a person that is extracted from the complete stream, we aim to predict his location in future frames. To facilitate this future person localization abi...
Conference Paper
This work presents EgoScanning, a video fast-forwarding interface that helps users find important events from lengthy first-person videos continuously recorded with wearable cameras. This interface features an elastic timeline that adaptively changes playback speeds and emphasizes egocentric cues specific to first-person videos, such as hand manipu...
Conference Paper
Full-text available
Active involvement of users with disabilities is difficult to employ during the iterative stages of the design process due to high costs and effort associated with user studies. This research proposes a user centered design (UCD) strategy to incorporate the use of gaze-contingent tunnel vision simulation with sighted individuals to facilitate rapid...
Conference Paper
Full-text available
We propose a new multi-frame method for efficiently computing scene flow (dense depth and optical flow) and camera ego-motion for a dynamic scene observed from a moving stereo camera rig. Our technique also segments out moving objects from the rigid scene. In our method, we first estimate the disparity map and the 6-DOF camera motion using stereo m...

Network

Cited By