December 2024 · 1 Read
December 2024 · 1 Read
November 2024 · 3 Reads
Episodic memory retrieval aims to endow wearable devices with the ability to recollect, from past video observations, objects or events that have been observed (e.g., "where did I last see my smartphone?"). Despite the clear relevance of the task for a wide range of assistive systems, current task formulations are based on the "offline" assumption that the full video history can be accessed when the user makes a query, which is unrealistic in real settings, where wearable devices are limited in power and storage capacity. We introduce the novel task of Online Episodic Memory Visual Queries Localization (OEM-VQL), in which models are required to work in an online fashion, observing video frames only once and relying on past computations to answer user queries. To tackle this challenging task, we propose ESOM (Egocentric Streaming Object Memory), a novel framework based on an object discovery module to detect potentially interesting objects, a visual object tracker to follow their position through the video in an online fashion, and a memory module that stores spatio-temporal object coordinates and image representations and can be queried efficiently at any moment. Comparisons with different baselines and offline methods show that OEM-VQL is challenging and that ESOM is a viable approach to the task, outperforming offline methods (81.92% vs. 55.89% success rate) when oracular object discovery and tracking are considered. Our analysis also sheds light on the limited performance of object detection and tracking in egocentric vision, providing a principled benchmark based on the OEM-VQL downstream task to assess progress in these areas.
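The abstract describes ESOM only at the component level (discovery, tracking, memory) and gives no implementation details. The Python sketch below is therefore just an illustration of how a streaming object memory of this kind could be organized; the tracker output format and the encoder are hypothetical stand-ins, not the authors' implementation.

# Minimal sketch of a streaming object memory in the spirit of ESOM.
# The tracker output format and the encoder are hypothetical stand-ins;
# the paper does not specify these interfaces.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class MemoryEntry:
    track_id: int
    frame_idx: int
    box: tuple              # (x1, y1, x2, y2) in frame coordinates
    embedding: np.ndarray   # visual descriptor of the object crop

@dataclass
class StreamingObjectMemory:
    entries: list = field(default_factory=list)

    def update(self, frame_idx, tracks, encoder):
        # Called once per incoming frame: store the coordinates and
        # appearance features of every tracked object; the frame itself
        # is never revisited afterwards.
        for track_id, box, crop in tracks:
            self.entries.append(
                MemoryEntry(track_id, frame_idx, box, encoder(crop)))

    def query(self, query_embedding, top_k=1):
        # Answer a visual query ("where did I last see X?") by ranking
        # stored entries by cosine similarity, preferring recent frames on ties.
        def score(e):
            denom = np.linalg.norm(e.embedding) * np.linalg.norm(query_embedding) + 1e-8
            return float(np.dot(e.embedding, query_embedding) / denom)
        ranked = sorted(self.entries, key=lambda e: (score(e), e.frame_idx), reverse=True)
        return ranked[:top_k]

In this reading, storage grows with the number of tracked objects rather than with the number of frames, which is what would make querying at an arbitrary moment feasible on a storage-constrained wearable device.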
November 2024 · 4 Reads
Unsupervised domain adaptation remains a critical challenge in enabling the knowledge transfer of models across unseen domains. Existing methods struggle to balance the need for domain-invariant representations with preserving domain-specific features, often because alignment approaches force samples with similar semantics to be projected close together in the latent space despite drastic domain differences. We introduce a novel approach that shifts the focus from aligning representations in absolute coordinates to aligning the relative positioning of equivalent concepts in latent spaces. The method defines a domain-agnostic structure upon the semantic/geometric relationships between class labels in language space and uses it to guide adaptation, ensuring that the organization of samples in visual space reflects reference inter-class relationships while preserving domain-specific characteristics. Remarkably, it surpasses previous works in 18 different adaptation scenarios across four diverse image and video datasets, with average accuracy improvements of +3.32% on DomainNet, +5.75% on GeoPlaces, +4.77% on GeoImnet, and a +1.94% mean class accuracy improvement on EgoExo4D.
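The abstract states the idea (align relative positions of concepts rather than absolute coordinates) without giving the loss. As a rough illustration only, the PyTorch sketch below matches the pairwise similarity structure of visual class prototypes to the structure defined by class-name embeddings in language space; the prototype computation and the MSE objective are assumptions, not the paper's actual formulation.

# Illustrative relational-alignment loss: match the pairwise similarity
# structure of visual class prototypes to the reference structure given
# by class-name embeddings, instead of aligning absolute coordinates.
import torch
import torch.nn.functional as F

def relational_alignment_loss(visual_feats, labels, text_class_embs):
    # text_class_embs: (C, d_text) frozen embeddings of the C class names
    # visual_feats:    (B, d_vis) features of the current batch
    # labels:          (B,) class indices (source labels or target pseudo-labels)
    C = text_class_embs.size(0)
    device = visual_feats.device
    # Class prototypes in visual space (mean feature of each class in the batch).
    prototypes = torch.stack([
        visual_feats[labels == c].mean(dim=0) if (labels == c).any()
        else torch.zeros(visual_feats.size(1), device=device)
        for c in range(C)])
    present = torch.tensor([bool((labels == c).any()) for c in range(C)], device=device)
    # Pairwise cosine similarities encode the "relative positions" of classes
    # in each space; only these structures are compared.
    ref = F.normalize(text_class_embs, dim=1)
    vis = F.normalize(prototypes, dim=1)
    ref_sim = ref @ ref.t()
    vis_sim = vis @ vis.t()
    mask = present[:, None] & present[None, :]
    return F.mse_loss(vis_sim[mask], ref_sim[mask])

Because only the similarity matrices are matched, features are free to keep domain-specific scale and placement in the visual space, which is one way to read the abstract's claim of preserving domain-specific characteristics.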
November 2024 · 10 Reads
Identifying procedural errors online from egocentric videos is a critical yet challenging task across various domains, including manufacturing, healthcare, and skill-based training. The nature of such mistakes is inherently open-set, as unforeseen or novel errors may occur, necessitating robust detection systems that do not rely on prior examples of failure. Currently, however, no technique effectively detects open-set procedural mistakes online. We propose a dual-branch architecture to address this problem in an online fashion: one branch continuously performs step recognition from the input egocentric video, while the other anticipates future steps based on the recognition module's output. Mistakes are detected as mismatches between the currently recognized action and the action predicted by the anticipation module. The recognition branch takes input frames, predicts the current action, and aggregates frame-level results into action tokens. The anticipation branch leverages the strong pattern-matching capabilities of Large Language Models (LLMs) to predict action tokens based on previously predicted ones. Given the online nature of the task, we also thoroughly benchmark the difficulties associated with per-frame evaluations, particularly the need for accurate and timely predictions in dynamic online scenarios. Extensive experiments on two procedural datasets demonstrate the challenges and opportunities of leveraging a dual-branch architecture for mistake detection, showcasing the effectiveness of our proposed approach. In a thorough evaluation including recognition and anticipation variants and state-of-the-art models, our method demonstrates its robustness and effectiveness in online applications.
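The abstract specifies the dual-branch logic (recognize the current step, anticipate the next one, flag mismatches) but not the interfaces of the two branches. The sketch below captures only that control flow; recognize_step and anticipate_next_step are hypothetical placeholders for the recognition model and the LLM-based anticipation module.

# Schematic of the dual-branch idea: recognize the current step, anticipate
# the next one from the history of recognized steps, and flag mismatches
# as potential procedural mistakes. The two callables are placeholders.

def detect_mistakes_online(frame_stream, recognize_step, anticipate_next_step):
    history = []          # action tokens recognized so far
    expected = None       # step anticipated from the history
    for frame in frame_stream:
        current = recognize_step(frame, history)   # e.g., an action token like "pour water"
        if current is None:
            continue                               # no new step boundary at this frame
        if expected is not None and current != expected:
            yield {"frame": frame, "observed": current,
                   "expected": expected, "mistake": True}
        history.append(current)
        expected = anticipate_next_step(history)   # LLM-style next-token prediction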
November 2024 · 3 Reads · 1 Citation
November 2024 · 3 Reads · 2 Citations
October 2024 · 1 Read · 2 Citations
October 2024 · 3 Reads
October 2024
October 2024 · 51 Reads
... Ohkawa et al. [99] aid further adaptation by performing view-invariant pretraining and finetuning. Different from previous techniques, Quattrocchi et al. [100] propose an adaptation technique for temporal action segmentation. ...
November 2024
... Hotspot anticipation [8,61,62,78,80] is a related problem of localizing interaction points as heatmaps in 2D video frames. However, this prior work has no persistent 3D spatial understanding. ...
October 2024
... In contrast, existing works primarily focus on mistake detection, which limits direct comparisons with YETI's broader intervention detection capabilities. PREGO [9] (Mistake Detection in PRocedural EGOcentric Videos) targets online mistake detection in a manner akin to YETI's real-time intervention detection. Nevertheless, PREGO is confined to the Mistake Detection intervention type and does not address other scenarios where AI intervention could be beneficial. ...
June 2024
... We use multiple domains of embodiment data related to robotic manipulation. In the largest dataset mixture, the 35 real-robot datasets come primarily from the Open-X embodiment dataset [35], the 3 human video datasets contain egocentric hand motions [12,13,17], and the 2 simulation datasets contain standard benchmarks [32,54]. We follow the same dataset processing as in previous work [49], using 2D hand detections as action label proxies. ...
June 2024
... Ego4D [12] features massive-scale egocentric videos covering a wide range of daily activities, with video durations ranging from 5 seconds to 7 hours. Several subsequent works [6,13,16,26,35] based on Ego4D have also emerged. However, they fall short of capturing the full scope of long-term daily human activities, as they mainly focus on isolated clips or recordings. ...
June 2024
... We demonstrate the effectiveness and efficiency of our approach on Ego4D [8], a large-scale egocentric vision dataset. To summarize, our main contributions are: 1) We introduce a unified video understanding architecture to learn multiple egocentric vision tasks with different temporal granularity, while requiring minimal task-specific overhead; 2) We present Temporal Distance Gated Convolution (TDGC), a novel GNN layer for egocentric vision tasks that require a strong sense of time; 3) We extend EgoPack to the Moment Queries task, which involves the localization of activities that range from a few seconds to several minutes in duration; 4) Hier-EgoPack achieves strong performance on five Ego4D [8] benchmarks, using the same architecture and showing the importance of cross-task interaction. ...
July 2024
IEEE Transactions on Pattern Analysis and Machine Intelligence
... Egocentric vision captures human activities from the privileged perspective of the camera wearer, allowing a unique point of view on their actions [9], [10]. Recently, the field has seen rapid development thanks to the release of several large-scale egocentric vision datasets [8], [11]- [15]. ...
May 2024
International Journal of Computer Vision
... There has recently been significant interest in applying AI to industrial use cases, especially in what is known as the 4th Industrial Revolution (or Industry 4.0) [19]. More recently, LLMs have been applied throughout the product development lifecycle [20], including design [15], [21], as well as in conversational assistants [22]. ...
January 2024
... Datasets such as Meccano [6], HRI30 [7], HA4M [8] and Enigma [9] have contributed significantly by capturing human-object interactions and providing annotated multimodal data for task localization, object recognition, and action classification. While these datasets have advanced research in human-robot collaboration, they still struggle to fully capture the intricate and dynamic nature of industrial workflows where tasks overlap and engagement levels fluctuate. ...
January 2024
... In fact, while many datasets are available for the conventional third-person view, comprehensive datasets are lacking for this kind of task. Most relevant datasets, like the one used in [32], rely on artificial data, where real 2D images were transformed into 3D artificial images of hands for further processing. In contrast, the HOI4D dataset, built by Liu et al. [33], is well suited to our study because it provides real image-based hand annotations. ...
March 2024
Computer Vision and Image Understanding