Ilija Radosavovic’s research while affiliated with University of Chicago and other places


Publications (27)


Figure 2 Training Loss Curves: We show the training loss curves for the base, large, and 1b models trained with tokens from dVAE (Ramesh et al., 2021) with a vocabulary size of 8k and a context length of 4k tokens (equivalent to 16 images or video frames).
Figure 3 1-gram Distribution of Various Tokens: This figure shows the distribution of 1-gram tokens of various tokenizers (dVAE (Ramesh et al., 2021), VQGAN-1k, VQGAN-16k (Esser et al., 2020)) on the ImageNet validation set. Note that dVAE has almost full coverage of the tokens, while VQGAN covers less than 50% of the tokens.
Figure 4 Probing at Different Layers: We show the attention-probing performance at each layer of our three models. Peak performance is observed at around 50% of model depth.
Figure 5 Semi-Supervised Tracking: We follow the protocol in STC (Jabri et al., 2020), starting with the ground-truth segmentation mask and propagating the labels using the features computed by Toto-large. The mask is propagated for up to 60 frames without losing much information.
Figure 6 Robot Manipulation with Reinforcement Learning: We compare MAE-base (Radosavovic et al., 2022) with Toto-base pre-trained models in simulation, following Xiao et al. (2022). We evaluate each model by its mean success rate over training steps. Toto was able to learn these tasks faster than MAE, across two robots and two tasks.
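
As an illustration of the vocabulary-coverage statistic referenced in Figure 3, here is a minimal sketch of how 1-gram token coverage could be computed over a tokenized validation set. The array shapes, vocabulary size, and random stand-in tokens are assumptions for illustration, not the paper's actual tokenizer outputs.

```python
import numpy as np

def token_coverage(token_ids: np.ndarray, vocab_size: int) -> float:
    """Fraction of the vocabulary that appears at least once.

    `token_ids` is a flat array of discrete token ids produced by an image
    tokenizer (e.g. dVAE or VQGAN) over a validation set.
    """
    counts = np.bincount(token_ids.ravel(), minlength=vocab_size)
    return float((counts > 0).mean())

# Hypothetical usage: stand-in tokens over ImageNet-val-sized data.
# A tokenizer with near-full coverage would approach 100%; one that
# collapses onto a subset of codes would report much less.
ids = np.random.randint(0, 8192, size=(50_000, 1024))
print(f"1-gram coverage: {token_coverage(ids, vocab_size=8192):.1%}")
```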

An Empirical Study of Autoregressive Pre-training from Videos
  • Preprint
  • File available

January 2025 · 6 Reads · Ilija Radosavovic · Rahul Ravishankar · [...]

We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at https://brjathu.github.io/toto/
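
A minimal sketch of the next-token prediction setup the abstract describes: discrete visual tokens fed to a causal transformer and trained with a cross-entropy loss on shifted targets. The `TinyAutoregressiveVideoModel` class, its sizes, and the random stand-in tokens are illustrative assumptions, not the actual Toto architecture or training recipe.

```python
import torch
import torch.nn as nn

class TinyAutoregressiveVideoModel(nn.Module):
    """Toy causal transformer over discrete visual tokens (illustrative sizes)."""
    def __init__(self, vocab_size=8192, d_model=256, n_layers=4, n_heads=8, max_len=4096):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (B, T) ids produced by an image/video tokenizer such as dVAE or VQGAN
        T = tokens.shape[1]
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        causal = torch.triu(torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        x = self.blocks(x, mask=causal)   # each position attends only to the past
        return self.head(x)               # (B, T, vocab) next-token logits

model = TinyAutoregressiveVideoModel()
tokens = torch.randint(0, 8192, (2, 256))          # stand-in visual tokens
logits = model(tokens[:, :-1])                     # predict token t+1 from tokens <= t
loss = nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   tokens[:, 1:].reshape(-1))
loss.backward()
```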


Learning Humanoid Locomotion over Challenging Terrain

October 2024 · 54 Reads

Humanoid robots can, in principle, use their legs to go almost anywhere. Developing controllers capable of traversing diverse terrains, however, remains a considerable challenge. Classical controllers are difficult to generalize broadly, while learning-based methods have primarily focused on gentle terrains. Here, we present a learning-based approach for blind humanoid locomotion capable of traversing challenging natural and man-made terrain. Our method uses a transformer model to predict the next action based on the history of proprioceptive observations and actions. The model is first pre-trained on a dataset of flat-ground trajectories with sequence modeling, and then fine-tuned on uneven terrain using reinforcement learning. We evaluate our model on a real humanoid robot across a variety of terrains, including rough, deformable, and sloped surfaces. The model demonstrates robust performance, in-context adaptation, and emergent terrain representations. In real-world case studies, our humanoid robot successfully traversed over 4 miles of hiking trails in Berkeley and climbed some of the steepest streets in San Francisco.
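
A minimal sketch of the first (sequence-modeling) stage under stated assumptions: a policy that maps a short observation-action history to the next action, regressed on logged flat-ground trajectories before RL fine-tuning. The dimensions, context length, and `HistoryPolicy` network are hypothetical; the paper uses a transformer rather than this MLP stand-in.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, CTX = 48, 12, 16   # proprioception dim, action dim, history length (assumed)

class HistoryPolicy(nn.Module):
    """Predict the next action from a short observation-action history."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(CTX * (OBS_DIM + ACT_DIM), 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, ACT_DIM),
        )

    def forward(self, obs_hist, act_hist):
        # obs_hist: (B, CTX, OBS_DIM), act_hist: (B, CTX, ACT_DIM)
        x = torch.cat([obs_hist, act_hist], dim=-1).flatten(1)
        return self.net(x)

policy = HistoryPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

# Stand-in batch of flat-ground trajectory windows and their next actions.
obs_hist = torch.randn(64, CTX, OBS_DIM)
act_hist = torch.randn(64, CTX, ACT_DIM)
next_act = torch.randn(64, ACT_DIM)

pred = policy(obs_hist, act_hist)
loss = nn.functional.mse_loss(pred, next_act)   # sequence-modeling objective on logged data
loss.backward()
opt.step()
# A second stage would fine-tune this policy with reinforcement learning on uneven terrain.
```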


Ego4D: Around the World in 3,000 Hours of Egocentric Video

July 2024 · 130 Reads · 130 Citations
IEEE Transactions on Pattern Analysis and Machine Intelligence

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards, with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/





Real-world humanoid locomotion with reinforcement learning

April 2024 · 80 Reads · 59 Citations
Science Robotics

Humanoid robots that can autonomously operate in diverse environments have the potential to help address labor shortages in factories, assist the elderly at home, and colonize new planets. Although classical controllers for humanoid robots have shown impressive results in a number of settings, they are challenging to generalize and adapt to new environments. Here, we present a fully learning-based approach for real-world humanoid locomotion. Our controller is a causal transformer that takes the history of proprioceptive observations and actions as input and predicts the next action. We hypothesized that the observation-action history contains useful information about the world that a powerful transformer model can use to adapt its behavior in context, without updating its weights. We trained our model with large-scale model-free reinforcement learning on an ensemble of randomized environments in simulation and deployed it to the real world zero-shot. Our controller could walk over various outdoor terrains, was robust to external disturbances, and could adapt in context.
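
A minimal sketch of per-episode environment randomization in the spirit of the "ensemble of randomized environments" mentioned above. The parameter names and ranges are invented for illustration and are not taken from the paper.

```python
import random
from dataclasses import dataclass

@dataclass
class EnvParams:
    """Hypothetical physics parameters resampled for each training episode."""
    ground_friction: float
    payload_mass_kg: float
    push_force_n: float
    motor_strength_scale: float

def sample_env_params() -> EnvParams:
    # Illustrative ranges; a real setup would randomize many more quantities.
    return EnvParams(
        ground_friction=random.uniform(0.4, 1.2),
        payload_mass_kg=random.uniform(0.0, 5.0),
        push_force_n=random.uniform(0.0, 50.0),
        motor_strength_scale=random.uniform(0.8, 1.2),
    )

for episode in range(3):
    params = sample_env_params()           # new physics every episode
    print(f"episode {episode}: {params}")  # the simulator would be configured with these
```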


Robot Learning with Sensorimotor Pre-training

June 2023 · 36 Reads

We present a self-supervised sensorimotor pre-training approach for robotics. Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens. Given a sequence of camera images, proprioceptive robot states, and past actions, we encode the interleaved sequence into tokens, mask out a random subset, and train a model to predict the masked-out content. We hypothesize that if the robot can predict the missing content, it has acquired a good model of the physical world that can enable it to act. RPT is designed to operate on latent visual representations, which makes prediction tractable, enables scaling to 10x larger models, and supports 10 Hz inference on a real robot. To evaluate our approach, we collect a dataset of 20,000 real-world trajectories over 9 months using a combination of motion planning and model-based grasping algorithms. We find that pre-training on this data consistently outperforms training from scratch, leads to 2x improvements in the block stacking task, and has favorable scaling properties.
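
A minimal sketch of the masked-prediction idea under stated assumptions: interleave per-timestep vision, proprioception, and action tokens, hide a random subset, and reconstruct the hidden content. The shapes, mask ratio, and linear stand-in for the encoder are illustrative, not RPT's actual design.

```python
import torch

B, T, D = 8, 10, 32                      # batch, timesteps, token dim (assumed)
vision = torch.randn(B, T, D)            # latent visual tokens
proprio = torch.randn(B, T, D)           # proprioceptive state tokens
actions = torch.randn(B, T, D)           # past-action tokens

# Interleave the three modalities per timestep into one sequence of length 3*T.
tokens = torch.stack([vision, proprio, actions], dim=2).reshape(B, 3 * T, D)

mask = torch.rand(B, 3 * T) < 0.5        # hide an (illustrative) 50% of positions
inputs = tokens.masked_fill(mask.unsqueeze(-1), 0.0)

# The real model would be a Transformer over the interleaved sequence; a linear
# layer stands in here so the snippet runs end to end.
model = torch.nn.Linear(D, D)
pred = model(inputs)
loss = (pred - tokens)[mask].pow(2).mean()   # reconstruction error on masked positions only
loss.backward()
```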


Fig. 3: External disturbance. We include carrying constant loads and withstanding external forces. See also Figure 1.
Learning Humanoid Locomotion with Transformers

March 2023 · 285 Reads · 1 Citation

We present a sim-to-real learning-based approach for real-world humanoid locomotion. Our controller is a causal Transformer trained by autoregressive prediction of future actions from the history of observations and actions. We hypothesize that the observation-action history contains useful information about the world that a powerful Transformer model can use to adapt its behavior in-context, without updating its weights. We do not use state estimation, dynamics models, trajectory optimization, reference trajectories, or pre-computed gait libraries. Our controller is trained with large-scale model-free reinforcement learning on an ensemble of randomized environments in simulation and deployed to the real world in a zero-shot fashion. We evaluate our approach in high-fidelity simulation and successfully deploy it to the real robot as well. To the best of our knowledge, this is the first demonstration of a fully learning-based method for real-world full-sized humanoid locomotion.


Learning to Imitate Object Interactions from Internet Videos

November 2022 · 41 Reads

We study the problem of imitating object interactions from Internet videos. This requires understanding the hand-object interactions in 4D, spatially in 3D and over time, which is challenging due to mutual hand-object occlusions. In this paper we make two main contributions: (1) a novel reconstruction technique RHOV (Reconstructing Hands and Objects from Videos), which reconstructs 4D trajectories of both the hand and the object using 2D image cues and temporal smoothness constraints; (2) a system for imitating object interactions in a physics simulator with reinforcement learning. We apply our reconstruction technique to 100 challenging Internet videos. We further show that we can successfully imitate a range of different object interactions in a physics simulator. Our object-centric approach is not limited to human-like end-effectors and can learn to imitate object interactions using different embodiments, like a robotic arm with a parallel jaw gripper.
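
A minimal sketch of a temporal-smoothness term of the kind the abstract alludes to, assuming a per-frame pose trajectory optimized jointly against image evidence. The pose parameterization, the stand-in data term, and the weighting are illustrative only.

```python
import torch

T = 100
# Per-frame object pose (e.g. translation + quaternion), optimized over the video.
poses = torch.randn(T, 7, requires_grad=True)

def smoothness_loss(x):
    """Penalize large frame-to-frame changes (first differences along time)."""
    return (x[1:] - x[:-1]).pow(2).mean()

# Stand-in for per-frame 2D image cues (reprojection / mask terms in a real system).
image_evidence_loss = poses.pow(2).mean()

loss = image_evidence_loss + 10.0 * smoothness_loss(poses)
loss.backward()   # gradients flow to the full 4D trajectory
```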


Citations (17)


... While no prior work is dedicated to direct 4D hand trajectory prediction, common practices often involve first predicting per-frame 3D hand poses, 'lifting' them to a world coordinate system, followed by test-time optimization [12,27,51]. The de-facto lift method uses a weak-to-full perspective transformation (Weak2Full) [53,58], which places the predicted hand at a certain distance given the camera intrinsics and predicted scale (Sec. 3.1). ...

Reference:

Predicting 4D Hand Trajectory from Monocular Videos
Reconstructing Hands in 3D with Transformers
  • Citing Conference Paper
  • June 2024

... Imitation learning (IL) is a widely adopted approach for training robot policies from human demonstrations [1,2]. However, even state-of-the-art IL policies can in some cases require on the order of hundreds [3,4] or up to tens of thousands [5][6][7] of crowdsourced, teleoperated demonstrations in order to achieve decent performance. These demonstrations are typically collected by teleoperating robots via virtual reality devices [8] or puppeteering interfaces [4,9,10]. ...

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
  • Citing Conference Paper
  • July 2024

... Imitation learning (IL) is a widely adopted approach for training robot policies from human demonstrations [1,2]. However, even state-of-the-art IL policies can in some cases require on the order of hundreds [3,4] or up to tens of thousands [5][6][7] of crowdsourced, teleoperated demonstrations in order to achieve decent performance. These demonstrations are typically collected by teleoperating robots via virtual reality devices [8] or puppeteering interfaces [4,9,10]. ...

Open X-Embodiment: Robotic Learning Datasets and RT-X Models
  • Citing Conference Paper
  • May 2024

... Recognising hand-object interactions. Action recognition is one of the most actively researched areas in computer vision (Jhuang et al. 2013;Varol, Laptev, and Schmid 2017;Kantorov and Laptev 2014) and significant progress has been made with the availability of large-scale datasets (Sigurdsson et al. 2016;Kay et al. 2017;Karpathy et al. 2014;Grauman et al. 2022). Here, we focus on methods (Tekin, Bogo, and Pollefeys 2019;Yang et al. 2020) that simultaneously estimate hand(-object) poses and interactions from egocentric videos. ...

Ego4D: Around the World in 3,000 Hours of Egocentric Video

IEEE Transactions on Pattern Analysis and Machine Intelligence

... RL algorithms are widely used across classic tasks including locomotion, navigation, and manipulation, among others. Indeed, recent years have seen unprecedented improvements in the abilities of quadruped robots [1], wheeled-legged robots [2], drone racing [3], humanoids [4], and bipedal robot sports [2], as well as in the automation of machinery such as hydraulic excavators [5]. ...

Real-world humanoid locomotion with reinforcement learning
  • Citing Article
  • April 2024

Science Robotics

... The focus of this work is dense long-term anticipation. Unlike research directions that frame anticipation as an ordered [12,33,3,6,61,34] or unordered [37,36,64,61] duration-agnostic transcript prediction problem, dense anticipation requires future actions to be predicted for a predefined number of future frames. This involves estimating both the order of actions and their durations. ...

Ego4D: Around the World in 3,000 Hours of Egocentric Video
  • Citing Conference Paper
  • June 2022

... If a model can achieve such outcomes, we can use the model to generate novel images by first sampling multivariate Gaussian noise and then iteratively removing from the current state of the image the noise predicted by our model. This classic formulation of DDPMs has achieved significant results in the space of image generation (Rombach et al. (2022)), audio synthesis (Kong et al. (2020)), and even meta-learning by learning how to conditionally generate neural network checkpoints (Peebles et al. (2022)). Furthermore, such an approach to generative modeling has expanded its reach to encompass scientific disciplines such as computational biology (Anand & Achim (2022)), computational chemistry (…), and even computational physics (Mudur & Finkbeiner (2022)). ...

Learning to Learn with Generative Models of Neural Network Checkpoints
  • Citing Preprint
  • September 2022
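
For reference, a minimal sketch of the DDPM-style sampling loop the snippet above describes: start from Gaussian noise and iteratively subtract the model's noise prediction. The noise schedule and the placeholder epsilon-predictor are assumptions for illustration, not any cited paper's implementation.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # illustrative linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def noise_model(x, t):
    """Placeholder epsilon-predictor; a real sampler would use a trained network."""
    return torch.zeros_like(x)

x = torch.randn(1, 3, 32, 32)                # start from pure Gaussian noise
for t in reversed(range(T)):
    eps = noise_model(x, t)
    # Posterior mean of x_{t-1} given x_t and the predicted noise.
    mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
    noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
    x = mean + torch.sqrt(betas[t]) * noise  # one reverse diffusion step
```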

... While most existing research has primarily focused on single-hand (Ge et al. 2019; Boukhayma, Bem, and Torr 2019; Baek, Kim, and Kim 2019; Simon et al. 2017; Zimmermann and Brox 2017) or object (Li et al. 2018; Wang et al. 2019; Lepetit, Pilet, and Fua 2004; Zheng et al. 2022, 2024) understanding in isolation, recently there has been a surge of interest in the joint understanding of hand-object pose estimation. As the problem of reconstructing both hand and object is extremely ill-posed due to heavy mutual occlusions, many works (Cao et al. 2021; Tse et al. 2022b; Liu et al. 2021; Yang et al. 2021; Hampali et al. 2022; Yang et al. 2022) reduce this problem to 6D pose estimation with instance-specific templates. Meanwhile, some previous efforts (Hasson et al. 2019; Tse et al. 2022a; Ye, Gupta, and Tulsiani 2022; Chen et al. 2022; Ye et al. 2023a; Chen et al. 2023) do not assume access to ground-truth object models at test time and follow a template-free paradigm. ...

Reconstructing Hand-Object Interactions in the Wild
  • Citing Conference Paper
  • October 2021

... Policy Distillation. Policy distillation [6,7,12,15,35,39,41,42,49] provides an effective approach for transferring knowledge from high-performance policies to a single universal policy, promoting both model compactness and generalization across diverse tasks. ...

State-Only Imitation Learning for Dexterous Manipulation
  • Citing Conference Paper
  • September 2021