Vladlen Koltun’s research while affiliated with Intel and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (137)


Conformation Generation using Transformer Flows
  • Preprint

November 2024

Sohil Atul Shah · Vladlen Koltun

Estimating three-dimensional conformations of a molecular graph allows insight into the molecule's biological and chemical functions. Fast generation of valid conformations is thus central to molecular modeling. Recent advances in graph-based deep networks have accelerated conformation generation from hours to seconds. However, current network architectures do not scale well to large molecules. Here we present ConfFlow, a flow-based model for conformation generation based on transformer networks. In contrast with existing approaches, ConfFlow directly samples in the coordinate space without enforcing any explicit physical constraints. The generative procedure is highly interpretable and is akin to force field updates in molecular dynamics simulation. When applied to the generation of large molecule conformations, ConfFlow improves accuracy by up to 40% relative to state-of-the-art learning-based methods. The source code is made available at https://github.com/IntelLabs/ConfFlow.
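As a rough illustration of this sampling procedure, the sketch below iteratively refines atom coordinates with transformer blocks in PyTorch. It is our own simplification, not ConfFlow's architecture, and every class and parameter name is hypothetical; see the linked repository for the actual model.

    import torch
    import torch.nn as nn

    class CoordinateUpdateBlock(nn.Module):
        # One flow step: a transformer reads the current atom coordinates
        # and predicts a per-atom displacement, loosely analogous to a
        # force field update in molecular dynamics.
        def __init__(self, d_model=128, n_heads=8):
            super().__init__()
            self.embed = nn.Linear(3, d_model)    # lift xyz into feature space
            self.attn = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.to_disp = nn.Linear(d_model, 3)  # features back to a displacement

        def forward(self, coords):                # coords: (batch, atoms, 3)
            h = self.attn(self.embed(coords))
            return coords + self.to_disp(h)       # residual coordinate update

    blocks = nn.ModuleList(CoordinateUpdateBlock() for _ in range(4))
    coords = torch.randn(2, 30, 3)                # 2 molecules, 30 atoms each
    for block in blocks:
        coords = block(coords)                    # iterative, MD-style refinement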


Cut Your Losses in Large-Vocabulary Language Models

November 2024

Erik Wijmans · Brody Huval · Alexander Hertzberg · [...] · Philipp Krähenbühl

As language models grow ever larger, so do their vocabularies. This has shifted the memory footprint of LLMs during training disproportionately to one single layer: the cross-entropy in the loss computation. Cross-entropy builds up a logit matrix with entries for each pair of input tokens and vocabulary items and, for small models, consumes an order of magnitude more memory than the rest of the LLM combined. We propose Cut Cross-Entropy (CCE), a method that computes the cross-entropy loss without materializing the logits for all tokens into global memory. Rather, CCE only computes the logit for the correct token and evaluates the log-sum-exp over all logits on the fly. We implement a custom kernel that performs the matrix multiplications and the log-sum-exp reduction over the vocabulary in flash memory, making global memory consumption for the cross-entropy computation negligible. This has a dramatic effect. Taking the Gemma 2 (2B) model as an example, CCE reduces the memory footprint of the loss computation from 24 GB to 1 MB, and the total training-time memory consumption of the classifier head from 28 GB to 1 GB. To improve the throughput of CCE, we leverage the inherent sparsity of softmax and propose to skip elements of the gradient computation that have a negligible (i.e., below numerical precision) contribution to the gradient. Experiments demonstrate that the dramatic reduction in memory consumption is accomplished without sacrificing training speed or convergence.
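The core idea, separate from the paper's fused kernel, fits in a few lines of PyTorch: compute exact logits only for the correct tokens, and stream the log-sum-exp over vocabulary chunks so the full (tokens x vocabulary) logit matrix is never materialized. The sketch below is our illustration, with hypothetical names; CCE proper performs these reductions inside a custom kernel so even the per-chunk buffer never reaches global memory.

    import torch

    def chunked_cross_entropy(hidden, weight, targets, chunk=8192):
        # hidden: (T, D) token embeddings; weight: (V, D) classifier head;
        # targets: (T,) correct-token ids. Peak memory is O(T * chunk)
        # instead of O(T * V).
        lse = torch.full((hidden.shape[0],), float("-inf"), device=hidden.device)
        for start in range(0, weight.shape[0], chunk):
            logits = hidden @ weight[start:start + chunk].T      # (T, chunk)
            lse = torch.logaddexp(lse, torch.logsumexp(logits, dim=-1))
        correct = (hidden * weight[targets]).sum(-1)             # true logits only
        return (lse - correct).mean()                            # mean NLL

    hidden = torch.randn(16, 64)
    weight = torch.randn(50_000, 64)
    targets = torch.randint(0, 50_000, (16,))
    loss = chunked_cross_entropy(hidden, weight, targets)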


Does Spatial Cognition Emerge in Frontier Models?

October 2024 · 7 Reads

Not yet. We present SPACE, a benchmark that systematically evaluates spatial cognition in frontier models. Our benchmark builds on decades of research in cognitive science. It evaluates large-scale mapping abilities that are brought to bear when an organism traverses physical environments, smaller-scale reasoning about object shapes and layouts, and cognitive infrastructure such as spatial attention and memory. For many tasks, we instantiate parallel presentations via text and images, allowing us to benchmark both large language models and large multimodal models. Results suggest that contemporary frontier models fall short of the spatial intelligence of animals, performing near chance level on a number of classic tests of animal cognition.


Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

October 2024 · 17 Reads

We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image. Extensive experiments analyze specific design choices and demonstrate that Depth Pro outperforms prior work along multiple dimensions. We release code and weights at https://github.com/apple/ml-depth-pro.



Domain Generalization without Excess Empirical Risk

August 2023 · 5 Reads

Given data from diverse sets of distinct distributions, domain generalization aims to learn models that generalize to unseen distributions. A common approach is designing a data-driven surrogate penalty to capture generalization and minimize the empirical risk jointly with the penalty. We argue that a significant failure mode of this recipe is an excess risk due to an erroneous penalty or hardness in joint optimization. We present an approach that eliminates this problem. Instead of jointly minimizing empirical risk with the penalty, we minimize the penalty under the constraint of optimality of the empirical risk. This change guarantees that the domain generalization penalty cannot impair optimization of the empirical risk, i.e., in-distribution performance. To solve the proposed optimization problem, we demonstrate an exciting connection to rate-distortion theory and utilize its tools to design an efficient method. Our approach can be applied to any penalty-based domain generalization method, and we demonstrate its effectiveness by applying it to three exemplar methods from the literature, showing significant improvements.
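A minimal sketch of the constrained formulation, assuming a simple alternating heuristic in place of the paper's rate-distortion-based solver: take a gradient step on the empirical risk whenever it exceeds its reference optimum, and on the penalty otherwise. All names are hypothetical, and the stand-in penalty is illustrative only.

    import torch

    def constrained_step(model, erm_fn, penalty_fn, opt, erm_star, eps=0.01):
        # Crude projection onto "ERM stays near its optimum": restore
        # in-distribution performance first, then reduce the DG penalty.
        opt.zero_grad()
        erm = erm_fn(model)
        if erm.item() > erm_star + eps:
            erm.backward()
        else:
            penalty_fn(model).backward()
        opt.step()

    model = torch.nn.Linear(4, 2)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    X, y = torch.randn(64, 4), torch.randint(0, 2, (64,))
    erm_fn = lambda m: torch.nn.functional.cross_entropy(m(X), y)
    penalty_fn = lambda m: m.weight.norm()    # stand-in for a real DG penalty
    erm_star = erm_fn(model).item()           # reference optimum (illustrative)
    for _ in range(100):
        constrained_step(model, erm_fn, penalty_fn, opt, erm_star)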


An Extensible, Data-Oriented Architecture for High-Performance, Many-World Simulation

July 2023 · 23 Reads · 5 Citations

ACM Transactions on Graphics

Training AI agents to perform complex tasks in simulated worlds requires millions to billions of steps of experience. To achieve high performance, today's fastest simulators for training AI agents adopt the idea of batch simulation: using a single simulation engine to simultaneously step many environments in parallel. We introduce a framework for productively authoring novel training environments (including custom logic for environment generation, environment time stepping, and generating agent observations and rewards) that execute as high-performance, GPU-accelerated batched simulators. Our key observation is that the entity-component-system (ECS) design pattern, popular for expressing CPU-side game logic today, is also well-suited for providing the structure needed for high-performance batched simulators. We contribute the first fully GPU-accelerated ECS implementation that natively supports batch environment simulation. We demonstrate how ECS abstractions impose structure on a training environment's logic and state that allows the system to efficiently manage state, amortize work, and identify GPU-friendly coherent parallel computations within and across different environments. We implement several learning environments in this framework, and demonstrate GPU speedups of two to three orders of magnitude over open-source CPU baselines and 5-33× over strong baselines running on a 32-thread CPU. An implementation of the OpenAI hide and seek 3D environment written in our framework, which performs rigid body physics and ray tracing in each simulator step, achieves over 1.9 million environment steps per second on a single GPU.
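The role of ECS here can be pictured with a small structure-of-arrays sketch, written by us in NumPy rather than the paper's GPU framework: component data for every world lives in contiguous arrays spanning the whole batch, so each system becomes one coherent, batch-parallel computation. All names below are hypothetical.

    import numpy as np

    n_worlds, agents_per_world = 1024, 8
    n = n_worlds * agents_per_world

    # Component arrays (structure-of-arrays across all worlds at once).
    position = np.zeros((n, 3), dtype=np.float32)
    velocity = np.random.randn(n, 3).astype(np.float32)
    world_id = np.repeat(np.arange(n_worlds), agents_per_world)

    def movement_system(dt=0.1):
        # One vectorized kernel steps every environment in the batch.
        position[:] += velocity * dt

    def reward_system():
        # Per-world reduction: total agent speed in each world.
        speed = np.linalg.norm(velocity, axis=1)
        return np.bincount(world_id, weights=speed, minlength=n_worlds)

    movement_system()
    rewards = reward_system()                 # shape: (n_worlds,)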


Figure 1: (a) Adaptive Continual Memory (ACM) performs Memory.Retrieve and Memory.Insert operations on features of new incoming samples, extracted by a static, pretrained deep network. (b) Wall-clock time overhead of the ACM memory after feature extraction (x-axis is log-scaled) on a 16-core i7 CPU server. The longest observed overhead, using 256-dimensional embeddings, is 5 ms with 40 million samples in memory.
Online Continual Learning Without the Storage Constraint
  • Preprint
  • File available

May 2023 · 44 Reads

Online continual learning (OCL) research has primarily focused on mitigating catastrophic forgetting with fixed and limited storage allocation throughout the agent's lifetime. However, the growing affordability of data storage highlights a broad range of applications that do not adhere to these assumptions. In these cases, the primary concern lies in managing computational expenditures rather than storage. In this paper, we target such settings, investigating the online continual learning problem by relaxing storage constraints and emphasizing a fixed, limited economic budget. We provide a simple algorithm that can compactly store and utilize the entirety of the incoming data stream under tiny computational budgets using a kNN classifier and universal pre-trained feature extractors. Our algorithm provides a consistency property attractive to continual learning: it will never forget past seen data. We set a new state of the art on two large-scale OCL datasets: Continual LOCalization (CLOC), which has 39M images over 712 classes, and Continual Google Landmarks V2 (CGLM), which has 580K images over 10,788 classes -- beating methods under far higher computational budgets than ours in terms of both reducing catastrophic forgetting of past data and quickly adapting to rapidly changing data streams. We provide code to reproduce our results at https://github.com/drimpossible/ACM.
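A minimal sketch of this recipe, assuming brute-force rather than approximate nearest-neighbour search and with class and method names of our own choosing: features from a frozen pretrained backbone are appended to an ever-growing memory, and prediction is a k-nearest-neighbour majority vote.

    import numpy as np

    class ContinualKNN:
        def __init__(self, k=3):
            self.k = k
            self.feats, self.labels = [], []

        def insert(self, feat, label):
            # Storage only grows, so past data is never forgotten.
            self.feats.append(feat)
            self.labels.append(label)

        def retrieve(self, feat):
            X = np.stack(self.feats)
            d = np.linalg.norm(X - feat, axis=1)     # brute-force distances
            votes = [self.labels[i] for i in np.argsort(d)[: self.k]]
            return max(set(votes), key=votes.count)  # majority vote

    memory = ContinualKNN(k=3)
    rng = np.random.default_rng(0)
    for label in (0, 0, 1, 1, 1):                    # toy feature stream
        memory.insert(rng.normal(size=256).astype(np.float32) + label, label)
    print(memory.retrieve(rng.normal(size=256).astype(np.float32) + 1))  # expected: 1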


Drinking From a Firehose: Continual Learning With Web-Scale Natural Language

October 2022 · 16 Reads · 23 Citations

IEEE Transactions on Pattern Analysis and Machine Intelligence

Continual learning systems will interact with humans, with each other, and with the physical world through time - and continue to learn and adapt as they do. An important open problem for continual learning is the lack of a large-scale benchmark that enables realistic evaluation of algorithms. In this paper, we study a natural setting for continual learning on a massive scale. We introduce the problem of personalized online language learning (POLL), which involves fitting personalized language models to a population of users that evolves over time. To facilitate research on POLL, we collect massive datasets of Twitter posts. These datasets, Firehose10M and Firehose100M, comprise 100 million tweets, posted by one million users over six years. Enabled by the Firehose datasets, we present a rigorous evaluation of continual learning algorithms on an unprecedented scale. Based on this analysis, we develop a simple algorithm for continual gradient descent (ConGraD) that outperforms prior continual learning methods on the Firehose datasets as well as earlier benchmarks. Collectively, the POLL problem setting, the Firehose datasets, and the ConGraD algorithm enable a complete benchmark for reproducible research on web-scale continual learning.


ASH: A Modern Framework for Parallel Spatial Hashing in 3D Perception

October 2022 · 24 Reads · 16 Citations

IEEE Transactions on Pattern Analysis and Machine Intelligence

We present ASH, a modern and high-performance framework for parallel spatial hashing on GPU. Compared to existing GPU hash map implementations, ASH achieves higher performance, supports richer functionality, and requires fewer lines of code (LoC) when used for implementing spatially varying operations from volumetric geometry reconstruction to differentiable appearance reconstruction. Unlike existing GPU hash maps, the ASH framework provides a versatile tensor interface, hiding low-level details from the users. In addition, by decoupling the internal hashing data structures and key-value data in buffers, we offer direct access to spatially varying data via indices, enabling seamless integration into modern libraries such as PyTorch. To achieve this, we 1) detach stored key-value data from the low-level hash map implementation; 2) bridge the pointer-first low-level data structures to index-first high-level tensor interfaces via an index heap; 3) adapt both generic and non-generic integer-only hash map implementations as backends to operate on multi-dimensional keys. We first profile our hash map against state-of-the-art hash maps on synthetic data to show the performance gain from this architecture. We then show that ASH can consistently achieve higher performance on various large-scale 3D perception tasks with fewer LoC by showcasing several applications, including 1) point cloud voxelization, 2) retargetable volumetric scene reconstruction, 3) non-rigid point cloud registration and volumetric deformation, and 4) spatially varying geometry and appearance refinement. ASH and its example applications are open-sourced in Open3D (http://www.open3d.org).
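The index-first design can be pictured with a short NumPy sketch of our own (it does not use Open3D's actual API): points quantize to integer voxel keys, deduplication yields buffer indices in the spirit of ASH's index heap, and per-voxel values live in an ordinary array that libraries like PyTorch can consume directly.

    import numpy as np

    voxel_size = 0.05
    points = np.random.rand(10000, 3).astype(np.float32)

    # Quantize points to integer voxel coordinates (the hash-map keys).
    keys = np.floor(points / voxel_size).astype(np.int32)

    # Deduplicate keys; buf_indices maps every point to its voxel's slot
    # in a dense value buffer, mimicking the index heap.
    unique_keys, buf_indices = np.unique(keys, axis=0, return_inverse=True)

    # Values are stored in a plain array, addressable by index.
    values = np.zeros(len(unique_keys), dtype=np.float32)
    np.add.at(values, buf_indices, 1.0)       # per-voxel point counts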


Citations (49)


... Our experiment aims to demonstrate that an LLM, even without fine-tuning, improves the generalization capability of an end-to-end model. Inspired by OpenBot [50,51] and their early implementation [52], we conducted experiments in a real-world setting on a self-driving robot that integrates OpenBot on a commercial off-the-shelf RC vehicle with a smartphone as the embedded system onboard. We chose the iPhone in our hardware implementation because of its excellent support for PyTorch. ...

Reference:

Generalizing End-To-End Autonomous Driving In Real-World Environments Using Zero-Shot LLMs
OpenBot-Fleet: A System for Collective Learning with Real Robots
  • Citing Conference Paper
  • May 2024

... General Multi-agent Simulation. While LLM agent simulation may appear superficially similar to general multi-agent simulations, such as those used in reinforcement learning (Brockman et al., 2016; Zhu et al., 2024; Shacklett et al., 2023) and multi-agent processing (Emau et al., 2011), they present fundamentally different scheduling challenges due to the high per-agent computational demands and significant workload imbalances inherent in LLM execution, as discussed in §2.2. These unique demands necessitate specialized scheduling strategies. ...

An Extensible, Data-Oriented Architecture for High-Performance, Many-World Simulation
  • Citing Article
  • July 2023

ACM Transactions on Graphics

... Since continual learning arises in a wide range of scenarios [21, 55-57] for deep neural networks, the related research is quite extensive. In addition to work on the innovation and application of continual learning methods, analysis [35, 37] and overview [32, 34] works have also attracted much attention. ...

Drinking From a Firehose: Continual Learning With Web-Scale Natural Language
  • Citing Article
  • October 2022

IEEE Transactions on Pattern Analysis and Machine Intelligence

... 3DGS [22] and open-vocabulary 3D instance retrieval (OVIR-3D) [13] are adopted for reconstruction tasks. The ASH framework for parallel spatial hashing [2] is used for scene geometry acquisition and noise point pruning of 3DGS. In the task generation module (Section III-C), surface uncertainty is obtained through 3DGS, and reconstruction instances are acquired through OVIR-3D segmentation. ...

ASH: A Modern Framework for Parallel Spatial Hashing in 3D Perception
  • Citing Article
  • October 2022

IEEE Transactions on Pattern Analysis and Machine Intelligence

... To address the limited availability of paired low-light and normal-light image datasets, approaches such as [1] and [37] generate synthetic low-light images derived from well-lit images. While synthetic data is more abundant than real paired datasets, its use in supervised training constrains model generalization to unseen, real-world dark images. ...

Dancing under the stars: video denoising in starlight
  • Citing Conference Paper
  • June 2022

... Since the proposed Progressive Domain Adaptation TIR tracking framework requires pairs of TIR samples for training, the collected TIR dataset does not contain label information. To this end, we utilized three kinds of methods, including Dynamic Programming-based recognition [37], the UniDet object-detection model [19], and the Segment Anything Model [20], to generate pseudo-label pairs. As shown in Figure 3, we compared three methods for obtaining numerous potential target areas within TIR training data. ...

Simple Multi-dataset Detection
  • Citing Conference Paper
  • June 2022

... large-scale data with pretext tasks [63]. However, while diverse types of sensors [16,19,31,42,43,71,76,88] are applied in various domains in the world, e.g., medical imaging, robotics, and fundamental science, not all of them benefit from the development of foundation models. This is because it is challenging for other sensors [54,83] to collect large-scale training data like natural images, as shown in Figure 1. ...

Shape from Polarization for Complex Scenes in the Wild
  • Citing Conference Paper
  • June 2022

... Its core lies in the ability to calculate gradients through backpropagation, inspired by neural networks [30]. It has been applied to tasks like soft robot control [31], [32], material parameter estimation [33], [34], and accelerating reinforcement learning [35], [36]. Among other discretizations, the material point method [37], [38] has enjoyed differentiable formulations and implementations, e.g. ...

Differentiable Simulation of Soft Multi-body Systems
  • Citing Preprint
  • May 2022

... Synthetic data generation creates large-scale, automatically labeled datasets through computer graphics and simulation. Methods have advanced from basic 3D rendering [247] to sophisticated approaches incorporating domain randomization [248], physics-based rendering [249], and generative models [250]. Vijay et al. [251] generated 2,000 synthetic accident videos from multiple perspectives using gaming platforms, while Richter et al. [250] enhanced synthetic traffic scene realism through multi-level adversarial training. ...

Enhancing Photorealism Enhancement

IEEE Transactions on Pattern Analysis and Machine Intelligence