Demetri Terzopoulos’s research while affiliated with University of California, Los Angeles and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping.

Publications (388)


Figure 1. Visual results generated by Wonderland. Given a single image, Wonderland reconstructs 3D scenes from the latent space of a camera-guided video diffusion model in a feed-forward manner.
Wonderland: Navigating 3D Scenes from a Single Image
  • Preprint
  • File available

December 2024 · 11 Reads

Hanwen Liang · Junli Cao · Vidit Goel · [...] · Jian Ren

This paper addresses a challenging question: How can we efficiently create high-quality, wide-scope 3D scenes from a single arbitrary image? Existing methods face several constraints, such as requiring multi-view data, time-consuming per-scene optimization, low visual quality in backgrounds, and distorted reconstructions in unseen areas. We propose a novel pipeline to overcome these limitations. Specifically, we introduce a large-scale reconstruction model that uses latents from a video diffusion model to predict 3D Gaussian Splattings for the scenes in a feed-forward manner. The video diffusion model is designed to create videos precisely following specified camera trajectories, allowing it to generate compressed video latents that contain multi-view information while maintaining 3D consistency. We train the 3D reconstruction model to operate on the video latent space with a progressive training strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations across various datasets demonstrate that our model significantly outperforms existing methods for single-view 3D scene generation, particularly with out-of-domain images. For the first time, we demonstrate that a 3D reconstruction model can be effectively built upon the latent space of a diffusion model to realize efficient 3D scene generation.
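
As a concrete illustration of "feed-forward prediction from video latents," the sketch below shows the general shape of such a model: compressed video-diffusion latents go in, 3D Gaussian splat parameters come out in a single forward pass. It is a minimal sketch under assumed shapes and module names (LatentToGaussians, the 14-parameter splat layout), not the authors' architecture.

```python
# Hedged sketch, not the Wonderland code; shapes and modules are assumptions.
import torch
import torch.nn as nn

class LatentToGaussians(nn.Module):
    """Maps camera-guided video-diffusion latents to 3D Gaussian parameters."""
    def __init__(self, latent_dim=16, num_gaussians=4096):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(latent_dim, 64, kernel_size=3, padding=1),
            nn.GELU(),
            nn.AdaptiveAvgPool3d((4, 8, 8)),
            nn.Flatten(),
        )
        # Each Gaussian: 3 (mean) + 3 (scale) + 4 (rotation quat) + 1 (opacity) + 3 (color) = 14
        self.head = nn.Linear(64 * 4 * 8 * 8, num_gaussians * 14)
        self.num_gaussians = num_gaussians

    def forward(self, video_latents):
        # video_latents: (B, C, T, H, W) compressed latents from the video model
        feats = self.backbone(video_latents)
        params = self.head(feats).view(-1, self.num_gaussians, 14)
        return params  # splat parameters, decoded in one feed-forward pass

model = LatentToGaussians()
latents = torch.randn(1, 16, 8, 32, 32)   # dummy stand-in for diffusion latents
gaussians = model(latents)
print(gaussians.shape)  # torch.Size([1, 4096, 14])
```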


TeamCraft: A Benchmark for Multi-Modal Multi-Agent Systems in Minecraft

December 2024 · 15 Reads

Collaboration is a cornerstone of society. In the real world, human teammates make use of multi-sensory data to tackle challenging tasks in ever-changing environments. It is essential for embodied agents collaborating in visually-rich environments replete with dynamic interactions to understand multi-modal observations and task specifications. To evaluate the performance of generalizable multi-modal collaborative agents, we present TeamCraft, a multi-modal multi-agent benchmark built on top of the open-world video game Minecraft. The benchmark features 55,000 task variants specified by multi-modal prompts, procedurally-generated expert demonstrations for imitation learning, and carefully designed protocols to evaluate model generalization capabilities. We also perform extensive analyses to better understand the limitations and strengths of existing approaches. Our results indicate that existing models continue to face significant challenges in generalizing to novel goals, scenes, and unseen numbers of agents. These findings underscore the need for further research in this area. The TeamCraft platform and dataset are publicly available at https://github.com/teamcraft-bench/teamcraft.
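
To make the evaluation protocol concrete, here is a minimal, hypothetical evaluation loop in the spirit of the benchmark: each task variant issues a multi-modal prompt, a team of agents acts until the episode ends, and success is aggregated across variants. The ToyTask class and agent interface are stand-ins; the actual TeamCraft API is in the linked repository.

```python
# Hypothetical sketch; see https://github.com/teamcraft-bench/teamcraft for the real API.
import random

class ToyTask:
    """Stand-in for a TeamCraft task variant: multi-modal prompt + team of agents."""
    def __init__(self, num_agents=2, horizon=10):
        self.num_agents, self.horizon, self.t = num_agents, horizon, 0

    def reset(self):
        self.t = 0
        return {"prompt": "build a 2x2 stone wall", "frames": [None] * self.num_agents}

    def step(self, actions):
        self.t += 1
        done = self.t >= self.horizon
        success = done and random.random() > 0.5   # placeholder success criterion
        return {"frames": [None] * self.num_agents}, done, success

def run_episode(agent_fn, task):
    obs, done, success = task.reset(), False, False
    while not done:
        obs, done, success = task.step(agent_fn(obs))  # one action per agent
    return int(success)

def evaluate(agent_fn, tasks):
    return sum(run_episode(agent_fn, t) for t in tasks) / len(tasks)

print(evaluate(lambda obs: ["noop", "noop"], [ToyTask() for _ in range(20)]))
```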


Inverse Attention Agent for Multi-Agent System

October 2024 · 29 Reads

A major challenge for Multi-Agent Systems is enabling agents to adapt dynamically to diverse environments in which opponents and teammates may continually change. Agents trained using conventional methods tend to excel only within the confines of their training cohorts; their performance drops significantly when confronting unfamiliar agents. To address this shortcoming, we introduce Inverse Attention Agents that adopt concepts from the Theory of Mind, implemented algorithmically using an attention mechanism and trained in an end-to-end manner. Crucial to determining the final actions of these agents, the weights in their attention model explicitly represent attention to different goals. We furthermore propose an inverse attention network that deduces the ToM of agents based on observations and prior actions. The network infers the attentional states of other agents, thereby refining the attention weights to adjust the agent's final action. We conduct experiments in a continuous environment, tackling demanding tasks encompassing cooperation, competition, and a blend of both. They demonstrate that the inverse attention network successfully infers the attention of other agents, and that this information improves agent performance. Additional human experiments show that, compared to baseline agent models, our inverse attention agents exhibit superior cooperation with humans and better emulate human behaviors.
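
The sketch below illustrates the two-stage idea in miniature: one network produces attention weights over goals from an agent's own observation, and an "inverse" network infers another agent's attention from that agent's observed state and action, which is then used to refine the first agent's weights. All dimensions and the combination rule are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch of attention-over-goals plus an inverse-attention network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GoalAttention(nn.Module):
    def __init__(self, obs_dim, num_goals, hidden=64):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, num_goals))

    def forward(self, obs):
        return F.softmax(self.scorer(obs), dim=-1)   # attention weights over goals

class InverseAttention(nn.Module):
    """Infers another agent's goal attention from its observed state and action."""
    def __init__(self, obs_dim, act_dim, num_goals, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_goals))

    def forward(self, other_obs, other_act):
        return F.softmax(self.net(torch.cat([other_obs, other_act], -1)), dim=-1)

obs_dim, act_dim, num_goals = 10, 4, 3
own, inv = GoalAttention(obs_dim, num_goals), InverseAttention(obs_dim, act_dim, num_goals)

w_self = own(torch.randn(1, obs_dim))
w_other = inv(torch.randn(1, obs_dim), torch.randn(1, act_dim))
# One simple way to refine one's own weights with the inferred attention:
w_final = F.normalize(w_self * w_other, p=1, dim=-1)
```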



Prompting Medical Large Vision-Language Models to Diagnose Pathologies by Visual Question Answering

July 2024 · 21 Reads

Large Vision-Language Models (LVLMs) have achieved significant success in recent years and have been extended to the medical domain. Although they demonstrate satisfactory performance on medical Visual Question Answering (VQA) tasks, Medical LVLMs (MLVLMs) suffer from the hallucination problem, which makes them fail to diagnose complex pathologies. Moreover, they readily fail to learn minority pathologies due to imbalanced training data. We propose two prompting strategies for MLVLMs that reduce hallucination and improve VQA performance. In the first strategy, we provide a detailed explanation of the queried pathology. In the second strategy, we fine-tune a cheap, weak learner to achieve high performance on a specific metric, and textually provide its judgment to the MLVLM. Tested on the MIMIC-CXR-JPG and CheXpert datasets, our methods significantly improve the diagnostic F1 score, with the highest increase being 0.27. We also demonstrate that our prompting strategies can be extended to general LVLM domains. Evaluated with POPE metrics, they effectively suppress the false negative predictions of existing LVLMs and improve Recall by approximately 0.07.
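
The two strategies translate naturally into prompt construction. The following sketch shows one plausible form; the prompt wording, the PATHOLOGY_EXPLANATIONS table, and the weak-learner interface are assumptions for illustration, not the paper's exact templates.

```python
# Illustrative prompt builders for the two strategies described above.

PATHOLOGY_EXPLANATIONS = {
    "atelectasis": "a partial or complete collapse of lung tissue, often seen as "
                   "increased opacity with volume loss on a chest radiograph",
}

def explanation_prompt(pathology):
    # Strategy 1: enrich the question with a detailed explanation of the pathology.
    return (f"Question: Does this chest X-ray show {pathology}? "
            f"Note: {pathology} is {PATHOLOGY_EXPLANATIONS[pathology]}.")

def weak_learner_prompt(pathology, weak_label, weak_score):
    # Strategy 2: append the judgment of a cheap classifier tuned for one metric.
    return (f"Question: Does this chest X-ray show {pathology}? "
            f"A specialized classifier predicts '{weak_label}' "
            f"(confidence {weak_score:.2f}). Consider this when answering.")

print(explanation_prompt("atelectasis"))
print(weak_learner_prompt("atelectasis", "positive", 0.81))
```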


Unstructured moving least squares material point methods: a stable kernel approach with continuous gradient reconstruction on general unstructured tessellations

Computational Mechanics

The material point method (MPM) is a hybrid Eulerian–Lagrangian simulation technique for solid mechanics with significant deformation. Structured background grids are commonly employed in the standard MPM, but they may give rise to several accuracy problems in handling complex geometries. When using (2D) unstructured triangular or (3D) tetrahedral background elements, however, significant challenges arise (e.g., cell-crossing error). Substantial numerical errors develop due to the inherent $\mathcal{C}^0$ continuity property of the interpolation function, which causes discontinuous gradients across element boundaries. Prior efforts in constructing $\mathcal{C}^1$ continuous interpolation functions have either not been adapted for unstructured grids or have only been applied to 2D triangular meshes. In this study, an unstructured moving least squares MPM (UMLS-MPM) is introduced to accommodate 2D and 3D simplex tessellations. The central idea is to incorporate a diminishing function into the sample weights of the MLS kernel, ensuring an analytically continuous velocity gradient estimation. Numerical analyses confirm the method's capability in mitigating cell-crossing inaccuracies and realizing expected convergence.
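
A 1D toy version of the central idea, for intuition: multiply the MLS sample weights by a smooth diminishing function that vanishes with zero slope at the support boundary, so the reconstructed gradient varies continuously as particles cross element boundaries. The cubic falloff used here is an assumed stand-in for the paper's kernel.

```python
# Illustrative 1D sketch of MLS reconstruction with a C^1 diminishing weight.
import numpy as np

def diminishing_weight(r, h):
    """C^1 weight: cubic falloff with zero value and zero slope at |r| = h."""
    q = np.clip(np.abs(r) / h, 0.0, 1.0)
    return (1.0 - q) ** 2 * (1.0 + 2.0 * q)   # w(h) = 0 and w'(h) = 0

def mls_value_and_gradient(x, nodes, values, h):
    """Weighted linear MLS fit around x; returns fitted value and gradient."""
    w = diminishing_weight(nodes - x, h)
    A = np.stack([np.ones_like(nodes), nodes - x], axis=1)   # linear basis
    M = A.T @ (w[:, None] * A)                               # moment matrix
    coeffs = np.linalg.solve(M, A.T @ (w * values))
    return coeffs[0], coeffs[1]   # value at x and its spatial derivative

nodes = np.linspace(0.0, 1.0, 11)
values = np.sin(2 * np.pi * nodes)
val, grad = mls_value_and_gradient(0.5, nodes, values, h=0.35)
print(val, grad)   # should approximate sin and its derivative at x = 0.5
```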


Cross-Slice Attention and Evidential Critical Loss for Uncertainty-Aware Prostate Cancer Detection

July 2024 · 15 Reads

Current deep learning-based models typically analyze medical images in either 2D or 3D, either disregarding volumetric information or suffering sub-optimal performance due to the anisotropic resolution of MR data. Furthermore, providing an accurate uncertainty estimate is beneficial to clinicians, as it indicates how confident a model is in its prediction. We propose a novel 2.5D cross-slice attention model that utilizes both global and local information, along with an evidential critical loss, to perform evidential deep learning for the detection in MR images of prostate cancer, one of the most common cancers and a leading cause of cancer-related death in men. We perform extensive experiments with our model on two different datasets and achieve state-of-the-art performance in prostate cancer detection along with improved epistemic uncertainty estimation. The implementation of the model is available at https://github.com/aL3x-O-o-Hung/GLCSA_ECLoss.
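
For readers unfamiliar with evidential deep learning, the following minimal sketch shows how a classifier's logits can be mapped to Dirichlet evidence and an explicit uncertainty score. It follows the standard evidential classification recipe rather than the paper's specific evidential critical loss.

```python
# Standard evidential-classification head; a stand-in, not the paper's loss.
import torch
import torch.nn.functional as F

def evidential_outputs(logits):
    """Convert logits to Dirichlet evidence, predicted probs, and uncertainty."""
    evidence = F.softplus(logits)           # non-negative evidence per class
    alpha = evidence + 1.0                  # Dirichlet concentration parameters
    strength = alpha.sum(-1, keepdim=True)  # total evidence + prior
    probs = alpha / strength                # expected class probabilities
    K = logits.shape[-1]
    uncertainty = K / strength              # epistemic uncertainty in (0, 1]
    return probs, uncertainty, alpha

logits = torch.tensor([[2.0, -1.0], [0.1, 0.0]])   # e.g. cancer vs. no-cancer
probs, u, alpha = evidential_outputs(logits)
print(probs, u.squeeze(-1))   # low evidence on row 2 -> higher uncertainty
```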




An Interactive Agent Foundation Model

February 2024 · 152 Reads · 1 Citation

The development of artificial intelligence systems is transitioning from creating static, task-specific models to dynamic, agent-based systems capable of performing well in a wide range of applications. We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents across a wide range of domains, datasets, and tasks. Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction, enabling a versatile and adaptable AI framework. We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare. Our model demonstrates its ability to generate meaningful and contextually relevant outputs in each area. The strength of our approach lies in its generality, leveraging a variety of data sources such as robotics sequences, gameplay data, large-scale video datasets, and textual information for effective multimodal and multi-task learning. Our approach provides a promising avenue for developing generalist, action-taking, multimodal systems.
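
Schematically, "unifying diverse pre-training strategies" amounts to optimizing a weighted sum of the three objectives named above. The sketch below is an assumed illustration of that combination (module names, shapes, and weights are hypothetical), not the released training code.

```python
# Hypothetical combined objective: masked autoencoding + LM + next-action prediction.
import torch
import torch.nn as nn

class UnifiedAgentLoss(nn.Module):
    def __init__(self, w_mae=1.0, w_lm=1.0, w_act=1.0):
        super().__init__()
        self.w = (w_mae, w_lm, w_act)
        self.mse = nn.MSELoss()
        self.ce = nn.CrossEntropyLoss()

    def forward(self, recon, target_patches, lm_logits, target_tokens,
                action_logits, target_actions):
        loss_mae = self.mse(recon, target_patches)          # visual masked autoencoding
        loss_lm = self.ce(lm_logits, target_tokens)         # language modeling
        loss_act = self.ce(action_logits, target_actions)   # next-action prediction
        return self.w[0] * loss_mae + self.w[1] * loss_lm + self.w[2] * loss_act

loss_fn = UnifiedAgentLoss()
loss = loss_fn(torch.randn(4, 196, 768), torch.randn(4, 196, 768),   # image patches
               torch.randn(4, 50257), torch.randint(0, 50257, (4,)), # text tokens
               torch.randn(4, 18), torch.randint(0, 18, (4,)))       # action ids
print(loss.item())
```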


Citations (52)


... Pan et al. (2024) propose agent simulation at a huge scale, increasing the number of agents to 10^6. Social games like Werewolf (Xu et al., 2024), Avalon (Lan et al., 2024), and Minecraft (Gong et al., 2024) have been attempted for MGAS simulation. Further, some game companies, like NetEase, are also actively experimenting with MGAS in their games. ...

Reference:

A Survey on Multi-Generative Agent System: Recent Advances and New Frontiers
MindAgent: Emergent Gaming Interaction
  • Citing Conference Paper
  • January 2024

... Incorporating information between nearby slices in volumetric data can be achieved in various ways. For example, CSA-Net [19] used pixel-level cross-slice attention to enhance the segmentation of a central slice, while CSAM [27] employed slice-level attention across feature maps at multiple scales. Both these methods followed a standard 2.5D approach by stacking neighboring slices. ...

CSAM: A 2.5D Cross-Slice Attention Module for Anisotropic Volumetric Medical Image Segmentation
  • Citing Conference Paper
  • January 2024

... These include studies on remote sensing applications [8][9][10], shadows [11,12], food [13], web pages [14], oil spills [15], and camouflaged objects [16]. SAM fine-tuning has a particular focus on medical image segmentation [17][18][19][20][21] due to the limited availability of data in this field. In general, updating all parameters of the SAM is a time-consuming process. ...

Refining boundaries of the segment anything model in medical images using an active contour model
  • Citing Conference Paper
  • April 2024

... The second approach [10][11] directly collects human demonstration data from real-world interactions with cloth, facing challenges such as the high cost of data-collection equipment. Therefore, to overcome the significant sim-to-real [12] gaps and reduce the dependency on costly real-world experimental setups for data collection, a new approach is essential that leverages more accessible and scalable methods of acquiring human demonstration data for robotic cloth manipulation. In the offline stage, human demonstration data is captured using hand-tracking techniques, and a neural network is trained and optimized iteratively. ...

Learning Neural Force Manifolds for Sim2Real Robotic Symmetrical Paper Folding
  • Citing Article
  • January 2024

IEEE Transactions on Automation Science and Engineering

... Several researchers have looked at the problem of using natural language as the interface between embodied agents, either in the form of task specifications [16,47,48,67], question answering [8,18,35,36], instruction following [2,12,13,23,40,41,56], or as means of task coordination [27,37]. VIMA-Bench [24] builds on previous efforts in language-guided robotic manipulation [38,46,63] and uses multi-modal prompts as uniform task specifications for object manipulation. ...

ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes
  • Citing Conference Paper
  • October 2023

... Mesh-based elasticity models have been a cornerstone in modeling dynamics within computer graphics [52]. One of the most commonly employed variants is the linear tetrahedral finite element method [7,9,16,49], which requires a robust surface-to-tetrahedron algorithm like TetWild [20]. ...

Elastically Deformable Models
  • Citing Chapter
  • August 2023

... The endpoints, sections, and intersections were classified, and the problem of overlapping was solved by a shallow neural network predicting connection probabilities between endpoints. The authors noted some shortcomings of FASTDLO in [21], e.g., that it struggled with complicated DLO scenes involving high curvature and many overlaps and intersections. In [21] and [22], more robust algorithms were developed for the real-time detection of DLOs. ...

mBEST: Realtime Deformable Linear Object Detection Through Minimal Bending Energy Skeleton Pixel Traversals
  • Citing Article
  • August 2023

IEEE Robotics and Automation Letters

... It has been applied in medicine for joint diagnosis and prognosis [27,28]. Haque et al. developed MULTIMIX [29], which jointly learns disease classification and lesion segmentation in a cautiously supervised manner. Harutyunyan et al. employed LSTM for multi-task joint training [30], demonstrating excellent performance on multiple patient outcome prediction tasks. ...

Generalized Multi-Task Learning from Substantially Unlabeled Multi-Source Medical Image Data
  • Citing Article
  • October 2021

The Journal of Machine Learning for Biomedical Imaging