Siyu Tang’s research while affiliated with ETH Zurich and other places


Publications (96)


Figure C.2: We overfit two SplatFormers on 20 scenes with partial 2D or direct 3D supervision. We show the training curves and the OOD-view rendering of a training example. Minimizing the 3D loss does not improve the PSNR of the 2D renderings. Unless it fits the 3D labels with 100% accuracy, the model with 3D supervision cannot remove artifacts in the 2D renderings.
Figure E.1: Failure Case. While our method effectively reduces artifacts in 3DGS (Kerbl et al., 2023) and outperforms SplatFields (Mihajlovic et al., 2024), it does not fully restore some high-frequency details. MipNeRF360 (Barron et al., 2022) excels in detail modeling but suffers from floating issues.
Figure F.2: Results on ShapeNet-OOD. We compare our method with baselines: SyncDreamer (Liu et al., 2023b), LaRa (Chen et al., 2024a), SSDNeRF (Chen et al., 2023), 3DGS (Kerbl et al., 2023), Nerfbusters (Warburg* et al., 2023), SplatFields (Mihajlovic et al., 2024), InstantSplat (Fan et al., 2024), 2DGS (Huang et al., 2024a), FSGS (Zhu et al., 2024), InstantNGP (Müller et al., 2022), and MipNeRF360 (Barron et al., 2022).
Figure F.4: Results on GSO-OOD. We compare SplatFormer, trained on Objaverse scenes, with Nerfbusters (Warburg* et al., 2023), 2DGS (Huang et al., 2024a) and MipNeRF360 (Barron et al., 2022).
OOD-NVS. Comparisons on the ShapeNet-OOD and Objaverse-OOD evaluation sets. The metrics are evaluated on OOD test views with elevation φ_ood ≥ 70°; colors indicate the 1st, 2nd, and 3rd best-performing models.


SplatFormer: Point Transformer for Robust 3D Gaussian Splatting
  • Preprint
  • File available

November 2024 · 13 Reads

Yutong Chen · Xiyi Chen · [...] · Siyu Tang

3D Gaussian Splatting (3DGS) has recently transformed photorealistic reconstruction, achieving high visual fidelity and real-time performance. However, rendering quality significantly deteriorates when test views deviate from the camera angles used during training, posing a major challenge for applications in immersive free-viewpoint rendering and navigation. In this work, we conduct a comprehensive evaluation of 3DGS and related novel view synthesis methods under out-of-distribution (OOD) test camera scenarios. By creating diverse test cases with synthetic and real-world datasets, we demonstrate that most existing methods, including those incorporating various regularization techniques and data-driven priors, struggle to generalize effectively to OOD views. To address this limitation, we introduce SplatFormer, the first point transformer model specifically designed to operate on Gaussian splats. SplatFormer takes as input an initial 3DGS set optimized under limited training views and refines it in a single forward pass, effectively removing potential artifacts in OOD test views. To our knowledge, this is the first successful application of point transformers directly on 3DGS sets, surpassing the limitations of previous multi-scene training methods, which could handle only a restricted number of input views during inference. Our model significantly improves rendering quality under extreme novel views, achieving state-of-the-art performance in these challenging scenarios and outperforming various 3DGS regularization techniques, multi-scene models tailored for sparse view synthesis, and diffusion-based frameworks.
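To make the idea concrete, here is a minimal sketch of the refinement step described above: each splat's attributes become one token, a transformer encoder attends across all splats of a scene, and the output is a residual update applied in a single forward pass. This is not the released SplatFormer (which builds on a point-transformer backbone); the attribute layout, dimensions, and plain self-attention encoder are assumptions for illustration.

```python
# Minimal sketch, not the authors' implementation: refine a set of 3D Gaussian
# splats with a transformer in one forward pass. Plain self-attention stands in
# for the point-transformer backbone; the attribute layout below is assumed.
import torch
import torch.nn as nn

SPLAT_DIM = 3 + 3 + 4 + 1 + 3  # position, log-scale, quaternion, opacity, base color (assumed layout)

class SplatRefiner(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(SPLAT_DIM, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, SPLAT_DIM)

    def forward(self, splats):              # splats: (B, N, SPLAT_DIM) from a per-scene 3DGS fit
        tokens = self.encoder(self.embed(splats))
        return splats + self.head(tokens)   # residual update to the splat attributes

# Toy usage: refine 1024 splats of one scene.
refined = SplatRefiner()(torch.randn(1, 1024, SPLAT_DIM))
print(refined.shape)  # torch.Size([1, 1024, 14])
```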




DART: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control

October 2024 · 15 Reads

Text-conditioned human motion generation, which allows for user interaction through natural language, has become increasingly popular. Existing methods typically generate short, isolated motions based on a single input sentence. However, human motions are continuous and can extend over long periods, carrying rich semantics. Creating long, complex motions that precisely respond to streams of text descriptions, particularly in an online and real-time setting, remains a significant challenge. Furthermore, incorporating spatial constraints into text-conditioned motion generation presents additional challenges, as it requires aligning the motion semantics specified by text descriptions with geometric information, such as goal locations and 3D scene geometry. To address these limitations, we propose DART, a Diffusion-based Autoregressive motion primitive model for Real-time Text-driven motion control. Our model, DART, effectively learns a compact motion primitive space jointly conditioned on motion history and text inputs using latent diffusion models. By autoregressively generating motion primitives based on the preceding history and current text input, DART enables real-time, sequential motion generation driven by natural language descriptions. Additionally, the learned motion primitive space allows for precise spatial motion control, which we formulate either as a latent noise optimization problem or as a Markov decision process addressed through reinforcement learning. We present effective algorithms for both approaches, demonstrating our model's versatility and superior performance in various motion synthesis tasks. Experiments show our method outperforms existing baselines in motion realism, efficiency, and controllability. Video results are available on the project page: https://zkf1997.github.io/DART/.
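As a rough illustration of the autoregressive scheme, the sketch below runs a toy rollout: each step performs a short reverse-diffusion loop in a latent primitive space conditioned on the motion history and a text embedding, decodes a few frames, and slides the history window forward. The denoiser, decoder, and all dimensions are invented stand-ins, not the published DART architecture.

```python
# Minimal sketch of an autoregressive motion-primitive rollout (all components stubbed).
import torch

LATENT_DIM, FRAME_DIM, PRIMITIVE_LEN, HISTORY_LEN = 64, 69, 8, 2  # assumed sizes

def denoise(z_t, t, history, text_emb):
    # Stub for a latent-diffusion denoiser conditioned on history and text.
    return z_t * (1.0 - t)

def decode(z, history):
    # Stub decoder: latent primitive -> PRIMITIVE_LEN future frames.
    return z.new_zeros(PRIMITIVE_LEN, FRAME_DIM)

def rollout(text_emb, n_primitives=10, n_diffusion_steps=10):
    history = torch.zeros(HISTORY_LEN, FRAME_DIM)        # seed motion history
    motion = [history]
    for _ in range(n_primitives):
        z = torch.randn(LATENT_DIM)                      # start each primitive from noise
        for step in reversed(range(n_diffusion_steps)):
            z = denoise(z, step / n_diffusion_steps, history, text_emb)
        frames = decode(z, history)                      # new frames for this primitive
        motion.append(frames)
        history = frames[-HISTORY_LEN:]                  # last frames condition the next step
    return torch.cat(motion, dim=0)

print(rollout(torch.randn(512)).shape)  # torch.Size([82, 69])
```

Because each primitive depends only on the preceding frames and the current prompt, the text stream can change between steps, which is what enables the online, real-time control described in the abstract.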


Figure 3. Different Illumination Inference Methods. (a) During training and relighting, we use the first split-sum to compute the direct illumination. (b) During training, we use an MLP to predict the indirect illumination and blend it with the direct illumination using the occlusion probability. (c) During relighting, we use a second split-sum with one additional ray bounce to compute the indirect illumination.
Figure 5. Ablation study for indirect illumination.
Figure 10. Qualitative comparisons on material estimation. From top to bottom: toaster, helmet, car. In each figure, we show albedo on the left and roughness on the right.
RISE-SDF: a Relightable Information-Shared Signed Distance Field for Glossy Object Inverse Rendering

September 2024 · 9 Reads

In this paper, we propose a novel end-to-end relightable neural inverse rendering system that achieves high-quality reconstruction of geometry and material properties, thus enabling high-quality relighting. The cornerstone of our method is a two-stage approach for learning a better factorization of scene parameters. In the first stage, we develop a reflection-aware radiance field using a neural signed distance field (SDF) as the geometry representation and deploy an MLP (multilayer perceptron) to estimate indirect illumination. In the second stage, we introduce a novel information-sharing network structure to jointly learn the radiance field and the physically based factorization of the scene. For the physically based factorization, to reduce the noise caused by Monte Carlo sampling, we apply a split-sum approximation with a simplified Disney BRDF and a cube mipmap as the environment light representation. In the relighting phase, to enhance the quality of indirect illumination, we propose a second split-sum algorithm to trace secondary rays under the split-sum rendering framework. Furthermore, there is no dataset or protocol available to quantitatively evaluate the inverse rendering performance for glossy objects. To assess the quality of material reconstruction and relighting, we have created a new dataset with ground truth BRDF parameters and relighting results. Our experiments demonstrate that our algorithm achieves state-of-the-art performance in inverse rendering and relighting, with particularly strong results in the reconstruction of highly reflective objects.
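The split-sum approximation mentioned in the abstract factors the specular part of the rendering integral into a prefiltered environment lookup and a pre-integrated BRDF term. The sketch below shows that factorization in isolation and is not the paper's code: the environment query is stubbed (the paper uses a roughness-indexed cube mipmap), and the pre-integrated BRDF term uses Karis's analytic approximation rather than the paper's simplified Disney BRDF.

```python
# Minimal split-sum specular shading sketch (stubs and assumed shapes throughout).
import torch
import torch.nn.functional as F

def prefiltered_env(direction, roughness):
    # Stub: would query a roughness-indexed cube mipmap of the environment light.
    return torch.ones_like(direction)

def env_brdf_approx(n_dot_v, roughness):
    # Analytic approximation of the pre-integrated BRDF (scale, bias) terms (Karis).
    c0 = torch.tensor([-1.0, -0.0275, -0.572, 0.022])
    c1 = torch.tensor([1.0, 0.0425, 1.04, -0.04])
    r = roughness[..., None] * c0 + c1
    a004 = torch.minimum(r[..., 0] ** 2, torch.exp2(-9.28 * n_dot_v)) * r[..., 0] + r[..., 1]
    scale = a004 * -1.04 + r[..., 2]
    bias = a004 * 1.04 + r[..., 3]
    return scale[..., None], bias[..., None]

def split_sum_specular(normal, view_dir, roughness, f0):
    n, v = F.normalize(normal, dim=-1), F.normalize(view_dir, dim=-1)
    n_dot_v = (n * v).sum(-1)
    refl = 2.0 * n_dot_v[..., None] * n - v              # mirror reflection direction
    scale, bias = env_brdf_approx(n_dot_v.clamp(min=1e-4), roughness)
    return prefiltered_env(refl, roughness) * (f0 * scale + bias)

# Toy usage: shade 8 surface points with roughness in [0, 1] and dielectric F0.
spec = split_sum_specular(torch.randn(8, 3), torch.randn(8, 3), torch.rand(8), torch.full((8, 3), 0.04))
```

The second split-sum used for relighting, per the abstract and Figure 3(c), applies the same factorization once more along the reflected ray to account for one additional bounce of indirect light.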


Fig. 1: SplatFields regularizes 3D Gaussian Splatting (3DGS) [29] by predicting the splat features and locations via neural fields to improve reconstruction under unconstrained sparse views. We measure the spatial autocorrelation (Moran's I [48]) of splat features in local neighborhoods to assess their similarity and observe that the better reconstruction quality achieved by our method corresponds to a higher Moran's I. The figure presents the results of a static reconstruction from ten calibrated images from the Blender dataset [47]. Metrics are reported on the full test set; the rendered view is a novel view.
Fig. 2: Overview. SplatFields takes as input a point cloud (e.g., initialized from SfM [67]), for which it models the geometric attributes (position p_k, scale s_k, rotation O_k) and appearance attributes (color c_k, opacity α_k). These attributes represent the point set as 3D splats that are then rendered with the 3DGS rasterizer [29]. First, the point location set {p_k ∈ R^3}_{k=1}^K is encoded into features {f_k}_{k=1}^K by sampling a tri-plane representation generated by a CNN generator g_θ, which provides a deep structural prior [73] on the feature values. These features are then propagated through a deformation MLP f_Θ to refine the point locations p̂_k. The new point set, along with the features, is propagated through a series of compact neural fields to predict the properties of the rendering primitives {G_k}_{k=1}^K, which are rendered with respect to arbitrary viewpoints. During optimization, we adopt adaptive density control [29] to periodically prune and densify the point set. SplatFields seamlessly adapts to 4D reconstruction by conditioning the neural fields on the time step t and introducing an extra time-conditioned flow field. Gray blocks indicate learnable modules.
Impact of the spatial autocorrelation on static scene reconstruction. Results on the Owlii [80] dataset. See Section 5.1 for discussion.
Monocular reconstruction of dynamic sequences from the NeRF-DS dataset [82] with recent state-of-the-art methods. The forward slash in the FPS column separates the rendering speed without neural-network inference (rendering primitives extracted and stored for each frame) from the speed with neural-network inference.
Multi-view reconstruction of dynamic sequences from the Owlii dataset [47] under a varying number of input views. The reported metric is PSNR averaged across novel views. The forward slash in the FPS column separates the rendering speed without neural-network inference (rendering primitives extracted and stored for each frame) from the speed with neural-network inference.
SplatFields: Neural Gaussian Splats for Sparse 3D and 4D Reconstruction

September 2024 · 95 Reads

Digitizing 3D static scenes and 4D dynamic events from multi-view images has long been a challenge in computer vision and graphics. Recently, 3D Gaussian Splatting (3DGS) has emerged as a practical and scalable reconstruction method, gaining popularity due to its impressive reconstruction quality, real-time rendering capabilities, and compatibility with widely used visualization tools. However, the method requires a substantial number of input views to achieve high-quality scene reconstruction, introducing a significant practical bottleneck. This challenge is especially severe in capturing dynamic scenes, where deploying an extensive camera array can be prohibitively costly. In this work, we identify the lack of spatial autocorrelation of splat features as one of the factors contributing to the suboptimal performance of the 3DGS technique in sparse reconstruction settings. To address the issue, we propose an optimization strategy that effectively regularizes splat features by modeling them as the outputs of a corresponding implicit neural field. This results in a consistent enhancement of reconstruction quality across various scenarios. Our approach effectively handles static and dynamic cases, as demonstrated by extensive testing across different setups and scene complexities.
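In the terms of the abstract, the regularization amounts to predicting splat attributes with a shared neural field evaluated at each splat's position, rather than optimizing them as free per-splat parameters, so that nearby splats receive correlated features. The sketch below illustrates that idea with a positional-encoding MLP; the published model instead generates a tri-plane with a CNN and uses several compact fields plus a deformation network, so the architecture here is an assumption.

```python
# Minimal sketch: splat attributes as the output of a neural field (assumed architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplatField(nn.Module):
    def __init__(self, n_freqs=6, hidden=128):
        super().__init__()
        self.n_freqs = n_freqs
        in_dim = 3 + 3 * 2 * n_freqs                       # xyz + sin/cos positional encoding
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 1 + 3 + 4),              # color, opacity, scale, quaternion
        )

    def encode(self, p):
        freqs = 2.0 ** torch.arange(self.n_freqs, device=p.device) * torch.pi
        angles = p[..., None] * freqs                      # (N, 3, n_freqs)
        return torch.cat([p, angles.sin().flatten(-2), angles.cos().flatten(-2)], dim=-1)

    def forward(self, positions):                          # positions: (N, 3) splat centers
        out = self.mlp(self.encode(positions))
        color = out[..., :3].sigmoid()
        opacity = out[..., 3:4].sigmoid()
        scale = out[..., 4:7].exp()
        rotation = F.normalize(out[..., 7:11], dim=-1)
        return color, opacity, scale, rotation

# Toy usage: attributes for 4096 splat centers. Because all splats share the field's
# weights, their features vary smoothly in space, raising spatial autocorrelation.
color, opacity, scale, rot = SplatField()(torch.rand(4096, 3))
```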


Fig. 5: Zero-shot reconstruction results. Our approach achieves faithful surface reconstruction for images generated from text. These images are produced using a pre-trained text-to-image model [36].
Fig. 6: Ablation study on a Shell scene. We report the PSNR for each example at the top. Here, the fast model corresponds to the full model detailed in Table 3.
Ablations. See Section 5.2 for descriptions.
LaRa: Efficient Large-Baseline Radiance Fields

July 2024 · 18 Reads

Radiance field methods have achieved photorealistic novel view synthesis and geometry reconstruction, but they are mostly applied in per-scene optimization or small-baseline settings. While several recent works investigate feed-forward reconstruction with large baselines by utilizing transformers, they all operate with a standard global attention mechanism and hence ignore the local nature of 3D reconstruction. We propose a method that unifies local and global reasoning in transformer layers, resulting in improved quality and faster convergence. Our model represents scenes as Gaussian Volumes and combines this with an image encoder and Group Attention Layers for efficient feed-forward reconstruction. Experimental results demonstrate that our model, trained for two days on four GPUs, achieves high fidelity in reconstructing 360° radiance fields and robustness to zero-shot and out-of-domain testing.
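One way to read the "local and global reasoning" claim is a transformer block that first attends within small groups of volume tokens and then attends across all tokens. The block below is a sketch of that reading, not the released LaRa architecture; the group size, normalization placement, and layer layout are assumptions.

```python
# Minimal local-then-global attention block over volume tokens (assumed design).
import torch
import torch.nn as nn

class LocalGlobalBlock(nn.Module):
    def __init__(self, d_model=128, n_heads=4, group_size=64):
        super().__init__()
        self.group_size = group_size
        self.local_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, tokens):                              # tokens: (B, N, D), N divisible by group_size
        b, n, d = tokens.shape
        groups = tokens.reshape(b * n // self.group_size, self.group_size, d)
        h = self.norm1(groups)
        groups = groups + self.local_attn(h, h, h)[0]       # attention restricted to each local group
        tokens = groups.reshape(b, n, d)
        h = self.norm2(tokens)
        tokens = tokens + self.global_attn(h, h, h)[0]      # full attention across all tokens
        return tokens

# Toy usage: one scene represented by a 16x16x16 grid of volume tokens.
out = LocalGlobalBlock()(torch.randn(1, 16 ** 3, 128))
print(out.shape)  # torch.Size([1, 4096, 128])
```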





Citations (50)


... We compare our SAR3D with three categories of methods: single-image to 3D methods (Splatter-Image [68], OpenLRM [20, 26]), multi-view image to 3D methods (One-2-3-45 [40], Lara [9], CRM [79], LGM [70]), and text-conditioned 3D generation methods (Point-E, LN3Diff, Shap-E, 3DTopia; see Figure 6). ...

Reference:

SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE
LaRa: Efficient Large-Baseline Radiance Fields
  • Citing Chapter
  • November 2024

... Substantial research efforts have been directed towards robust 3D reconstruction with insufficient input views. First, some 3DGS variants regularize the Gaussian attributes through implicit bias in neural radiance fields (Mihajlovic et al., 2024) or geometry consistency terms (Huang et al., 2024a). Second, a number of methods attempt to exploit priors from external datasets. ...

SplatFields: Neural Gaussian Splats for Sparse 3D and 4D Reconstruction
  • Citing Chapter
  • October 2024

... A related area of research involves 3D reconstruction from sparse or monocular input views, where methods often need to hallucinate unseen content (Liu et al., 2023b). While hallucination can be beneficial for creative applications, it may be undesirable in settings that demand accurate reconstructions, such as 3D visualization of surgical procedures (Hein et al., 2024), and unnecessary in typical daily capture scenarios. ...

Creating a Digital Twin of Spinal Surgery: A Proof of Concept
  • Citing Conference Paper
  • June 2024

... PhysAvatar [36] also relies on knowing the ground truth human and cloth geometry, and the proposed method builds on a physics-based simulator and rendering engine to optimize 4D Gaussians. IntrinsicAvatar [29] uses explicit Monte Carlo ray tracing to model secondary shadow effects. To our knowledge, our work is the first to propose learning physical properties of clothed human avatars modeled using 3DGS. ...

IntrinsicAvatar: Physically Based Inverse Rendering of Dynamic Humans from Monocular Videos via Explicit Ray Tracing
  • Citing Conference Paper
  • June 2024

... This allows a user to move in front of a webcam and control a Gaussian-enabled character in a video game or virtual reality environment. Several other works also animate human figures with Gaussians [3,13,17,21,26]. To reduce the computational costs associated with visualizing Gaussians, Svitov et al. propose HAHA, which represents drivable avatars with a hybrid of Gaussians and mesh [23]. ...

3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting
  • Citing Conference Paper
  • June 2024

... To address this limitation, recent techniques leverage large datasets of 2D portrait images [45,65,92] and 3D scans [20,33,99,116] to train diffusion models that capture robust priors on human appearance, enabling the reconstruction of 2D [85], 3D [36,68], or 4D avatars [18] from a single reference image. Still, most diffusion-based methods focus on 2D representations [17,24,36,85,93], and inference with diffusion models is computationally expensive, which is a major obstacle to real-time rendering and animation. ...

Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation
  • Citing Conference Paper
  • June 2024

... Diffusion models show great potential in motion generation tasks (Tevet et al., 2022; Zhou et al., 2023; Rempe et al., 2023; Li et al., 2023a; Tanaka & Fujiwara, 2023; Xu et al., 2023; Diller & Dai, 2024; Xu et al., 2024b; Karunratanakul et al., 2024). The most direct approach to modeling p_θ(x_f | m, x_l) is to use a conditional diffusion model. ...

Optimizing Diffusion Noise Can Serve As Universal Motion Priors
  • Citing Conference Paper
  • June 2024

... Owing to the difficulty in curating sufficiently large and diverse real data for embodied tasks, many works [199], [200], [201] train large multimodal language models using synthetic datasets or augment real datasets with synthetically-generated egocentric data. Dedicated frameworks for generating (e.g., LEAP [202], EgoGen [203]) or annotating (e.g., PARSE-Ego4D [204]) synthetic egocentric data have been proposed. Generally, the target tasks and specific interactions the embodied AI needs to handle (e.g., navigation, manipulation, human-machine dialogue) are predetermined and a suitable dataset is selected or generated. ...

EgoGen: An Egocentric Synthetic Data Generator
  • Citing Conference Paper
  • June 2024

... Point clouds are also a commonly used representation for human avatars. DPF [42] and NPC [49] apply Point-NeRF [60] to produce explicit surface points and learn non-rigid deformation of human avatars from RGB videos. However, with the MLPs used in Point-NeRF, these methods still struggle with blurry rendering results and computational efficiency. ...

Dynamic Point Fields
  • Citing Conference Paper
  • October 2023