Huaizu Jiang’s research while affiliated with Northeastern University and other places


Publications (40)


Figure 4. Detailed analyses of our 4D optimization strategies. (a) Our Stage-2 refinement can effectively reduce the artifacts in dynamic NeRF caused by inconsistent novel-view video synthesis (NVVS). (b) We compute soft visibility maps as view-dependent loss weights based on surface normal estimates to further mitigate texture inconsistency. (c) The proposed progressive frame and orthogonal view sampling are shown to facilitate the learning of temporal deformation and capture better details in motion.
Figure 9. Dependency of reference multi-views. SV4D [72] relies on the reference multi-views produced by SV3D [64], which often conflicts with the later frames of the input video (e.g., jacket hood of the snowboarder and three arms of the dancing women) and leads to blurry outputs. In contrast, SV4D 2.0 can better leverage the information in all input frames to produce sharper and more faithful details.
Evaluation of 4D outputs on the ObjaverseDy dataset. SV4D 2.0 consistently outperforms baselines in all metrics.
Evaluation of NVVS on Consistent4D. SV4D 2.0 achieves a consistent performance gain on image and video metrics.
Evaluation of 4D outputs on the Consistent4D dataset. SV4D 2.0 achieves state-of-the-art visual quality (LPIPS, CLIP-S) and temporal smoothness (FVD-F) compared to prior methods.
SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation
  • Preprint
  • File available

March 2025 · 2 Reads

Chun-Han Yao · Yiming Xie · [...]
We present Stable Video 4D 2.0 (SV4D 2.0), a multi-view video diffusion model for dynamic 3D asset generation. Compared to its predecessor SV4D, SV4D 2.0 is more robust to occlusions and large motion, generalizes better to real-world videos, and produces higher-quality outputs in terms of detail sharpness and spatio-temporal consistency. We achieve this by introducing key improvements in multiple aspects: 1) network architecture: eliminating the dependency on reference multi-views and designing a blending mechanism for 3D and frame attention, 2) data: enhancing the quality and quantity of training data, 3) training strategy: adopting progressive 3D-4D training for better generalization, and 4) 4D optimization: handling 3D inconsistency and large motion via two-stage refinement and progressive frame sampling. Extensive experiments demonstrate a significant performance gain by SV4D 2.0 both visually and quantitatively, achieving better detail (-14% LPIPS) and 4D consistency (-44% FV4D) in novel-view video synthesis and 4D optimization (-12% LPIPS and -24% FV4D) compared to SV4D. Project page: https://sv4d2.0.github.io.
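The abstract lists a "blending mechanism for 3D and frame attention" without implementation details. Below is a minimal, hypothetical PyTorch sketch of one way such a blend could work: run self-attention across views (3D) and across frames (temporal) on the same token grid, then combine the two branches with a learned gate. The module name, tensor layout, and sigmoid gate are illustrative assumptions, not the SV4D 2.0 code.

```python
import torch
import torch.nn as nn

class BlendedSpatioTemporalAttention(nn.Module):
    """Hypothetical sketch: self-attention across views and across frames on a
    (batch, views, frames, tokens, dim) tensor, blended with a learned
    per-channel gate. Not the SV4D 2.0 implementation."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.view_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.frame_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learned blending weight, squashed to (0, 1) per channel.
        self.blend_logit = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, v, f, n, d = x.shape
        # 3D (cross-view) attention: attend over views for each frame/token.
        xv = x.permute(0, 2, 3, 1, 4).reshape(b * f * n, v, d)
        view_out, _ = self.view_attn(xv, xv, xv)
        view_out = view_out.reshape(b, f, n, v, d).permute(0, 3, 1, 2, 4)
        # Frame (temporal) attention: attend over frames for each view/token.
        xf = x.permute(0, 1, 3, 2, 4).reshape(b * v * n, f, d)
        frame_out, _ = self.frame_attn(xf, xf, xf)
        frame_out = frame_out.reshape(b, v, n, f, d).permute(0, 1, 3, 2, 4)
        # Blend the two branches on a shared residual path.
        alpha = torch.sigmoid(self.blend_logit)
        return x + alpha * view_out + (1.0 - alpha) * frame_out
```

A per-channel gate on a shared residual path is one simple way to trade off cross-view against temporal consistency; the actual blending used by the model may differ.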


Fig. 2: (a) Visual depiction of our masking scheme on a single CT scan slice: lighter patches, which mostly overlap with vessels (cyan) areas, are masked; the model can only see the darker areas during pre-training. (b) Illustration of our MAE pipeline, with both the CT scan and the distance map being reconstructed.
Fig. 3: 3D view and corresponding CTA images. Red: ground-truth aneurysm; Yellow: algorithm output; Blue: artery segmentation. Top row (all TP): Right MCA aneurysm (A, B), anterior communicating artery aneurysm (smaller) and left posterior communicating artery aneurysm (larger) (C,D), left ICA aneurysm (E, F). Bottom row: FP-basilar tip confluence (G, H), FP-posterior communicating artery infundibulum (I, J), FN-small ICA aneurysm (K, L)
Detection performance with an IoU threshold of t_IoU = 0.3 and assuming a fixed FPr = 0.5. We report Se@FPr curves to illustrate the threshold-agnostic performance of each model in Fig 1. Out-of-distribution (O.O.D.) datasets are highlighted. The private partition contains no healthy patients.
Anatomically-guided masked autoencoder pre-training for aneurysm detection

February 2025 · 22 Reads

Intracranial aneurysms are a major cause of morbidity and mortality worldwide, and detecting them manually is a complex, time-consuming task. Although automated solutions are desirable, the limited availability of training data makes it difficult to develop such solutions using typical supervised learning frameworks. In this work, we propose a novel pre-training strategy that uses more widely available unannotated head CT scan data to pre-train a 3D Vision Transformer model prior to fine-tuning for the aneurysm detection task. Specifically, we modify masked auto-encoder (MAE) pre-training in the following ways: we use a factorized self-attention mechanism to make 3D attention computationally viable, we restrict the masked patches to areas near arteries to focus on regions where aneurysms are likely to occur, and we reconstruct not only CT scan intensity values but also artery distance maps, which describe the distance between each voxel and the closest artery, thereby enhancing the backbone's learned representations. Compared with SOTA aneurysm detection models, our approach gains +4-8% absolute Sensitivity at a false positive rate of 0.5. Code and weights will be released.
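Two of the MAE modifications described above are concrete enough to sketch: restricting masked patches to voxels near arteries and reconstructing an artery distance map alongside CT intensities. The snippet below is a hedged illustration under assumed inputs (a precomputed voxel-wise distance-to-artery map and patchified predictions); the patch size, distance threshold, and loss weighting are placeholders, not the authors' settings.

```python
import torch
import torch.nn.functional as F

def select_vessel_patches(distance_map: torch.Tensor,
                          patch: int = 16,
                          max_dist_mm: float = 5.0) -> torch.Tensor:
    """Hypothetical masking rule: a 3D patch is a masking candidate if its minimum
    distance-to-artery is below a threshold, i.e. it lies near a vessel.
    distance_map: (D, H, W) voxel-wise distance to the closest artery in mm,
    with dimensions assumed divisible by `patch`. Returns a boolean patch grid."""
    d, h, w = distance_map.shape
    grid = distance_map.reshape(d // patch, patch,
                                h // patch, patch,
                                w // patch, patch)
    per_patch_min = grid.amin(dim=(1, 3, 5))   # (D/p, H/p, W/p)
    return per_patch_min < max_dist_mm          # near-vessel patches

def mae_dual_loss(pred_ct: torch.Tensor,
                  pred_dist: torch.Tensor,
                  target_ct: torch.Tensor,
                  target_dist: torch.Tensor,
                  masked: torch.Tensor,
                  dist_weight: float = 1.0) -> torch.Tensor:
    """Reconstruction loss over masked patches only, for both the CT intensities
    and the artery distance map. All prediction/target tensors are assumed to be
    patchified to (num_patches, voxels_per_patch); `masked` is a boolean vector
    over the patch dimension."""
    ct_loss = F.mse_loss(pred_ct[masked], target_ct[masked])
    dist_loss = F.mse_loss(pred_dist[masked], target_dist[masked])
    return ct_loss + dist_weight * dist_loss
```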


Diagnosing Human-Object Interaction Detectors

February 2025 · 5 Reads · 1 Citation

International Journal of Computer Vision

We have witnessed significant progress in human-object interaction (HOI) detection. However, relying solely on mAP (mean Average Precision) scores as a summary metric does not provide sufficient insight into the nuances of model performance (e.g., why one model outperforms another), which can hinder further innovation in this field. To address this issue, we introduce a diagnosis toolbox in this paper to offer a detailed quantitative breakdown of HOI detection models, inspired by the success of object detection diagnosis tools. We first conduct a holistic investigation into the HOI detection pipeline. By defining a set of errors and using oracles to fix each one, we quantitatively analyze the significance of different errors based on the mAP improvement gained from fixing them. Next, we explore the two key sub-tasks of HOI detection: human-object pair localization and interaction classification. For the pair localization task, we compute the coverage of ground-truth human-object pairs and assess the noisiness of the localization results. For the classification task, we measure a model’s ability to distinguish between positive and negative detection results and to classify actual interactions when human-object pairs are correctly localized. We analyze eight state-of-the-art HOI detection models, providing valuable diagnostic insights to guide future research. For instance, our diagnosis reveals that the state-of-the-art model RLIPv2 outperforms others primarily due to its significant improvement in multi-label interaction classification accuracy. Our toolbox is applicable across various methods and datasets and is available at https://neu-vi.github.io/Diag-HOI/.
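The core idea of the toolbox, fixing each defined error type with a ground-truth oracle and measuring the resulting mAP gain, can be expressed as a small evaluation loop. The sketch below is generic and hypothetical: `eval_map` and the oracle functions are placeholders standing in for the toolbox's actual error taxonomy and HOI mAP evaluation.

```python
from typing import Callable, Dict, List

# A detection is any dict holding human/object boxes, an interaction label, and
# a confidence score; eval_map is assumed to compute HOI mAP against ground truth.
Detection = Dict
Oracle = Callable[[List[Detection], List[Detection]], List[Detection]]

def diagnose(detections: List[Detection],
             ground_truth: List[Detection],
             eval_map: Callable[[List[Detection], List[Detection]], float],
             oracles: Dict[str, Oracle]) -> Dict[str, float]:
    """Quantify each error type by the mAP gained when an oracle corrects it.
    `oracles` maps an error name (e.g. 'human_box', 'object_box', 'verb_label')
    to a function that returns detections with that error fixed using GT."""
    base_map = eval_map(detections, ground_truth)
    gains = {}
    for name, fix in oracles.items():
        fixed = fix(detections, ground_truth)
        gains[name] = eval_map(fixed, ground_truth) - base_map
    return gains
```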


Rethinking Diffusion for Text-Driven Human Motion Generation

November 2024 · 8 Reads

Since 2023, Vector Quantization (VQ)-based discrete generation methods have rapidly come to dominate human motion generation, largely surpassing diffusion-based continuous generation methods on standard performance metrics. However, VQ-based methods have inherent limitations: representing continuous motion data as a limited set of discrete tokens leads to inevitable information loss, reduces the diversity of generated motions, and restricts their ability to function effectively as motion priors or generation guidance. In contrast, the continuous-space nature of diffusion-based methods makes them well suited to address these limitations, with additional potential for model scalability. In this work, we systematically investigate why current VQ-based methods perform well and examine the limitations of existing diffusion-based methods from the perspective of motion data representation and distribution. Drawing on these insights, we preserve the inherent strengths of a diffusion-based human motion generation model and gradually optimize it with inspiration from VQ-based approaches. Our approach introduces a human motion diffusion model that performs bidirectional masked autoregression, optimized with a reformed data representation and distribution. We also propose more robust evaluation methods to fairly compare methods built on different generation paradigms. Extensive experiments on benchmark human motion generation datasets demonstrate that our method outperforms previous approaches and achieves state-of-the-art performance.
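The "bidirectional masked autoregression" mentioned above is not spelled out in the abstract, but the general pattern, generating a motion sequence in rounds where each round predicts a subset of still-masked frames conditioned on all already-generated frames in both temporal directions, can be sketched as below. The scheduling, the `denoise_step` interface, and the tensor shapes are assumptions for illustration, not the paper's implementation.

```python
import torch

def masked_autoregressive_sample(denoise_step, num_frames: int, feat_dim: int,
                                 num_rounds: int = 8) -> torch.Tensor:
    """Hypothetical sketch of bidirectional masked autoregression over a motion
    sequence. `denoise_step(motion, known_mask, target_mask)` is a placeholder
    for the diffusion model's prediction of the target positions given the
    already-generated (known) frames on both sides."""
    motion = torch.zeros(num_frames, feat_dim)
    known = torch.zeros(num_frames, dtype=torch.bool)
    for r in range(num_rounds):
        remaining = (~known).nonzero(as_tuple=False).squeeze(1)
        if remaining.numel() == 0:
            break
        # Reveal a growing fraction of the remaining positions each round.
        k = max(1, int(remaining.numel() * (r + 1) / num_rounds))
        target = remaining[torch.randperm(remaining.numel())[:k]]
        target_mask = torch.zeros(num_frames, dtype=torch.bool)
        target_mask[target] = True
        # Predict the selected frames using bidirectional context, then mark known.
        motion[target_mask] = denoise_step(motion, known, target_mask)
        known |= target_mask
    return motion
```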





Fig. 1: End Point Error (EPE) of KITTI and Sintel datasets vs. Frames Per Second (FPS) throughput on an edge computing platform (Jetson Orin Nano). Individual points represent a broad class of optical flow methods. Our algorithm is comparable in accuracy but significantly more efficient, approaching an order of magnitude improvement in computational complexity. All models were trained solely on the FlyingThings and FlyingChairs datasets.
Fig. 2: Examples: We run NeuFlow v2 on unseen real-world images to showcase the model's generalization capabilities.
Fig. 3: NeuFlow v2 Architecture: We begin with a simple CNN backbone that outputs features and context at 1/8 and 1/16 scales for both images. The feature vectors at a 1/16 scale are then fed into cross-attention layers for feature enhancement. Next, we perform global matching to obtain an initial 1/16 flow, which is refined through one iteration. This flow is upsampled to a 1/8 scale and further refined over eight iterations. The refined 1/8 flow is then upsampled to full resolution using a convex upsampling module. The entire design follows the principle of global attention followed by local refinement. Details of the simple backbone and refinement module are presented in Figures 4 and 5.
Fig. 5: NeuFlow Simple RNN Refinement: We first compute the correlation within nearby pixels and warp these values using the currently estimated flow. The warped correlation, current estimated flow, context features, and hidden state are then fed into a series of 3x3 convolution layers followed by ReLU activation, repeated eight times. At the end of these layers, the network outputs both the refined flow and an updated hidden state for the next iteration. Instead of using GRU or LSTM modules, we simply use CNNs to generate the hidden state. A hard tanh function is applied to constrain the hidden state within a certain range, ensuring numerical stability.
NeuFlow v2: High-Efficiency Optical Flow Estimation on Edge Devices

August 2024 · 110 Reads

Real-time high-accuracy optical flow estimation is crucial for various real-world applications. While recent learning-based optical flow methods have achieved high accuracy, they often come with significant computational costs. In this paper, we propose a highly efficient optical flow method that balances high accuracy with reduced computational demands. Building upon NeuFlow v1, we introduce new components including a much more lightweight backbone and a fast refinement module. Both modules help keep the computational demands light while providing close to state-of-the-art accuracy. Compared to other state-of-the-art methods, our model achieves a 10x-70x speedup while maintaining comparable performance on both synthetic and real-world data. It is capable of running at over 20 FPS on 512x384 resolution images on a Jetson Orin Nano. The full training and evaluation code is available at https://github.com/neufieldrobotics/NeuFlow_v2.
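Fig. 5 describes the fast refinement module concretely: warped local correlation, the current flow, context features, and a hidden state pass through 3x3 conv + ReLU layers for eight iterations, with a hard tanh bounding the hidden state instead of a GRU/LSTM gate. The PyTorch sketch below follows that description with assumed channel widths; it is not the released NeuFlow v2 code, and the per-iteration correlation lookup/warping is left as an input for brevity.

```python
import torch
import torch.nn as nn

class SimpleRNNRefinement(nn.Module):
    """Sketch of the CNN-based recurrent refinement described in Fig. 5, with
    assumed channel sizes. Each iteration consumes warped correlation, current
    flow, context, and the hidden state, applies 3x3 conv + ReLU layers, and
    outputs a residual flow update plus a hard-tanh-bounded hidden state."""

    def __init__(self, corr_ch: int = 49, ctx_ch: int = 64, hid_ch: int = 64):
        super().__init__()
        in_ch = corr_ch + 2 + ctx_ch + hid_ch  # correlation + flow(2) + context + hidden
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 96, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(96, 96, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.to_flow = nn.Conv2d(96, 2, 3, padding=1)
        self.to_hidden = nn.Conv2d(96, hid_ch, 3, padding=1)

    def forward(self, corr, flow, context, hidden, iters: int = 8):
        for _ in range(iters):
            x = self.body(torch.cat([corr, flow, context, hidden], dim=1))
            flow = flow + self.to_flow(x)                        # residual flow update
            hidden = nn.functional.hardtanh(self.to_hidden(x))   # bounded hidden state
        return flow, hidden
```

Replacing a gated recurrent unit with plain convolutions plus a hard tanh is the design choice the caption highlights for keeping the module cheap and numerically stable on edge hardware.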


Towards Flexible Visual Relationship Segmentation

August 2024 · 6 Reads

Visual relationship understanding has been studied separately in human-object interaction (HOI) detection, scene graph generation (SGG), and referring relationships (RR) tasks. Given the complexity and interconnectedness of these tasks, it is crucial to have a flexible framework that can address them in a cohesive manner. In this work, we propose FleVRS, a single model that seamlessly integrates the above three aspects in standard and promptable visual relationship segmentation, and further supports open-vocabulary segmentation to adapt to novel scenarios. FleVRS leverages the synergy between text and image modalities to ground various types of relationships from images, using textual features from vision-language models for visual concept understanding. Empirical validation across various datasets demonstrates that our framework outperforms existing models in standard, promptable, and open-vocabulary tasks, e.g., +1.9 mAP on HICO-DET, +11.4 Acc on VRD, and +4.7 mAP on unseen HICO-DET. FleVRS represents a significant step towards a more intuitive, comprehensive, and scalable understanding of visual relationships.
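One common way to obtain the open-vocabulary capability the abstract mentions is to score predicted relationship embeddings against text embeddings of candidate relationship prompts from a frozen vision-language text encoder. The snippet below sketches that scoring step only; it is a generic illustration, not FleVRS's actual head, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def open_vocab_relation_logits(rel_embeds: torch.Tensor,
                               text_embeds: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical open-vocabulary relationship scoring: cosine similarity
    between predicted relationship embeddings and text embeddings of candidate
    <subject, predicate, object> prompts from a frozen text encoder.
    rel_embeds: (num_queries, dim); text_embeds: (num_classes, dim)."""
    rel = F.normalize(rel_embeds, dim=-1)
    txt = F.normalize(text_embeds, dim=-1)
    return rel @ txt.t() / temperature   # (num_queries, num_classes)
```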


Evaluation of 4D Outputs on the Consistent4D Dataset. SV4D can achieve better visual quality and video frame smoothness.
Evaluation of Different Sampling Strategies on the ObjaverseDy Dataset. SV4D sampling can effectively generate full image matrices with faithful consistency and visual details.
SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency

July 2024 · 44 Reads

We present Stable Video 4D (SV4D), a latent video diffusion model for multi-frame and multi-view consistent dynamic 3D content generation. Unlike previous methods that rely on separately trained generative models for video generation and novel view synthesis, we design a unified diffusion model to generate novel view videos of dynamic 3D objects. Specifically, given a monocular reference video, SV4D generates novel views for each video frame that are temporally consistent. We then use the generated novel view videos to optimize an implicit 4D representation (dynamic NeRF) efficiently, without the need for cumbersome SDS-based optimization used in most prior works. To train our unified novel view video generation model, we curated a dynamic 3D object dataset from the existing Objaverse dataset. Extensive experimental results on multiple datasets and user studies demonstrate SV4D's state-of-the-art performance on novel-view video synthesis as well as 4D generation compared to prior works.
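Because SV4D produces temporally consistent novel-view videos directly, the 4D stage can be plain photometric fitting of a dynamic NeRF to those generated frames, with no SDS-style score distillation. The loop below is a schematic sketch under assumed interfaces (a `render` function, a per-view camera list, and a nested list of generated frames); it is not the authors' optimization code.

```python
import torch

def fit_dynamic_nerf(render, params, generated_videos, cameras,
                     steps: int = 2000, lr: float = 1e-3):
    """Sketch of photometric 4D optimization: the generated novel-view videos
    serve directly as multi-view supervision. `render(params, camera, t)` is a
    placeholder that renders the dynamic NeRF at view `camera` and time `t`;
    `generated_videos[v][t]` is the matching generated frame as a tensor."""
    opt = torch.optim.Adam(params, lr=lr)
    num_views, num_frames = len(generated_videos), len(generated_videos[0])
    for step in range(steps):
        v = step % num_views                       # cycle through views
        t = torch.randint(num_frames, (1,)).item()  # sample a frame
        pred = render(params, cameras[v], t)
        loss = torch.nn.functional.mse_loss(pred, generated_videos[v][t])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return params
```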


Citations (20)


... Existing stereo matching algorithms often use deep CNNs or complex transformers for feature extraction, which limits the efficiency of the models. Inspired by NeuFlow [30], which uses a shallow CNN to extract features at 1/8 and ...

Reference:

ThermoStereoRT: Thermal Stereo Matching in Real Time via Knowledge Distillation and Attention-based Refinement
NeuFlow: Real-time, High-accuracy Optical Flow Estimation on Robots Using Edge Devices
  • Citing Conference Paper
  • October 2024

... As a result, the research community has striven to develop automated solutions that can assist clinicians in detecting aneurysms. Most such solutions are deep-learning-based, with some achieving over 90% Sensitivity with False Positive (FP) rates below 2 per scan [8,27]. Fig. 1: Lesion-level Sensitivity vs FP rate curve for our best model compared with three baselines, measured across four datasets. ...

Vessel-Aware Aneurysm Detection Using Multi-scale Deformable 3D Attention
  • Citing Chapter
  • October 2024

... To address the need for simultaneously controlling both content and style, some recent works have merged style encoding with diffusion-based motion generation. Among these, the most recent and representative approach [71] augments a pre-trained latent diffusion model [6] with a style adaptor and classifier-based style guidance, achieving stylized motion from textual prompts and motion-style references. While effective, this method relies on additional training branches, which shares structural similarities with ControlNet [64] as shown in Figure 1, which increases model complexity and training overhead. ...

SMooDi: Stylized Motion Diffusion Model
  • Citing Chapter
  • September 2024

... To verify the effectiveness of the mode in out-of-domains, we adapt our model to the PhraseCut dataset which contains the additional 1271 categories in the test split based on 80 in COCO. Following (Sun et al. 2024; Yu, Seo, and Son 2023; Han et al. 2024), we utilize the mean Intersection over Union (mIoU) for the RefCOCO series, a common metric for RIS. Following (Yu, Seo, and Son 2023; Wu et al. 2020), we report the overall Intersection over Union (oIoU) for the PhraseCut dataset. ...

Zero-Shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
  • Citing Conference Paper
  • June 2024

... Subsequent methods addressed this limitation. ImGeoNet [34] improved scene geometry estimation by supervising voxel weights, while PARQ [41] combined pixel-aligned appearance features with geometric information for iterative prediction refinement. NeRF-based methods, such as NeRF-RPN [11], use neural radiance fields to predict voxel opacity but suffer from complexity and underutilized multi-view advantages. ...

Pixel-Aligned Recurrent Queries for Multi-View 3D Object Detection
  • Citing Conference Paper
  • October 2023

... RapidFlow [29] combines efficient NeXt1D convolution blocks with a fully recurrent structure to decrease computational costs. DCVNet [30] proposes constructing cost volumes with different dilation factors to capture small and large displacements simultaneously. NeuFlow v1 [31], our previous work, is the fastest optical flow method, being over ten times faster than mainstream optical flow methods while maintaining comparable accuracy on the Sintel and FlyingThings datasets. ...

DCVNet: Dilated Cost Volume Networks for Fast Optical Flow
  • Citing Conference Paper
  • January 2023

... We evaluated the ability of two multimodal variants of GPT-4, gpt-4-vision-preview and gpt-4o, to perform few-shot relational concept learning. We evaluated these models using two datasets: Bongard-HOI (Jiang et al., 2022), involving action-and event-based relations (human-object interaction) in naturalistic scenes, and the Synthetic Visual Reasoning Test (SVRT) (Fleuret et al., 2011), involving visuospatial relations in synthetically generated images. Both datasets test the ability to learn abstract relational concepts from a relatively small number of demonstrations. ...

Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions
  • Citing Conference Paper
  • June 2022

... These techniques often use pre-trained deep learning models, enhanced with 2D features from text embeddings. Methods include task-specific convolutional backbones [21,56,94,118,164,166,194] and generative models like diffusion models [19,89,185] for single object 3D reconstructions. Lin et al. [83] use a Vision Transformer (ViT) pre-trained backbone to learn global features. ...

PlanarRecon: Realtime 3D Plane Detection and Reconstruction from Posed Monocular Videos

... The ability to understand object states and relationships is essential for a wide range of tasks in computer vision and robotics, including scene understanding, robotic manipulation, and high-level planning (Yao et al., 2018;Yuan et al., 2022). Earlier works that focus on a similar task of visual relationship detection learn to extract object-centric representations from raw images and make predictions based on them (Gkioxari et al., 2018;Yao et al., 2018;Ma et al., 2022;Yuan et al., 2022). A more recent approach by Yuan et al. (2022) specifically addresses state classification by extracting object-centric embeddings from RGB images and feeding them into trained networks to classify a set of predefined predicates. ...

RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning
  • Citing Article
  • April 2022

... In VQA, agents are required to combine natural language and visual cues to answer questions [1,17,19,23]. For 2D puzzles, tasks involve discovering relationships among visual elements and making inferences [22,26,39,58,59]. Physical dynamics prediction tasks require machines to perceive and reason about physical interactions [2,12,13,20]. ...

Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions
  • Citing Article
  • May 2022