Zhixiang Min’s research while affiliated with Stevens Institute of Technology and other places

Publications (10)


Geometric Viewpoint Learning with Hyper-Rays and Harmonics Encoding
  • Conference Paper

October 2023 · 3 Reads

Zhixiang Min · Juan Carlos Dibene ·

Figure 3. Illustration of the scene-centric and object-centric training schemes, which are shown to be an important design choice.
NeurOCS: Neural NOCS Supervision for Monocular 3D Object Localization
  • Preprint
  • File available

May 2023 · 37 Reads

Monocular 3D object localization in driving scenes is a crucial task, but challenging due to its ill-posed nature. Estimating 3D coordinates for each pixel on the object surface holds great potential, as it provides dense 2D-3D geometric constraints for the underlying PnP problem. However, high-quality ground-truth supervision is not available in driving scenes due to the sparsity and various artifacts of Lidar data, as well as the practical infeasibility of collecting per-instance CAD models. In this work, we present NeurOCS, a framework that uses instance masks and 3D boxes as input to learn 3D object shapes by means of differentiable rendering, which further serves as supervision for learning dense object coordinates. Our approach rests on the insight of learning a category-level shape prior directly from real driving scenes, while properly handling single-view ambiguities. Furthermore, we study and make critical design choices to learn object coordinates more effectively from an object-centric view. Altogether, our framework leads to a new state of the art in monocular 3D localization, ranking 1st on the KITTI-Object benchmark among published monocular methods.
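The dense object coordinates described in the abstract pair each masked pixel with a 3D point on the object surface, which is exactly the input a PnP solver needs. As a rough illustration only (not the authors' implementation; the function name, array shapes, and RANSAC threshold are assumptions), an object pose could be recovered from a predicted coordinate map like this:

```python
import numpy as np
import cv2

def localize_from_object_coords(coord_map, instance_mask, K):
    """Hypothetical helper: recover an object pose from a dense
    object-coordinate map via RANSAC-PnP.

    coord_map:     (H, W, 3) predicted 3D coordinates in the object frame
    instance_mask: (H, W) boolean mask of the object's pixels
    K:             (3, 3) camera intrinsics
    """
    ys, xs = np.nonzero(instance_mask)
    pts_2d = np.stack([xs, ys], axis=1).astype(np.float64)  # pixel locations
    pts_3d = coord_map[ys, xs].astype(np.float64)           # predicted 3D points

    # Dense 2D-3D correspondences feed a standard RANSAC-PnP solver.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d, pts_2d, K, None, reprojectionError=3.0)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)  # axis-angle to rotation matrix
    return R, tvec              # object-to-camera transform
```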



Figure 1. LASER diagram. Learning-based MCL frameworks encode camera pose hypotheses and query images into a common metric space to measure their similarities. Compared to existing works, LASER directly renders latent features and has a reduced sampling dimension.
Figure 6. Qualitative study on method robustness under challenging cases. Success, ambiguous and failure cases are placed inside green, yellow and red boxes respectively. The GT locations are circled in red and ambiguities are circled in yellow in the posterior maps, on which the floor maps are overlaid.
LASER: LAtent SpacE Rendering for 2D Visual Localization

March 2022 · 52 Reads

We present LASER, an image-based Monte Carlo Localization (MCL) framework for 2D floor maps. LASER introduces the concept of latent space rendering, where 2D pose hypotheses on the floor map are directly rendered into a geometrically-structured latent space by aggregating viewing ray features. Through a tightly coupled rendering codebook scheme, the viewing ray features are dynamically determined at rendering time based on their geometries (i.e. length, incident angle), endowing our representation with view-dependent fine-grained variability. Our codebook scheme effectively disentangles feature encoding from rendering, allowing the latent space rendering to run at speeds above 10 kHz. Moreover, through metric learning, our geometrically-structured latent space is common to both pose hypotheses and query images with arbitrary fields of view. As a result, LASER achieves state-of-the-art performance on large-scale indoor localization datasets (i.e. ZInD and Structured3D) for both panorama and perspective image queries, while significantly outperforming existing learning-based methods in speed.
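To make the rendering idea above concrete, here is a minimal sketch under assumed shapes and module names (the MLP codebook stand-in is an illustration, not the paper's architecture): each viewing ray of a pose hypothesis is mapped to a feature from its geometry, the ray features are pooled into a pose descriptor, and hypotheses are scored against a query-image embedding in the shared metric space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RayCodebook(nn.Module):
    """Hypothetical stand-in for a rendering codebook: ray geometry -> latent feature."""
    def __init__(self, dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, dim))

    def render_pose(self, ray_lengths, incident_angles):
        # ray_lengths, incident_angles: (num_rays,) geometry of rays cast on the floor map.
        geom = torch.stack([ray_lengths, incident_angles], dim=-1)
        ray_feats = self.mlp(geom)                        # (num_rays, dim) view-dependent features
        return F.normalize(ray_feats.mean(dim=0), dim=0)  # pooled pose descriptor

def score_hypothesis(pose_feat, image_feat):
    # Cosine similarity between a rendered pose hypothesis and the query-image embedding.
    return torch.dot(pose_feat, F.normalize(image_feat, dim=0))
```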



VOLDOR: Visual Odometry from Log-Logistic Dense Optical Flow Residuals

April 2021 · 131 Reads

We propose a dense indirect visual odometry method taking as input externally estimated optical flow fields instead of hand-crafted feature correspondences. We define our problem as a probabilistic model and develop a generalized-EM formulation for the joint inference of camera motion, pixel depth, and motion-track confidence. Contrary to traditional methods assuming Gaussian-distributed observation errors, we supervise our inference framework under an (empirically validated) adaptive log-logistic distribution model. Moreover, the log-logistic residual model generalizes well to different state-of-the-art optical flow methods, making our approach modular and agnostic to the choice of optical flow estimators. Our method achieves top-ranking results on both the TUM RGB-D and KITTI odometry benchmarks. Our open-source implementation is inherently GPU-friendly, with only linear computational and storage growth.
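As a purely illustrative sketch of the residual model described above (parameter values and function names are assumptions, not the paper's fitted model), optical-flow residuals can be scored under a log-logistic density and the resulting likelihoods used as soft per-pixel confidence weights in an EM-style update:

```python
import numpy as np

def log_logistic_pdf(x, alpha, beta):
    """Log-logistic density f(x; alpha, beta) for x > 0 (scale alpha, shape beta)."""
    x = np.maximum(x, 1e-12)
    r = (x / alpha) ** beta
    return (beta / alpha) * (x / alpha) ** (beta - 1.0) / (1.0 + r) ** 2

def flow_residual_weights(flow_pred, flow_obs, alpha=1.0, beta=2.0):
    """E-step-like weighting: likelihood of each pixel's flow residual magnitude."""
    residual = np.linalg.norm(flow_pred - flow_obs, axis=-1)  # (H, W) endpoint error
    lik = log_logistic_pdf(residual, alpha, beta)
    return lik / lik.max()  # normalized soft confidence weights
```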


Fig. 1: Reconstructed scene models. 3D point clouds are aggregated from keyframe depth maps.
Fig. 2: VOLDOR + SLAM architecture. 1) Input optical flow is externally computed from video, along with optional geometric priors. 2) A dense-indirect VO front-end estimates scene structure and local camera poses over a sliding window. 3) A pose graph enforces global consistency among all pairwise pose estimates. 4) The set of edges to include in the pose graph is prioritized based on keyframe geometric analysis aimed at both identifying loop closures and reinforcing local connectivity.
VOLDOR-SLAM: For the Times When Feature-Based or Direct Methods Are Not Good Enough

April 2021 · 262 Reads · 1 Citation

We present a dense-indirect SLAM system using external dense optical flows as input. We extend the recent probabilistic visual odometry model VOLDOR [Min et al. CVPR'20] by incorporating the use of geometric priors to 1) robustly bootstrap estimation from monocular capture, while 2) seamlessly supporting stereo and/or RGB-D input imagery. Our customized back-end tightly couples our intermediate geometric estimates with an adaptive priority scheme managing the connectivity of an incremental pose graph. We leverage recent advances in dense optical flow methods to achieve accurate and robust camera pose estimates, while constructing fine-grained, globally consistent dense environmental maps. Our open-source implementation [https://github.com/htkseason/VOLDOR] operates online at around 15 FPS on a single GTX 1080 Ti GPU.
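The adaptive priority scheme mentioned above can be pictured as a priority queue over candidate pose-graph edges. The sketch below is a simplified illustration under assumed data structures; the class name, scoring interface, and per-frame edge budget are hypothetical, not the system's actual back-end.

```python
import heapq

class PoseGraph:
    """Simplified incremental pose graph with prioritized edge insertion."""
    def __init__(self, edge_budget_per_frame=5):
        self.edges = []   # accepted (keyframe_i, keyframe_j, relative_pose) constraints
        self.budget = edge_budget_per_frame

    def add_keyframe(self, kf_id, candidates):
        # candidates: list of (other_kf_id, relative_pose, score), where a higher
        # score marks edges that close loops or reinforce local connectivity.
        heap = [(-score, other, rel) for other, rel, score in candidates]
        heapq.heapify(heap)
        for _ in range(min(self.budget, len(heap))):
            neg_score, other, rel = heapq.heappop(heap)
            self.edges.append((kf_id, other, rel))
        # A global optimization over all accepted pairwise constraints
        # (pose-graph optimization) would then enforce consistency.
```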


Citations (4)


... In contrast, our approach offers special treatment for detecting small motions instantaneously. Another possible way for moving object detection is through 3D detection [14,18,27,28,34,37,40,58] and tracking [19,54]. However, we found empirically that such methods stumble in identifying small motions due to imperfect object localization. ...

Reference:

Instantaneous Perception of Moving Objects in 3D
NeurOCS: Neural NOCS Supervision for Monocular 3D Object Localization
  • Citing Conference Paper
  • June 2023

... One line of research approaches localization within a traditional optimization framework, leveraging floor plans in combination with various sensor signals such as LiDAR (Boniardi et al. 2019a, 2017; Li, Ang, and Rus 2020; Mendez et al. 2020; Wang, Marcotte, and Olson 2019), images (Ito et al. 2014; Boniardi et al. 2019b), or visual odometry (Chu, Kim, and Chen 2015). Another branch adopts learning-based methods (Howard-Jenkins, Ruiz-Sarmiento, and Prisacariu 2021; Howard-Jenkins and Prisacariu 2022; Min et al. 2022; Chen et al. 2024) to solve the localization task. Some studies have also attempted to use augmented topological maps generated from sketch floor plans (Setalaphruk et al. 2003) or architectural floor plans (Li et al. 2021b) to aid the navigation. ...

LASER: LAtent SpacE Rendering for 2D Visual Localization
  • Citing Conference Paper
  • June 2022

... The most accurate methods often involve multi-modal sensor fusion, combining data from sensors like LiDARs and cameras to leverage their strengths. These approaches have shown strong performance by taking advantage of the dense and precise data these sensors provide [1-3, 24]. However, despite their advantages, both LiDARs and cameras present limitations that restrict their use in certain environments. ...

VOLDOR + SLAM: For the times when feature-based or direct methods are not good enough
  • Citing Conference Paper
  • May 2021

... [56,58,61] proposed deep networks to directly estimate ego pose between pairs of frames. [33,39,47,52] integrate learned representations (features or depth) into traditional ego-pose estimation pipelines. [19,51,59] imposed geometric constraints on ego-pose network outputs via differentiable optimization layers. ...

VOLDOR: Visual Odometry From Log-Logistic Dense Optical Flow Residuals
  • Citing Conference Paper
  • June 2020