Publications (281)
December 2024 · 2 Reads
December 2024 · 1 Read
We present a spatio-temporal perspective on category-agnostic 3D lifting of 2D keypoints over a temporal sequence. Our approach differs from existing state-of-the-art methods, which either (i) are object-agnostic but can only operate on individual frames, or (ii) can model space-time dependencies but are designed to work with only a single object category. Our approach is grounded in two core principles. First, when data about an object is scarce, general information from similar objects can be leveraged for better performance. Second, while temporal information is important, the most critical information lies in immediate temporal proximity. These two principles allow us to outperform current state-of-the-art methods on per-frame and per-sequence metrics for a variety of objects. Lastly, we release a new synthetic dataset containing 3D skeletons and motion sequences of a diverse set of animals. The dataset and code will be made publicly available.
November 2024
November 2024 · 1 Citation
October 2024 · 10 Reads
This paper challenges the conventional belief that softmax attention in transformers is effective primarily because it generates a probability distribution for attention allocation. Instead, we theoretically show that its success lies in its ability to implicitly regularize the Frobenius norm of the attention matrix during training. We then explore alternative activations that regularize the Frobenius norm of the attention matrix, demonstrating that certain polynomial activations can achieve this effect, making them suitable for attention-based architectures. Empirical results indicate that these activations perform comparably to or better than softmax across various computer vision and language tasks, suggesting new possibilities for attention mechanisms beyond softmax.
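The mechanism described is concrete enough to sketch. Below is a minimal PyTorch comparison of softmax attention against an element-wise polynomial activation; the cubic exponent and the division by sequence length are illustrative assumptions, not necessarily the paper's exact choices.

```python
import torch

def softmax_attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def poly_attention(q, k, v, p=3):
    # Element-wise polynomial in place of the row-wise softmax. Rows are no
    # longer probability distributions; the paper's claim is that implicit
    # norm regularization, not normalization, is what matters.
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    weights = scores ** p / scores.size(-1)  # sequence-length scaling (assumed)
    return weights @ v

q = k = v = torch.randn(2, 16, 64)        # (batch, tokens, dim)
print(poly_attention(q, k, v).shape)      # torch.Size([2, 16, 64])
```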
September 2024 · 1 Read
In this article, we introduce a novel normalization technique for neural network weight matrices, which we term weight conditioning. This approach aims to narrow the gap between the smallest and largest singular values of the weight matrices, resulting in better-conditioned matrices. The inspiration for this technique partially derives from numerical linear algebra, where well-conditioned matrices are known to facilitate stronger convergence results for iterative solvers. We provide a theoretical foundation demonstrating that our normalization technique smooths the loss landscape, thereby enhancing the convergence of stochastic gradient descent algorithms. Empirically, we validate our normalization across various neural network architectures, including Convolutional Neural Networks (CNNs), Vision Transformers (ViT), Neural Radiance Fields (NeRF), and 3D shape modeling. Our findings indicate that our normalization method is not only competitive but also outperforms existing weight normalization techniques from the literature.
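As a rough sketch of the stated goal (narrowing the singular-value gap), and not the paper's actual normalization, one can flatten a weight matrix's spectrum toward its mean, which provably shrinks the condition number:

```python
import torch

def cond(W):
    s = torch.linalg.svdvals(W)
    return (s.max() / s.min()).item()

def condition_weights(W, alpha=0.5):
    # Compress the singular-value spread: alpha=1 leaves W unchanged,
    # alpha=0 equalizes all singular values. A sketch of the "weight
    # conditioning" idea, not the authors' exact scheme.
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    S_new = S.mean() * (S / S.mean()) ** alpha
    return U @ torch.diag(S_new) @ Vh

W = torch.randn(256, 128)
print(cond(W), cond(condition_weights(W)))
```

With alpha = 0.5, the ratio of largest to smallest singular value drops from κ to √κ.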
July 2024
This paper tackles the simultaneous optimization of pose and Neural Radiance Fields (NeRF). Departing from the conventional practice of using explicit global representations for camera pose, we propose a novel overparameterized representation that models camera poses as learnable rigid warp functions. We establish that modeling rigid warps must be tightly coupled with appropriate constraints and regularization. Specifically, we highlight the critical importance of enforcing invertibility when learning rigid warp functions via a neural network, and propose the use of an Invertible Neural Network (INN) coupled with a geometry-informed constraint for this purpose. We present results on synthetic and real-world datasets and demonstrate that our approach outperforms existing baselines in pose estimation and high-fidelity reconstruction, owing to enhanced optimization convergence.
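One standard way to obtain invertibility by construction, as INNs require, is a coupling layer whose inverse is available in closed form. The sketch below (assuming an even feature dimension) illustrates only the invertibility property; the paper's warp parameterization and geometry-informed constraint are not reproduced here.

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    # NICE-style additive coupling layer: invertible by construction,
    # with an exact closed-form inverse. Assumes dim is even.
    def __init__(self, dim):
        super().__init__()
        half = dim // 2
        self.net = nn.Sequential(nn.Linear(half, 64), nn.ReLU(),
                                 nn.Linear(64, half))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat([x1, x2 + self.net(x1)], dim=-1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        return torch.cat([y1, y2 - self.net(y1)], dim=-1)

layer = AdditiveCoupling(6)
x = torch.randn(4, 6)
print(torch.allclose(layer.inverse(layer(x)), x, atol=1e-6))  # True
```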
June 2024 · 4 Reads
In this paper, we present a method for dynamic surface reconstruction of large-scale urban scenes from LiDAR. Depth-based reconstructions tend to focus on small-scale objects or on large-scale SLAM reconstructions that treat moving objects as outliers. We take a holistic perspective and optimize a compositional model of a dynamic scene that decomposes the world into rigidly moving objects and the background. To achieve this, we take inspiration from recent novel view synthesis methods and pose the reconstruction problem as a global optimization, minimizing the distance between our predicted surface and the input LiDAR scans. We show how this global optimization can be decomposed into registration and surface reconstruction steps, which are handled well by off-the-shelf methods without any re-training. By carefully modeling continuous-time motion, our reconstructions can compensate for the rolling-shutter effects of rotating LiDAR sensors. This yields the first system (to our knowledge) that properly motion-compensates LiDAR scans for rigidly moving objects, complementing widely used techniques for motion compensation of static scenes. Beyond pursuing dynamic reconstruction as a goal in and of itself, we also show that such a system can be used to auto-label partially annotated sequences and produce ground-truth annotations for hard-to-label problems such as depth completion and scene flow.
June 2024 · 6 Reads · 4 Citations
Citations (43)
... Pre-trained with abundant, easily collectible unlabeled data, the detector backbone has been shown to largely reduce the labeled data needed for fine-tuning (Pan et al., 2024; Yin et al., 2022; Xie et al., 2020b). Label-free 3D object detection from point clouds has gained attention due to its effective data utilization (You et al., 2022a;b; Luo et al., 2023; Najibi et al., 2022; Choy et al., 2019; Baur et al., 2024; Seidenschwarz et al., 2024; Yang et al., 2021) and its generalization beyond the specific class information seen during training (Najibi et al., 2023). Orthogonally, we study a new scenario that learns the detector in a label-efficient way by considering more than a single source of information. ...
- Citing Conference Paper
June 2024
... We project the 2D skeletons to D-dimensional features F ∈ ℝ^{T×J×D} using Random Fourier Features (RFF) [32]. We additionally encode each 2D joint (x, y) with its temporal location t in the video. ...
- Citing Conference Paper
March 2024
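The RFF encoding quoted in the snippet above follows the standard random Fourier feature recipe. A minimal NumPy sketch; the feature dimension, the Gaussian scale of the projection, and appending t before projecting are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256                             # target feature dimension (assumed)
B = rng.normal(size=(3, D // 2))    # fixed random projection over (x, y, t)

def rff_encode(joints_xy, t):
    # joints_xy: (J, 2) keypoints of one frame; t: scalar frame index.
    # gamma(v) = [cos(2*pi*v@B), sin(2*pi*v@B)] -> (J, D) features.
    v = np.concatenate([joints_xy, np.full((len(joints_xy), 1), t)], axis=1)
    proj = 2 * np.pi * v @ B
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1)

frame = rng.uniform(size=(17, 2))    # e.g. 17 2D joints
print(rff_encode(frame, t=5).shape)  # (17, 256)
```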
... It is an important tool for analyzing scene dynamics and has been extensively studied in computer vision [10,17,23,25,33,41,52,57]. While the scene flow of background points, being the dominant rigid motion, may be reliably estimated [8], accurately estimating the motion of dynamic foreground objects remains a challenge. This has led to object-aware scene flow works [3,22,23,48] that leverage the rigidity prior of objects. However, nearest-neighbor-based motion estimation such as scene flow suffers from the inherent ambiguity introduced by the "swimming" effect [21] of LiDAR point clouds, which is more severe at smaller motion magnitudes. ...
- Citing Conference Paper
March 2024
... As discussed in NTP's original paper (Wang et al., 2022a) and confirmed by our experiments, NTP struggles to converge beyond 25 frames, so we only compare against it in a 20-frame setting. As is typical in the scene flow literature (Chodosh et al., 2023), we perform ego-motion compensation and ground-point removal on both Argoverse 2 and Waymo Open using the dataset-provided map and ego poses. EulerFlow's dominant performance also holds on Waymo Open (Sun et al., 2020); we compare against several popular methods (Figure 5), and EulerFlow again outperforms the baselines by a wide margin, more than halving the error of the next best method. ...
- Citing Conference Paper
January 2024
... For a detailed discussion of the curvature expressions and their relationship with the first-order (gradient) and second-order (Hessian) information of functions, please refer to [22]. This concept is closely related to the mathematical constructs of the Hessian matrix and gradients, which are pivotal in Newton-type methods for optimizing the step size and enhancing convergence [13,41,51]. This is crucial for optimizing parameters with varying scales and has proven effective in diverse machine learning tasks [54]. ...
- Citing Conference Paper
October 2023
... S-NeRF [43] densifies per-frame sparse LiDAR scans via a depth completion network, which is used as pseudo-guidance for depth renderings. Another LiDAR-based NeRF [7] builds a LiDAR map as its scene model. However, its proposed rendering method yields sparse images, and dynamic objects such as cars, which are common in urban scenes, are not handled. ...
- Citing Conference Paper
October 2023
... Scene flow estimation then becomes the task of estimating this PDE. We can straightforwardly represent this PDE estimate with a neural prior (Li et al., 2021b) and optimize it against scene flow surrogate objectives, both over single frame pairs and across arbitrary time intervals, unlocking new optimization objectives that produce better-quality estimates. We formalize this in Section 3 and propose the Scene Flow via PDE framework. ...
- Citing Conference Paper
October 2023
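The neural-prior idea referenced in the snippet above can be sketched compactly: a coordinate MLP stands in for the PDE's velocity field and is optimized per sequence against a surrogate objective such as a one-sided Chamfer distance. Everything below (architecture, objective, names) is an illustrative assumption rather than the cited method's exact setup.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    # u(x, t): a coordinate MLP standing in for the PDE estimate, fitted
    # per sequence with no weights shared across scenes.
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3))

    def forward(self, x, t):
        tt = torch.full_like(x[:, :1], t)       # broadcast time to each point
        return self.mlp(torch.cat([x, tt], dim=-1))

def chamfer(a, b):
    # One-sided Chamfer distance as a surrogate flow objective.
    return torch.cdist(a, b).min(dim=1).values.mean()

field = VelocityField()
pts_t, pts_t1 = torch.randn(512, 3), torch.randn(512, 3)
# One Euler step of the estimated PDE advects points from frame t to t+1:
loss = chamfer(pts_t + field(pts_t, t=0.0), pts_t1)
loss.backward()
```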
... Motion prediction in dynamic scene reconstruction from multi-view or monocular videos is an ill-posed problem. Early dynamic NeRFs [15,38,39,52] directly learn motion patterns from individual timestamps, and several later works propose motion flow regularization terms [15,26,30,35,40,45,46,51,55,70] to promote the learning of cross-time motion patterns. These methods typically utilize 2D prior information (e.g., optical flow) from pre-trained networks [50] to supervise the scene flow within neighboring frames. ...
- Citing Conference Paper
June 2023
... Initial follow-up work focused on reducing the encoding time [35,38] or extending the capabilities of these codecs to various data modalities [12,19,39]. Better architectures, more refined quantization, and domain-specific optimizations have improved compression performance for both images [16-18,36] and video [14,23,24,27,30]. However, many of these improvements come at the cost of increased computational complexity at decoding time, weakening one of the key advantages of overfitted codecs in the first place. ...
- Citing Conference Paper
January 2023
... SfM-free novel view synthesis, for both NeRFs and 3D Gaussian splatting, is a class of works that do away with camera poses known or estimated via SfM. Examples include i-NeRF [36], which estimates camera poses by aligning keypoints using a pre-trained NeRF. Follow-ups like NeRFmm [32], SiNeRF [34], BARF [21], and GARF [7] learn the NeRF model and camera pose embeddings simultaneously [32], addressing gradient inconsistency [7,21] and leveraging pre-trained networks for monocular depth estimation or optical flow, prior geometric knowledge, or correspondence information [4,6,23]. For 3D Gaussian splatting, CF-3DGS [11], GGRt [20], InstantSplat [9], and COGS [14] were developed to support SfM-free optimization. ...
- Citing Chapter
November 2022
Lecture Notes in Computer Science