Fig 10 - uploaded by Xingyi Yang
Source publication
Most 3D reconstruction methods may only recover scene properties up to a global scale ambiguity. We present a novel approach to single view metrology that can recover the absolute scale of a scene represented by 3D heights of objects or camera height above the ground as well as camera parameters of orientation and field of view, using just a...
Citations
... Weakly supervised learning is the practice of constructing prediction models with limited supervision. Zhu et al. [18] propose a weakly supervised calibration technique for single-view metrology in unconstrained environments, where only one image of a scene with objects of uncertain sizes is available. Their work utilizes 2D object annotations from extensive databases, which often contain people and buildings. ...
The process of determining camera settings to deduce geometric attributes from recorded sequences is known as camera calibration. This process is essential in the fields of robotics and computer vision, encompassing both two-dimensional and three-dimensional applications. Traditional calibration methods, however, are time-consuming and require specific expertise. Recent work has demonstrated that learning-based systems can replace the monotonous tasks associated with manual calibration. Solutions are examined across a range of learning strategies, network architectures, geometric assumptions, and datasets. This paper offers a thorough examination of learning-based camera calibration methods, assessing their advantages and disadvantages. The primary categories of calibration presented are the standard pinhole camera model, the distortion camera model, the cross-sensor model, and the cross-view model. These categories align with current research trends and have diverse applications. As there is no existing standard in this field, a large calibration dataset has been created, which can serve as a public platform to assess the effectiveness of current methods. This collection consists of both synthetic and real data, including images and videos captured by various cameras in different locations. The remaining difficulties are analyzed, and alternative avenues for further research are suggested. This survey is the first to cover learning-based camera calibration methods spanning a period of eight years. Our findings indicate that learning-based methods significantly reduce the time and expertise required for calibration while maintaining or improving accuracy compared to traditional methods. Specifically, our research demonstrates a calibration error reduction of up to 20% and speed improvements by a factor of three compared to traditional methods, as well as better adaptability to different camera types and environments.
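As a point of reference for the standard pinhole camera model category mentioned in this survey, the following sketch (an illustration under assumed focal length and principal point values, not code from the paper) projects a 3D point expressed in camera coordinates into pixel coordinates:

```python
import numpy as np

def project_pinhole(point_cam, f, cx, cy):
    """Project a 3D point (camera coordinates: x right, y down, z forward)
    to pixel coordinates with a distortion-free pinhole model."""
    K = np.array([[f, 0.0, cx],
                  [0.0, f, cy],
                  [0.0, 0.0, 1.0]])
    x = K @ np.asarray(point_cam, dtype=float)  # homogeneous image point
    return x[:2] / x[2]                         # perspective division

# Hypothetical values: f = 1000 px, principal point at (640, 360)
print(project_pinhole([0.5, -0.2, 2.0], f=1000.0, cx=640.0, cy=360.0))  # [890., 260.]
```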
... Although there are automated camera calibration algorithms based on roadside monitoring cameras [9], [10], such methods mainly rely on vanishing-point detection from vehicles and assume that vehicle trajectories are approximately straight or that the road is straight, making the calibration accuracy susceptible to the influence of vehicle trajectories. In addition to traditional automatic camera calibration methods, researchers have proposed numerous deep learning-based camera calibration methods in recent years [11]-[14]. These methods are trained on large-scale public datasets and offer advantages such as being unaffected by traffic flow and requiring no manual input or prior scene information. ...
Multi-modal sensor fusion plays a vital role in achieving high-quality roadside perception for intelligent traffic monitoring. Unlike on-board sensors in autonomous driving, roadside sensors present heightened calibration complexity, posing a challenge to spatial alignment for data fusion. Existing spatial alignment methods typically focus on one-to-one alignment between cameras and radar sensors and require precise calibration. However, when applied to large-scale roadside monitoring networks, these methods can be difficult to implement and may be vulnerable to environmental influences. In this paper, we present a spatial alignment framework that utilizes geolocation cues to enable multi-view alignment across distributed multi-sensor systems. In this framework, a deep learning-based camera calibration model combined with angle and distance estimation is used for monocular geolocation estimation. A camera parameter approaching method is then used to search for pseudo camera parameters that can tolerate the calibration errors that are inevitable in practice. Finally, the geolocation information is used for data association between Light Detection and Ranging (LiDAR) sensors and cameras. The framework has been deployed and tested at several intersections in Hangzhou. Experimental results show geolocation estimation errors of less than 1.1 m for vehicles traversing the monitored zone, demonstrating that the framework can accomplish spatial alignment in a single execution and can be applied to large-scale roadside sensor fusion scenarios.
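The monocular geolocation step described above can be pictured as intersecting the viewing ray of a pixel with the ground plane once the camera height and pitch are known; the sketch below is a simplified, hedged illustration (the frame conventions and the paper's exact formulation may differ):

```python
import numpy as np

def pixel_to_ground(u, v, f, cx, cy, cam_height, pitch):
    """Back-project pixel (u, v) onto a flat ground plane for a camera
    mounted `cam_height` metres above the ground and pitched down by
    `pitch` radians. Camera frame: x right, y down, z forward."""
    ray_cam = np.array([(u - cx) / f, (v - cy) / f, 1.0])
    c, s = np.cos(pitch), np.sin(pitch)
    # Rotate the ray into a world-aligned frame whose z-axis is horizontal
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0,   c,   s],
                  [0.0,  -s,   c]])
    ray_w = R @ ray_cam
    if ray_w[1] <= 0:
        return None  # ray points at or above the horizon, never hits the ground
    t = cam_height / ray_w[1]
    return t * ray_w[2], t * ray_w[0]  # (forward distance, lateral offset) in metres
```

Converting this camera-relative offset into an absolute geolocation would additionally require the camera's surveyed position and heading, which the sketch omits.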
... The common approach to estimating extreme 3D rotations for images with very limited or no overlap, as in Fig. 1, relates to the seminal work by Coughlan and Yuille [10], who introduced a technique premised on linear structures in an image arising primarily from three mutually orthogonal directions: one vertical (building walls) and two horizontal (ground-level pavements, roads, etc.). Similarly, the "Single View Metrology" of Criminisi et al. [11] and its extensions [59,25,40] use parallel lines in the image and their corresponding vanishing points [20] for camera calibration. Moreover, the relative rotation of a camera can also be estimated from illumination cues [2], by analyzing the directions of the lighting and cast shadows. ...
The estimation of large and extreme image rotation plays a key role in multiple computer vision domains, where the rotated images are related by a limited or a non-overlapping field of view. Contemporary approaches apply convolutional neural networks to compute a 4D correlation volume to estimate the relative rotation between image pairs. In this work, we propose a cross-attention-based approach that utilizes CNN feature maps and a Transformer-Encoder, to compute the cross-attention between the activation maps of the image pairs, which is shown to be an improved equivalent of the 4D correlation volume, used in previous works. In the suggested approach, higher attention scores are associated with image regions that encode visual cues of rotation. Our approach is end-to-end trainable and optimizes a simple regression loss. It is experimentally shown to outperform contemporary state-of-the-art schemes when applied to commonly used image rotation datasets and benchmarks, and establishes a new state-of-the-art accuracy on these datasets. We make our code publicly available.
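To make the vanishing-point route mentioned in the citation context concrete, here is a small, hedged sketch (generic geometry, not taken from any of the cited papers) that intersects two segments imaging parallel 3D lines to get a vanishing point, and recovers the focal length from two vanishing points of orthogonal directions under a known principal point and square pixels:

```python
import numpy as np

def vanishing_point(seg_a, seg_b):
    """Intersect two image segments (pairs of pixel endpoints) in
    homogeneous coordinates; if they image parallel 3D lines, the
    intersection is their vanishing point."""
    line = lambda p, q: np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])
    v = np.cross(line(*seg_a), line(*seg_b))
    return v[:2] / v[2]

def focal_from_orthogonal_vps(v1, v2, pp):
    """Focal length (pixels) from vanishing points of two orthogonal
    3D directions, assuming square pixels and principal point `pp`."""
    d = -np.dot(np.asarray(v1, float) - pp, np.asarray(v2, float) - pp)
    return np.sqrt(d) if d > 0 else None  # None: inconsistent with the model
```

In practice the principal point is often taken as the image centre, and robust estimators are used because real segment detections are noisy.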
... Conventional approaches to single-image camera calibration rely on detecting reference objects in the scene, such as a calibration grid [3] or co-planar circles [4]. Other methods take advantage of vanishing-point properties by carefully selecting parallel or orthogonal segments in the 3D scene [5]. However, most of these methods use classic image processing techniques to detect geometric cues, which makes them inapplicable in unstructured environments. ...
Although recent deep learning-based calibration methods can predict extrinsic and intrinsic camera parameters from a single image, their generalization remains limited by the number and distribution of training data samples. Moreover, the huge computational and memory requirements prevent convolutional neural networks (CNNs) from being deployed in resource-constrained environments. This challenge motivated us to train a CNN gradually, learning from new data while maintaining performance on previously learned data. Our approach builds upon a CNN architecture that automatically estimates camera parameters (focal length, pitch, and roll), using different incremental learning strategies to preserve knowledge when updating the network for new data distributions. Precisely, we adapt four common incremental learning methods, namely LwF, iCaRL, LUCIR, and BiC, by modifying their loss functions for our regression problem. We evaluate on two datasets containing 299,008 indoor and outdoor images. The experimental results were significant and indicate which method is better suited to camera calibration estimation.
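The adaptation of classification-oriented incremental learners to this regression setting can be illustrated with an LwF-style objective in which the usual distillation cross-entropy is replaced by an L2 term anchoring the new model to the frozen old model's outputs. This is a hedged sketch; the actual loss modifications and weightings in the paper may differ.

```python
import torch
import torch.nn.functional as F

def lwf_regression_loss(new_pred, target, old_pred, lam=1.0):
    """Learning-without-Forgetting adapted to regression: fit the new
    data while penalizing drift from the frozen old model's predictions
    (knowledge distillation expressed as an L2 term)."""
    task_loss = F.mse_loss(new_pred, target)                # new-task regression
    distill_loss = F.mse_loss(new_pred, old_pred.detach())  # preserve old behaviour
    return task_loss + lam * distill_loss
```

Here old_pred would come from a frozen copy of the network evaluated on the current batch; rehearsal-based methods such as iCaRL and BiC would additionally mix in stored exemplars.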
... Hoo [10] reported road surveying from a single view based on geometric information such as planar homography, vanishing points, and vanishing lines. Zhu [11] utilized data-driven priors learned by a deep network to obtain weakly supervised constraints and recover the absolute scale of a scene in the wild from a single image. ...
Therapeutic elastic gloves are of great help in the treatment of hand burns and scalds, helping superficial and mild burns heal quickly without residual scars or damage to a patient's functional ability. However, the hand data used to create elastic gloves for burns and scalds are usually measured manually, a method with high cost, large error, and unreliable results. In this paper, we propose an image-based parameter measurement method and establish a portable measuring system for finger and palm parameters; these parameters are then applied to create therapeutic gloves for burn and scald treatment. The proposed method provides an accurate and rapid measurement of the finger and palm parameters. Experimental results on normal and injured hand parameters show the effectiveness of the proposed method.
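At its simplest, image-based hand measurement of this kind scales pixel distances by a reference object of known physical size lying in roughly the same plane as the hand; the sketch below illustrates that idea only (the paper's actual calibration and measurement pipeline may be more involved):

```python
import numpy as np

def mm_per_pixel(ref_pixel_length, ref_mm_length):
    """Scale factor from a reference object of known size imaged in
    (approximately) the same plane as the hand."""
    return ref_mm_length / ref_pixel_length

def measure_mm(p0, p1, scale):
    """Physical distance between two annotated image points."""
    return float(np.linalg.norm(np.asarray(p1, float) - np.asarray(p0, float))) * scale

# Hypothetical numbers: a 50 mm marker spans 200 px; a finger spans 310 px
scale = mm_per_pixel(200.0, 50.0)
print(measure_mm((100, 40), (100, 350), scale))  # 77.5 mm
```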
... For the perspective reconstruction, we consider the following mechanism for calculation [49]. The calculation equation is: ...
... Günel et al. [19] introduce the IMDB-23K dataset by gathering publicly available celebrity images and their height information. Zhu et al. [74] use this dataset to learn to predict the height of people in images. Dey et al. [13] estimate the height of users in a photo collection by computing height differences between people in an image, creating a graph that links people across photos, and solving a maximum likelihood estimation problem. ...
While methods that regress 3D human meshes from images have progressed rapidly, the estimated body shapes often do not capture the true human shape. This is problematic since, for many applications, accurate body shape is as important as pose. The key reason that body shape accuracy lags pose accuracy is the lack of data. While humans can label 2D joints, and these constrain 3D pose, it is not so easy to "label" 3D body shape. Since paired data with images and 3D body shape are rare, we exploit two sources of information: (1) we collect internet images of diverse "fashion" models together with a small set of anthropometric measurements; (2) we collect linguistic shape attributes for a wide range of 3D body meshes and the model images. Taken together, these datasets provide sufficient constraints to infer dense 3D shape. We exploit the anthropometric measurements and linguistic shape attributes in several novel ways to train a neural network, called SHAPY, that regresses 3D human pose and shape from an RGB image. We evaluate SHAPY on public benchmarks, but note that they either lack significant body shape variation, ground-truth shape, or clothing variation. Thus, we collect a new dataset for evaluating 3D human shape estimation, called HBW, containing photos of "Human Bodies in the Wild" for which we have ground-truth 3D body scans. On this new benchmark, SHAPY significantly outperforms state-of-the-art methods on the task of 3D body shape estimation. This is the first demonstration that 3D body shape regression from images can be trained from easy-to-obtain anthropometric measurements and linguistic shape attributes. Our model and data are available at: shapy.is.tue.mpg.de
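The Dey et al. approach summarized in the citation context above (pairwise height differences linked across photos, solved by maximum likelihood) reduces, under a Gaussian noise assumption on the differences, to a linear least-squares problem; below is a minimal sketch with hypothetical person indices and one anchor of known height:

```python
import numpy as np

def solve_heights(n_people, diffs, anchor_idx, anchor_height):
    """Recover absolute heights from noisy pairwise differences.
    `diffs` holds tuples (i, j, d) meaning height[i] - height[j] ~ d metres;
    one person of known height fixes the global offset."""
    rows, rhs = [], []
    for i, j, d in diffs:
        r = np.zeros(n_people)
        r[i], r[j] = 1.0, -1.0
        rows.append(r)
        rhs.append(d)
    anchor = np.zeros(n_people)
    anchor[anchor_idx] = 1.0
    rows.append(anchor)
    rhs.append(anchor_height)
    heights, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return heights

# Hypothetical: person 0 is 1.80 m, 0.10 m taller than person 1, who is 0.05 m taller than person 2
print(solve_heights(3, [(0, 1, 0.10), (1, 2, 0.05)], anchor_idx=0, anchor_height=1.80))
```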
... Detection supports many downstream tasks [47,64,71]. When considering detection for evolving tasks, however, passive learning is not enough. ...
... Other detectors treat detection as a single regression problem [52] or use a transformer architecture [58] to predict all detections in parallel [7]. Detection also supports many downstream vision tasks such as segmentation [22], 3D shape prediction [14], depth [17] and pose estimation [47,64], and single-view metrology [71], to name but a few. In this work, we continue this progress and introduce a novel approach to object manipulation that operates directly from detection. ...
This paper addresses the problem of mobile robot manipulation of novel objects via detection. Our approach uses vision and control as complementary functions that learn from real-world tasks. We develop a manipulation method based solely on detection, then introduce task-focused few-shot object detection to learn new objects and settings. The current paradigm for few-shot object detection uses existing annotated examples. In contrast, we extend this paradigm with active data collection and annotation selection that improves performance for specific downstream tasks (e.g., depth estimation and grasping). In experiments with our interactive approach to few-shot learning, we train a robot to manipulate objects directly from detection (ClickBot). ClickBot learns visual servo control from a single click of annotation, grasps novel objects in clutter and other settings, and achieves state-of-the-art results on an existing visual servo control and depth estimation benchmark. Finally, we establish a task-focused few-shot object detection benchmark to support future research: https://github.com/griffbr/TFOD.
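Detection-driven visual servoing of the kind ClickBot performs can be caricatured with the standard proportional image-based control law, which drives the detected object's pixel centre toward a goal point; this is a generic textbook illustration, not the paper's controller:

```python
import numpy as np

def ibvs_step(detected_center, goal_center, gain=0.5):
    """One proportional visual-servo update: return an image-plane
    velocity command that shrinks the pixel error between the detected
    object centre and the desired centre (e.g. the image centre)."""
    error = np.asarray(detected_center, float) - np.asarray(goal_center, float)
    return -gain * error

# Hypothetical detection at (400, 260) with the goal at the image centre (320, 240)
print(ibvs_step((400, 260), (320, 240)))  # [-40. -10.]
```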
... Then, they train a CNN to regress from a set of synthetic images I to their (known) focal lengths f. Typically, training images are generated by taking crops with the desired focal lengths from 360-degree panoramas [27,28]. While this can be done for any kind of image and does not require image sequences, it does require access to panoramic images. ...
Camera calibration is integral to robotics and computer vision algorithms that seek to infer geometric properties of the scene from visual input streams. In practice, calibration is a laborious procedure requiring specialized data collection and careful tuning. This process must be repeated whenever the parameters of the camera change, which can be a frequent occurrence for mobile robots and autonomous vehicles. In contrast, self-supervised depth and ego-motion estimation approaches can bypass explicit calibration by inferring per-frame projection models that optimize a view synthesis objective. In this paper, we extend this approach to explicitly calibrate a wide range of cameras from raw videos in the wild. We propose a learning algorithm to regress per-sequence calibration parameters using an efficient family of general camera models. Our procedure achieves self-calibration results with sub-pixel reprojection error, outperforming other learning-based methods. We validate our approach on a wide variety of camera geometries, including perspective, fisheye, and catadioptric. Finally, we show that our approach leads to improvements in the downstream task of depth estimation, achieving state-of-the-art results on the EuRoC dataset with greater computational efficiency than contemporary methods.
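The panorama-cropping strategy in the citation context relies on the bookkeeping between a perspective crop's horizontal field of view and the focal length it implies; here is a small hedged sketch of that relation (the cited papers' rendering pipelines are more elaborate):

```python
import math

def focal_from_fov(crop_width_px, hfov_deg):
    """Focal length in pixels of a perspective crop with the given
    horizontal field of view."""
    return 0.5 * crop_width_px / math.tan(math.radians(hfov_deg) / 2.0)

def fov_from_focal(crop_width_px, focal_px):
    """Inverse mapping: horizontal field of view implied by a focal length."""
    return math.degrees(2.0 * math.atan(0.5 * crop_width_px / focal_px))

# Hypothetical 640-px-wide crop rendered with a 60-degree field of view
f = focal_from_fov(640, 60.0)       # ~554.3 px
print(f, fov_from_focal(640, f))    # round-trips to 60 degrees
```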
... Instead, we estimate the camera directly from the RGB image. Recent work [20,31,69,81] casts this ill-posed regression problem as a classification task. However, training such methods with their losses, e.g. ...
... Single-image Camera Calibration. Recent work [20,31,69,81] directly estimates camera parameters from a single image. Zhu et al. [81] also recover the height of some scene objects, e.g. people and cars, together with the camera geometry. ...
Due to the lack of camera parameter information for in-the-wild images, existing 3D human pose and shape (HPS) estimation methods make several simplifying assumptions: weak-perspective projection, large constant focal length, and zero camera rotation. These assumptions often do not hold and we show, quantitatively and qualitatively, that they cause errors in the reconstructed 3D shape and pose. To address this, we introduce SPEC, the first in-the-wild 3D HPS method that estimates the perspective camera from a single image and employs this to reconstruct 3D human bodies more accurately. First, we train a neural network to estimate the field of view, camera pitch, and roll given an input image. We employ novel losses that improve the calibration accuracy over previous work. We then train a novel network that concatenates the camera calibration to the image features and uses these together to regress 3D body shape and pose. SPEC is more accurate than the prior art on the standard benchmark (3DPW) as well as two new datasets with more challenging camera views and varying focal lengths. Specifically, we create a new photorealistic synthetic dataset (SPEC-SYN) with ground truth 3D bodies and a novel in-the-wild dataset (SPEC-MTP) with calibration and high-quality reference bodies. Both qualitative and quantitative analysis confirm that knowing camera parameters during inference regresses better human bodies. Code and datasets are available for research purposes at https://spec.is.tue.mpg.de.
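The regression-as-classification framing mentioned in the citation contexts above typically discretizes a camera parameter such as pitch into bins and decodes a continuous value from the softmax over those bins; below is a minimal sketch under assumed bin ranges (not the specific formulation of any cited method):

```python
import numpy as np

def decode_from_bins(logits, lo=-45.0, hi=45.0):
    """Decode a continuous camera parameter (e.g. pitch in degrees) from
    per-bin classification logits via the softmax-weighted bin centres."""
    n = logits.shape[-1]
    centers = lo + (np.arange(n) + 0.5) * (hi - lo) / n
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return float((p * centers).sum(axis=-1))

# Hypothetical logits over 9 pitch bins spanning [-45, 45] degrees
print(decode_from_bins(np.array([0, 0, 1, 3, 5, 3, 1, 0, 0], dtype=float)))  # ~0.0
```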