Article

# A Taxonomy And Evaluation Of Dense Two-Frame Stereo Correspondence Algorithms


## No full-text available

... These approaches first estimate the scene geometry (or depth), then warp the input images to the target viewpoint according to the estimated geometry and blend them. Standard LF depth estimation approaches follow the stereo matching pipeline [29], which consists of feature description (extraction), cost computation, cost aggregation (or cost-volume filtering), depth regression, and post-refinement. Because of the different nature of the data, LFs provide various depth cues for feature description and cost computation, e.g., structure-tensor-based local direction estimation [30], the orthographic Hough transform for curve estimation [31], depth from correspondence [32], depth from defocus [33], [34], and depth from parallelogram cues [35]. ...
... Therefore, the DIBR network, to some extent, can be interpreted as a depth estimator. In a standard depth estimation pipeline, the initial depth map can be extracted from the cost volume using the WTA strategy [29], i.e., arg min_{d∈D} C_d. However, this strategy is not able to regress a smooth disparity estimate. ...
... For the depth estimation, these methods typically construct a cost volume that records the matching cost of each pixel along the dimension of the depth hypothesis. The depth of the scene is solved from the cost volume using the Winner-Takes-All (WTA) strategy [29], e.g., an argmin operation. Instead of focusing on solving for the depth, our idea is to render the desired LF directly from the cost volume. ...
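The WTA read-out described in these excerpts is just a per-pixel argmin over the disparity axis of the cost volume. A minimal NumPy sketch, with an illustrative random cost volume (shapes and values are our assumptions, not taken from the cited papers):

```python
import numpy as np

# Toy cost volume of shape (height, width, number of disparity hypotheses).
H, W, D = 4, 5, 8
rng = np.random.default_rng(0)
cost_volume = rng.random((H, W, D))

# Winner-Takes-All: for each pixel, keep the disparity hypothesis
# with the minimal matching cost (argmin along the hypothesis axis).
disparity = np.argmin(cost_volume, axis=2)
print(disparity.shape)  # (4, 5)
```

As the second excerpt notes, this hard argmin cannot produce smooth or sub-pixel estimates, which is why learned methods replace it with a soft, differentiable regression over the disparity hypotheses.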
Preprint
Full-text available
In this paper, we present a Geometry-aware Neural Interpolation (Geo-NI) framework for light field rendering. Previous learning-based approaches either rely on the capability of neural networks to perform direct interpolation, which we dub Neural Interpolation (NI), or explore scene geometry for novel view synthesis, also known as Depth Image-Based Rendering (DIBR). Instead, we combine the ideas behind these two kinds of approaches by launching the NI with a novel DIBR pipeline. Specifically, the proposed Geo-NI first performs NI using the input light field sheared by a set of depth hypotheses. The DIBR is then implemented by assigning the sheared light fields a novel reconstruction cost volume according to the reconstruction quality under different depth hypotheses. The reconstruction cost is interpreted as a blending weight to render the final output light field by blending the reconstructed light fields along the dimension of the depth hypothesis. By combining the strengths of NI and DIBR, the proposed Geo-NI is able to render views with large disparities with the help of scene geometry, while also reconstructing non-Lambertian effects where depth is prone to be ambiguous. Extensive experiments on various datasets demonstrate the superior performance of the proposed geometry-aware light field rendering framework.
... We preserve the conventional and usual steps, namely the partitioning of depth maps into CTUs, the partitioning of CTUs into CUs, HEVC prediction mode selection, quantization, and entropy encoding. • Side 2: EL construction 1. Contour detection of the input depth map: In [10], the authors evaluated the performance of several state-of-the-art dense two-frame stereo correspondence algorithms over three kinds of regions, i.e., textureless, occluded, and depth-discontinuity regions. ...
... Figure 6 shows the rate (kbps) / SI-SSIM (dB) trade-offs of the candidate methods. In order to obtain well-spaced curves, we convert the SI-SSIM values, which usually range from 0 to 1, to decibels: d_dB = −10 log10(1 − d), where d and d_dB are, respectively, the original and the decibel-converted SI-SSIM values. According to the curves in Fig. 6, we notice that 3S-DM achieves better SI-SSIM scores than SHVC from a contour width of 1, in contrast to the rate/PSNR performance in Fig. 5, where our method surpasses SHVC only from width 5. ...
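The decibel conversion of SI-SSIM scores used above can be sketched as a one-line helper (the function name is ours):

```python
import math

def ssim_to_db(d: float) -> float:
    """Convert an SI-SSIM score d in [0, 1) to decibels: d_dB = -10 * log10(1 - d)."""
    return -10.0 * math.log10(1.0 - d)

# A score of 0.99 (i.e., only 1% residual dissimilarity) maps to 20 dB.
print(round(ssim_to_db(0.99), 6))  # 20.0
```

The mapping stretches the interesting high-quality range: scores close to 1 become well-separated curves on a dB axis.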
Article
Full-text available
Being a scalable extension of High Efficiency Video Coding (HEVC), the Scalable High Efficiency Video Coding (SHVC) standard makes it possible to perform scalable encodings. It produces a single binary stream over several layers built from the same video at different scales of resolution, frequency, quality, pixel depth, or color dynamics. However, SHVC is dedicated to the scalable compression of conventional 2D videos whose only component is the texture image, while a compact and highly scalable representation of depth data is also required in several innovative current and future applications. Finalized in February 2015, 3D High Efficiency Video Coding (3D-HEVC) was introduced as a standard dedicated to depth map compression, but it does not allow scalable compression of depth maps. We are thus faced with 3D-HEVC, a standard adapted to depth maps but not scalable, and SHVC, a standard for scalable compression but not adapted to depth maps. In this paper, we propose a custom SHVC codec that handles signal-to-noise ratio (SNR) scalable compression of depth maps. This codec consists of limiting SNR scalability to sharp depth discontinuities and their neighborhoods. Increasing quantization parameter values are then conditionally used for the quantization of the coding units' transform coefficients as we move away from the contours. Our tailored SHVC codec, when compared to the unmodified SHVC and a 3D-HEVC-based state-of-the-art method, significantly improves the distortion vs. rate performance for benchmark depth map sequences.
... Once solved, and given the camera parameters, it can be used to estimate the 3D depth information of a scene. This is achieved by determining the disparity of matched pixels between the stereo viewpoint images [189] (using either area-based or feature-based approaches). This problem relies on the following constraint hypotheses: ...
... Noury [127] conducted metric depth estimation per micro-image. Their method is inspired by standard dense stereo matching techniques [189] but applied to micro-images from the raw plenoptic image. It relies on minimizing the dense reprojection error of the reconstructed neighboring micro-images given a depth hypothesis, following the projection model of the camera. ...
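Recovering depth from the disparity of matched pixels, as described in the first excerpt, reduces for a rectified pinhole stereo pair to the triangulation formula Z = f·B/d. A minimal sketch with illustrative numbers (the focal length in pixels and the baseline in metres are our assumptions):

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Rectified stereo triangulation: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# A point seen with 35 px of disparity by a rig with a 700 px focal
# length and a 0.5 m baseline lies 10 m away.
print(depth_from_disparity(700.0, 0.5, 35.0))  # 10.0
```

The hyperbolic relation between disparity and depth is why depth resolution degrades quadratically with distance: halving the disparity doubles the estimated depth.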
Thesis
Full-text available
This thesis investigates the use of a vision sensor called a plenoptic camera for computer vision in robotics applications. To achieve this goal, we place ourselves upstream of applications and focus on its modelization to enable robust depth estimation. Plenoptic or light-field cameras are passive imaging systems able to capture both spatial and angular information about a scene in a single exposure. These systems are usually built upon a micro-lens array (MLA) placed between a main lens and a sensor. Their design enables depth estimation from a single acquisition. The key contributions of this work lie in answering the questions "How can we link world-space information to image-space information?" and, more importantly, "How can we link image-space information to world-space information?". We address the first problem through the prism of calibration, by proposing a new camera model and a methodology to retrieve the intrinsic parameters of this model. We leverage blur information, where it was previously considered a drawback, by explicitly modeling the defocus blur. We address the second as the problem of depth estimation, by proposing a metric depth estimation framework working directly with raw plenoptic images. It takes into account both correspondence and defocus cues. Our model generalizes to various configurations, including the multi-focus plenoptic camera (in both Galilean and Keplerian configurations), as well as the single-focus and unfocused plenoptic camera. Our method gives accurate and precise depth estimates (a median relative error ranging from 1.27% to 4.75% of the distance) and outperforms state-of-the-art methods. Having a complete new camera model and enabling robust metric depth estimation from raw images only opens the door to many new applications. It is an additional step towards the practical use of plenoptic cameras in computer vision applications.
... Recent approaches [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28] addressed these challenges by carefully designing deep Convolutional Neural Network (CNN)-based models analogously to the classical matching pipeline [29], [30], namely feature extraction, cost aggregation, and flow estimation. Several works [21], [22], [25], [27], [31], [32] focused on the feature extraction stage, as it has been demonstrated that the more powerful the feature representation the model learns, the more robust the resulting matching [27], [31], [32]. ...
... Methods for semantic correspondence generally follow the classical matching pipeline [29], [30], including feature extraction, cost aggregation, and flow estimation. Most early efforts [6], [14], [46] leveraged hand-crafted features, which are inherently limited in capturing high-level semantics. ...
Preprint
Full-text available
Cost aggregation is a highly important process in image matching tasks, which aims to disambiguate noisy matching scores. Existing methods generally tackle this with hand-crafted or CNN-based techniques, which either lack robustness to severe deformations or inherit the limitation of CNNs that fail to discriminate incorrect matches due to limited receptive fields and inadaptability. In this paper, we introduce Cost Aggregation with Transformers (CATs) to tackle this by exploring global consensus among the initial correlation maps with the help of architectural designs that allow us to fully exploit the global receptive fields of the self-attention mechanism. Also, to alleviate some of the limitations that CATs may face, i.e., the high computational cost induced by the use of a standard transformer, whose complexity grows with the spatial and feature dimensions, restricting its applicability to limited resolutions and resulting in rather limited performance, we propose CATs++, an extension of CATs. Our proposed methods outperform the previous state-of-the-art methods by large margins, setting a new state of the art for all the benchmarks, including PF-WILLOW, PF-PASCAL, and SPair-71k. We further provide extensive ablation studies and analyses.
... Estimating depth from stereo image pairs is one of the most fundamental tasks in computer vision (Scharstein and Szeliski, 2002). This task is vital for many applications, such as 3D reconstruction (Geiger et al., 2011), robot navigation and control (Song et al., 2013), object detection and recognition (Chen et al., 2015). ...
... Using mathematical foundations around the BRDF, we demonstrate how to compute an irradiance image (Scharstein and Szeliski, 2002) when the left (a) and right (c) stereo images have different lighting conditions. (b) and (d) are the illumination-invariant irradiance images corresponding to (a) and (c), computed using our radiometric difference removal approach. ...
... A traditional camera projects a 3D scene onto a 2D sensor frame and necessarily loses the information along the third axis z. A stereoscopic camera can recover this component by triangulation [SSZ01, BBH03], as shown in Figure 1.3. A camera array can also emulate a higher-resolution sensor by aggregating the pixels from the different sensors. ...
... Source: [ŽL16] a parameterizable [model] can recognize similarities between extremely different images. 2.5D reconstruction (depth maps) uses these same principles but with a higher correspondence density. The usual algorithms consider, for example, various metrics to measure similarities, and hence disparities, such as entropy [Hir07], optical flow [BMK13], NCC [HLL11], SAD, or SSD [SSZ01], to identify the best correspondences between two neighboring cameras. A CNN uses its own metric to best minimize the gap between the ground truth and what it is given as input. ...
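The SAD, SSD, and NCC similarity metrics mentioned in the excerpt are easy to state concretely. A minimal sketch on two flattened patches (the patch values are illustrative):

```python
import math

def sad(a, b):
    """Sum of absolute differences: lower means more similar."""
    return sum(abs(x - y) for x, y in zip(a, b))

def ssd(a, b):
    """Sum of squared differences: lower means more similar."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ncc(a, b):
    """Normalized cross-correlation in [-1, 1]: invariant to gain/offset changes."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den

left_patch = [10, 20, 30, 40]
right_patch = [12, 22, 32, 42]   # same patch, 2 intensity levels brighter
print(sad(left_patch, right_patch), ssd(left_patch, right_patch))  # 8 16
```

Note how NCC returns 1.0 here despite the brightness offset, which is why it is preferred over SAD/SSD when the two views have different radiometric conditions.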
Thesis
Vision systems are an integral part of our society and continue to fuel many lines of research and development. Multi-view systems extend the range of vision applications through their ability to capture a scene from different viewing angles. Stereoscopic vision and 3D or panoramic reconstruction are concrete examples of multi-view applications and are already massively used in industry and by the general public. For example, some film productions use special effects based on dozens of synchronized cameras, and observation satellites favor camera arrays to create high-resolution images. This versatility, and the diversity or redundancy of visual information, can be exploited in many fields, in particular for artificial intelligence applied to vision. Recent research has shown that classical vision algorithms have become obsolete in the face of convolutional neural networks (CNNs). However, CNNs remain extremely computationally expensive and complex to implement on low-power electronics. In theory, reprogrammable circuits (FPGAs) have significant potential for inferring a convolutional network as close as possible to an image sensor, with a dataflow-oriented computation model guaranteeing both limited power consumption and high execution speed. In practice, they remain limited by their low density of logic elements and by other factors inherent to this technology. Consequently, a single FPGA can only support a small part of a network. In this thesis, we propose a multi-view hardware architecture to increase the integration factor of a network on this type of hardware target.
Indeed, even in its simplest form, a multi-view convolutional network architecture has a better recognition capability than its single-view counterpart. This fundamental principle raises the question of the balance between the quantitative and qualitative factors of a network, that is, the density of parameters defining the complexity of a CNN and its level of recognition performance. This work demonstrates experimentally that the multi-view nature makes it possible to lower the number of parameters of a network and, consequently, its complexity. An embedded camera prototype is proposed to test this hypothesis in a realistic context. It is split into two parts. The first consists of several reprogrammable camera heads capable of inferring part of a CNN while sustaining the throughput of an image sensor. These are driven and powered by the second part, composed of a central FPGA that aggregates the data streams from each sensor and executes the rest of the neural network. Among other things, this camera can synchronize each shot, or add per-shot delays, with a time base on the order of ten nanoseconds.
... However, these methods improved the accuracy of disparity estimation at the expense of increased computational complexity, making them more time-consuming. According to evaluations on popular stereo datasets (the Middlebury benchmark [19] and the KITTI benchmark [14]), most stereo matching algorithms proposed in the last three years are still far from real-time performance, even when implemented on the most powerful hardware, such as an NVIDIA GTX Titan X GPU [10, 12]. Furthermore, whether based on conventional stereo matching or on CNNs, binocular stereo matching always struggles with occluded and low-texture regions. ...
... Because there are no stereo image triplets in the Middlebury [19] and KITTI datasets [5, 14], which are popular in most research papers in this field, we use CARLA [2], an open-source simulator for autonomous driving research, to generate a trinocular street-view dataset. The dataset contains dozens of stereo image triplets at 1280×720 resolution with complete ground truth. ...
Article
Full-text available
The huge computational complexity, occlusion, and low-texture-region problems make stereo matching a big challenge. In this work, we use a multi-baseline trinocular camera model to study how to accelerate stereo matching algorithms and improve the accuracy of disparity estimation. A special scheme named the trinocular dynamic disparity range (T-DDR) was designed to accelerate stereo matching algorithms. In this scheme, we optimize the matching cost calculation, cost aggregation, and disparity computation steps by narrowing the disparity search range. Meanwhile, we designed another novel scheme, called the trinocular disparity confidence measure (T-DCM), to improve the accuracy of the disparity map. Based on these, we propose the semi-global matching with T-DDR (T-DDR-SGM) and with T-DCM (T-DCM-SGM) algorithms for trinocular stereo matching. According to the evaluation results, T-DDR-SGM not only significantly reduces the computational complexity but also slightly improves the accuracy, while T-DCM-SGM handles the occlusion and low-texture-region problems very well; both achieve better results. Moreover, the optimization schemes we designed can be extended beyond SGM to other stereo matching algorithms that possess pixel-wise matching cost calculation and aggregation steps. We show that the proposed optimization methods for trinocular stereo matching are effective and that trinocular stereo matching is useful for either improving accuracy or reducing computational complexity.
... Here we discuss a few popular and recent methods that are closely related to our approach. Interested readers are referred to recent survey papers such as Scharstein and Szeliski (2002) and Janai et al. (2017). ...
... (2) sum up the costs over a window; (3) select the disparity that has the minimal cost; (4) perform a series of post-processing steps to refine the final results. Local methods (Scharstein and Szeliski, 2002; Weber et al., 2009) have the advantage of speed. Since each cost within a window can be computed independently, these methods are highly parallelizable. ...
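The local pipeline listed above can be sketched end-to-end, minus the post-processing step, in a few lines of NumPy; the SAD cost, the box window, and the synthetic test images are our own illustrative choices:

```python
import numpy as np

def block_match(left, right, max_disp, radius=1):
    """Local stereo sketch: (1) per-pixel SAD cost, (2) box-window
    aggregation, (3) Winner-Takes-All disparity selection.
    Step (4), post-processing, is omitted for brevity."""
    H, W = left.shape
    cost = np.full((H, W, max_disp + 1), np.inf)
    k = 2 * radius + 1
    for d in range(max_disp + 1):
        diff = np.abs(left[:, d:] - right[:, :W - d])        # step (1)
        pad = np.pad(diff, radius, mode="edge")
        agg = np.zeros_like(diff)
        for dy in range(k):                                  # step (2)
            for dx in range(k):
                agg += pad[dy:dy + diff.shape[0], dx:dx + diff.shape[1]]
        cost[:, d:, d] = agg
    return np.argmin(cost, axis=2)                           # step (3)

# Shifting an image horizontally by 3 px and matching against the
# original should recover a constant disparity of 3.
rng = np.random.default_rng(1)
right_img = rng.random((8, 32))
true_d = 3
left_img = np.empty_like(right_img)
left_img[:, true_d:] = right_img[:, :-true_d]
left_img[:, :true_d] = right_img[:, :true_d]
disp = block_match(left_img, right_img, max_disp=5)
print(int(np.median(disp[:, true_d:])))  # 3
```

Each disparity slice is independent of the others, which is exactly the property the excerpt highlights: local methods parallelize trivially over both pixels and hypotheses.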
Article
Full-text available
Although deep learning-based methods have dominated stereo matching leaderboards by yielding unprecedented disparity accuracy, their inference time is typically slow, i.e., less than 4 FPS for a pair of 540p images. The main reason is that the leading methods employ time-consuming 3D convolutions applied to a 4D feature volume. A common way to speed up the computation is to downsample the feature volume, but this loses high-frequency details. To overcome these challenges, we propose a displacement-invariant cost computation module to compute the matching costs without needing a 4D feature volume. Rather, costs are computed by applying the same 2D convolution network on each disparity-shifted feature map pair independently. Unlike previous 2D convolution-based methods that simply perform context mapping between inputs and disparity maps, our proposed approach learns to match features between the two images. We also propose an entropy-based refinement strategy to refine the computed disparity map, which further improves the speed by avoiding the need to compute a second disparity map on the right image. Extensive experiments on standard datasets (SceneFlow, KITTI, ETH3D, and Middlebury) demonstrate that our method achieves competitive accuracy with much less inference time. On typical image sizes (e.g., 540×960), our method processes over 100 FPS on a desktop GPU, making our method suitable for time-critical applications such as autonomous driving. We also show that our approach generalizes well to unseen datasets, outperforming 4D-volumetric methods. We will release the source code to ensure reproducibility.
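The displacement-invariant idea described in the abstract (one shared 2D function applied to every disparity-shifted feature pair, instead of 3D convolutions over a 4D volume) can be sketched as follows; the shared "network" is stubbed out here by a per-pixel absolute difference, purely for illustration:

```python
import numpy as np

def shift_features(feat, d):
    """Shift a (H, W) feature map right by d pixels, zero-padding the border."""
    if d == 0:
        return feat.copy()
    out = np.zeros_like(feat)
    out[:, d:] = feat[:, :-d]
    return out

def shared_cost(f_left, f_right):
    # Stand-in for the shared 2D convolution network: the SAME function
    # is reused for every disparity hypothesis, hence "displacement-invariant".
    return np.abs(f_left - f_right)

feat_left = np.random.default_rng(2).random((4, 6))
# Matching the left features against shifted copies of themselves:
# the d = 0 slice is a perfect match (zero cost).
cost_volume = np.stack(
    [shared_cost(feat_left, shift_features(feat_left, d)) for d in range(3)],
    axis=-1,
)
print(cost_volume.shape)  # (4, 6, 3)
```

Because the per-hypothesis computation is a plain 2D operation, the disparity loop can be batched, avoiding the memory and time cost of a true 4D volume with 3D convolutions.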
... Then, all the estimated data (the projection-model parameters of the two cameras, their relative pose, and the poses of the checkerboards) are refined by jointly minimizing the projection error on the images of both cameras. [Scharstein 2002] distinguishes two main families of algorithms for computing disparity maps: local and global methods. ...
... On the other hand, the boundaries between objects, which lead to disparity discontinuities, tend to be poorly estimated. To avoid this phenomenon, some smoothing terms are defined so as to preserve edges [Scharstein 2002]. ...
Thesis
The research carried out during this doctoral thesis is part of the activities of the OPERA joint laboratory (OPtique EmbaRquée Active) involving ESSILOR-LUXOTTICA and the CNRS. The objective is to contribute to the development of the "glasses of the future", integrating dimming, focusing, or display functions that continuously adapt to the scene and to the user's gaze. These new devices will have to be endowed with perception, decision, and action capabilities, and will have to respect constraints on size, weight, power consumption, and processing time. They therefore have obvious connections with robotics. In this context, the research consisted of investigating the structure and construction of such systems in order to identify their challenges and difficulties. To do so, the first task was to set up emulators of various types of active glasses, which make it possible to prototype and evaluate various functions efficiently. In this prototyping and testing phase, these emulators naturally rely on a modular software architecture typical of robotics. The second part of the thesis focused on prototyping a key component of the glasses of the future, which involves an additional low-power constraint: the gaze-tracking system, also called an eye tracker. The principle of an assembly of photodiodes combined with neural-network processing was proposed. A simulator was developed, along with a study of the influence of the photodiode layout and of the network's hyper-parameterization on the performance of the eye tracker.
... The functional model of the method presented in this work is defined as a BNN and is based on two CNN architectures presented in the literature: the Geometry and Context Network (GC-Net) and the Cost Volume Analysis Network (CVA-Net) proposed by Mehltretter and Heipke (2021). GC-Net is a dense stereo matching approach that follows the classical taxonomy of Scharstein and Szeliski (2002): first, features are extracted from the left and right images using a Siamese architecture consisting of multiple 2D convolutional layers with residual connections. In the second step, a cost volume is built by concatenating a feature vector from the left image with a feature vector from the right image for all potential point correspondences, defined by the corresponding horizontal epipolar line and the specified disparity range. ...
... According to the definition of this binary classification discussed earlier, c is defined as c = ¬o ∧ ¬t, where o specifies whether the correspondence in the second image is occluded and t whether the pixel in the reference image is located in a weakly textured area. While t is determined directly from the reference image using the criterion specified by Scharstein and Szeliski (2002), o is predicted by the CVA-Net branch of the proposed network in addition to the log standard deviation. In order to optimise the capability of predicting whether a pixel's correspondence is occluded or not, the loss function is extended by a binary cross-entropy term h, minimising the difference between the predicted and reference occlusion values ô and o. ...
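The binary label above, c = ¬o ∧ ¬t, simply declares a pixel usable only when its match is neither occluded nor weakly textured; as a tiny sketch (the function name is ours):

```python
def is_valid(occluded: bool, textureless: bool) -> bool:
    """c = (not o) and (not t): True only for pixels whose correspondence
    is unoccluded and which lie in a sufficiently textured area."""
    return (not occluded) and (not textureless)

# Truth table over the four (o, t) combinations.
print([is_valid(o, t) for o in (False, True) for t in (False, True)])
# [True, False, False, False]
```

Only one of the four (o, t) combinations yields a valid pixel, which is why the two failure modes are handled by separate mechanisms in the cited work (a texture criterion and a learned occlusion predictor).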
Article
Full-text available
The necessity to identify errors in the context of image-based 3D reconstruction has motivated the development of various methods for the estimation of uncertainty associated with depth estimates in recent years. Most of these methods exclusively estimate aleatoric uncertainty, which describes stochastic effects. On the other hand, epistemic uncertainty, which accounts for simplifications or incorrect assumptions with respect to the formulated model hypothesis, is often neglected. However, to accurately quantify the uncertainty inherent in a process, it is necessary to consider all potential sources of uncertainty and to model their stochastic behaviour appropriately. To approach this objective, a holistic method to jointly estimate disparity and uncertainty is presented in this work, taking into account both aleatoric and epistemic uncertainty. For this purpose, the proposed method is based on a Bayesian Neural Network, which is trained with variational inference using a probabilistic loss formulation. To evaluate the performance of the proposed method, extensive experiments are carried out on three datasets considering real-world indoor and outdoor scenes. The results of these experiments demonstrate that the proposed method is able to estimate the uncertainty accurately, while showing a similar and, for some scenarios, improved depth estimation capability compared to the dense stereo matching approach used as a deterministic baseline. Moreover, the evaluation reveals the importance of considering both aleatoric and epistemic uncertainty in order to achieve an accurate estimation of the overall uncertainty related to a depth estimate.
... In theory, 3D information can only be inferred from two or more captures of the scene, as in typical multi-view stereo [142] or structure-from-motion [117] approaches. However, recent methods are exploring machine learning to perform single-image depth inference [69, 96, 140]. Most techniques developed so far rely on traditional perspective/pinhole-based cameras, which have a narrow field of view (FoV) and might require thousands of captures to model large scenes [3]. ...
... We believe that the problems of dataset availability and definition of standardized quantitative assessment metrics are even more challenging for the stereo- and multi-view-based scenarios, since they require more than one capture of the scene and the ground-truth annotations must refer to a reference view. However, the existence of standardized benchmarks for stereo-related problems in the pinhole case, such as Middlebury [140], can inspire the creation of counterparts to deal with panoramas. ...
Article
This paper provides a comprehensive survey on pioneer and state-of-the-art 3D scene geometry estimation methodologies based on single, two, or multiple images captured under omnidirectional optics. We first revisit the basic concepts of the spherical camera model and review the most common acquisition technologies and representation formats suitable for omnidirectional (also called 360°, spherical or panoramic) images and videos. We then survey monocular layout and depth inference approaches, highlighting the recent advances in learning-based solutions suited for spherical data. The classical stereo matching is then revised on the spherical domain, where methodologies for detecting and describing sparse and dense features become crucial. The stereo matching concepts are then extrapolated for multiple view camera setups, categorizing them among light fields, multi-view stereo, and structure from motion (or visual simultaneous localization and mapping). We also compile and discuss commonly adopted datasets and figures of merit indicated for each purpose and list recent results for completeness. We conclude this paper by pointing out current and future trends.
... Following photo alignment, intrinsic and extrinsic orientation parameters were computed using the location where each photograph was captured (Javernick et al., 2014). The output was a sparse point cloud, which was then processed using a Multi-View Stereo (MVS; Scharstein & Szeliski, 2002) algorithm to increase the reliability of the model. The SfM algorithm relied on distinctive feature points to obtain spatial information, whereas MVS matches the values of individual pixels of the previously aligned photographs. ...
... The SfM algorithm relied on distinctive feature points to obtain spatial information, whereas MVS matches the values of individual pixels of the previously aligned photographs. This step can improve the resolution of the model by up to three orders of magnitude (Scharstein & Szeliski, 2002). Next, a triangular mesh was generated from the dense point cloud, revealing the 3D relief of the trace. ...
... A good review of stereo vision algorithms is presented in (7). Since this publication, the authors have provided test-bed data and maintained a website (8) that compares the results of stereo correspondence algorithms. ...
... A review of worldwide UAS regulations is presented by Stöcker et al. (2017). Concurrently, advances in computer vision algorithms (Anderson et al. 2019) and hardware (e.g., GPUs) facilitated the use of photogrammetry through Structure from Motion (SfM) (Ullman 1979) and Multi-View Stereo reconstruction (MVS) methods (Scharstein and Szeliski 2002). Today, SfM-MVS methods represent a core data capture and analysis approach that is widely used (Anderson et al. 2019; James et al. 2019). ...
Article
Full-text available
Low-altitude high-resolution aerial photographs allow for the reconstruction of structural properties of shallow coral reefs and the quantification of their topographic complexity. This study shows the scope and limitations of the two-media (air/water) Structure from Motion–Multi-View Stereo reconstruction method using drone aerial photographs to reconstruct coral height. We apply this method at nine different sites covering a total area of about 7000 m², and we examine the suitability of the method for obtaining topographic complexity estimates (i.e., seafloor rugosity). A simple refraction correction and survey design allowed us to reach a root mean square error of 0.1 m for the generated digital models of the seafloor (without the refraction correction, the root mean square error was 0.2 m). We find that the complexity of the seafloor extracted from the drone digital models is slightly underestimated compared to the one measured with a traditional in situ survey method.
... To push forward the performance of visual recognition, several real-scene datasets have been developed, e.g., Middlebury [122] for stereo and [123] for optical flow. KITTI [117] is a dataset originally proposed for stereo, optical flow, visual odometry, and 3D object detection tasks. ...
Thesis
In multi-view capture, the focus of attention can be controlled by the viewers rather than by a director, which implies that each viewer can observe a unique point of view. This requires placing cameras around the scene to be captured, which can be very expensive. Generating virtual cameras to replace part of the real cameras in the scene reduces the cost of setting up multi-view video. This thesis focuses on generating virtual video transitions in scenes captured by multi-view video, to virtually move from one real viewpoint to another in the same scene. The fewer real cameras we use, the less expensive the multi-view setup is; however, the larger the baseline becomes. View synthesis methods have attracted our attention as an approach to this problem. However, in the literature, these methods still suffer from visual artifacts in the final rendered image due to occlusions in the new target virtual view. As a first step, we propose a hybrid approach to view synthesis. We first warp the reference views while correcting the occlusions, and we merge the pre-processed views via a simple convolutional architecture. Warping the reference views reduces the distance between the reference views and the size of the convolutional filters, and thus reduces the complexity of the network. Next, we present a second hybrid approach: we merge the pre-warped views via a residual encoder-decoder with a Siamese encoder to keep the parameter count low. We also propose a hole-inpainting algorithm to fill in disocclusions in the warped views. In addition, we focus on the quality of user experience for the video transition and on the database. First, we build a creative dataset for the quality of experience of the video transition. Second, we propose an algorithmic-learning-based multiple-view-synthesis optimizer. The work subjectively evaluates the proposed view synthesis approaches on 8 different video sequences through a series of subjective tests.
... The most common descriptors used for stereo matching are intensity-based (SAD, NCC) [33], binary (census, DAISY) [29], non-parametric (rank transform), or other custom ones [6]. An aggregation step improves robustness against noise and surfaces that are not fronto-parallel [29]. ...
Conference Paper
Full-text available
Depth estimation approaches are crucial for environment perception in applications like autonomous driving or driving assistance systems. Camera-based solutions have always been preferred to other depth estimation methods due to low sensor prices and their ability to extract rich semantic information from the scene. Monocular depth estimation algorithms using CNNs may fail to reconstruct scenes whose geometric properties were not present during the training stage. Furthermore, stereo reconstruction methods may also fail to reconstruct some regions for various other reasons, such as repetitive surfaces, untextured areas, or solar flares, to name a few. To mitigate the reconstruction issues that may appear, in this paper we propose two refinement approaches that eliminate regions which are not correctly reconstructed. Moreover, we propose an original architecture for combining the mono and stereo results in order to obtain improved disparity maps. The proposed solution is designed to be fault tolerant, such that if an image is not correctly acquired or is corrupted, the system is still able to reconstruct the environment. The proposed approach has been tested on the KITTI dataset in order to illustrate its performance.
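The intensity-based (SAD) and binary (census) matching costs mentioned in the descriptor excerpt above can be illustrated with a minimal NumPy sketch. The function names, window size, and circular-shift handling of image borders are illustrative assumptions, not taken from any of the cited papers:

```python
import numpy as np

def sad_cost(left, right, d):
    """Sum-of-absolute-differences cost: left image vs. right image shifted by disparity d."""
    shifted = np.roll(right, d, axis=1)
    return np.abs(left.astype(np.int32) - shifted.astype(np.int32))

def census_transform(img, win=3):
    """Binary census descriptor: each pixel is encoded by brightness comparisons
    against its (win x win) neighbourhood, one bit per neighbour."""
    r = win // 2
    h, w = img.shape
    desc = np.zeros((h, w), dtype=np.uint16)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            desc = (desc << 1) | (shifted < img)
    return desc

def hamming_cost(census_left, census_right, d):
    """Matching cost between census descriptors: the Hamming distance of the bit strings."""
    xor = census_left ^ np.roll(census_right, d, axis=1)
    # popcount of each uint16 by unpacking its two bytes into bits
    return np.unpackbits(xor.view(np.uint8).reshape(*xor.shape, 2), axis=-1).sum(-1)
```

A full matcher would evaluate such costs over a range of disparities and aggregate them over a support window before selecting the best match.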
... Depth data naturally conveys geometric information, therefore understanding depth computation, its data characteristics and its failure modes are highly pertinent. [27] outlined four steps commonly encountered in classical stereo image pipelines. Despite representational advances via Deep Learning, these steps continue to play a key role [37]. ...
Preprint
Full-text available
The estimation of depth cues from a single image has recently emerged as an appealing alternative to depth estimation from stereo image pairs. The easy availability of these dense depth cues naturally triggers research questions, how depth images can be used to infer geometric object and view attributes. Furthermore, the question arises how the quality of the estimated depth data compares between different sensing modalities, especially given the fact that monocular methods rely on a learned correlation between local appearance and depth, without the notion of a metric scale. Further motivated by the ease of synthetic data generation, we propose depth computation on synthetic images as a training step for 3D pose estimation of rigid objects, applying models on real images and thus also demonstrating a reduced synth-to-real gap. To characterize depth data qualities, we present a comparative evaluation involving two monocular and one stereo depth estimation schemes. We furthermore propose a novel and simple two-step depth-ground-truth generation workflow for a quantitative comparison. The presented data generation, evaluation and exemplary pose estimation pipeline are generic and applicable to more complex geometries.
... Furthermore, to determine the potential generalization ability of the model, following the same setting as (Kim et al. 2021), we directly test the model trained on the NYU v2 dataset on the additional Middlebury dataset (Scharstein and Pal 2007) and the Lu dataset (Lu et al. 2014), which are two further benchmark datasets for evaluating the performance of depth-image SR algorithms. The Middlebury dataset contains 30 RGB-D image pairs, of which 21 pairs are from 2001 (Scharstein and Szeliski 2002) and 9 pairs are from 2006 (Hirschmuller and Scharstein 2007). The Lu dataset contains 6 RGB-D image pairs. ...
Preprint
Full-text available
Guided image super-resolution (GISR) aims to obtain a high-resolution (HR) target image by enhancing the spatial resolution of a low-resolution (LR) target image under the guidance of an HR image. However, previous model-based methods mainly take the entire image as a whole and assume a prior distribution between the HR target image and the HR guidance image, simply ignoring many non-local common characteristics between them. To alleviate this issue, we first propose a maximum a posteriori (MAP) estimation model for GISR with two types of prior on the HR target image, i.e., a local implicit prior and a global implicit prior. The local implicit prior aims to model the complex relationship between the HR target image and the HR guidance image from a local perspective, and the global implicit prior considers the non-local auto-regression property between the two images from a global perspective. Second, we design a novel alternating optimization algorithm to solve this model for GISR. The algorithm is in a concise framework that facilitates its replication in commonly used deep network structures. Third, to reduce the information loss across iterative stages, a persistent memory mechanism is introduced to augment the information representation by exploiting long short-term memory (LSTM) units in the image and feature spaces. In this way, a deep network with a certain degree of interpretability and high representation ability is built. Extensive experimental results validate the superiority of our method on a variety of GISR tasks, including pan-sharpening, depth image super-resolution, and MR image super-resolution.
... Depth data naturally conveys geometric information, therefore understanding depth computation, its data characteristics and its failure modes are highly pertinent. [9] outlined four steps commonly encountered in classical stereo image pipelines. Despite representational advances via Deep Learning, these steps continue to play a key role [10]. ...
Preprint
Full-text available
Mobile robot operations are becoming increasingly sophisticated in terms of robust environment perception and levels of automation. However, exploiting the great representational power of data-hungry learned representations is not straightforward, as robotic tasks typically target diverse scenarios with different sets of objects. Learning specific attributes of frequently occurring object categories, such as pedestrians and vehicles, is feasible since labeled datasets are plentiful. On the other hand, less common object categories call for use-case-specific data acquisition and labelling campaigns, resulting in efforts which are not sustainable with a growing number of scenarios. In this paper we propose a structure-aware learning scheme, which represents geometric cues of specific functional objects (an airport loading ramp) in a highly invariant manner, permitting learning solely from synthetic data and also leading to a great degree of generalization in real scenarios. In our experiments we employ monocular depth estimation to generate depth and surface normal data in order to express geometric traits instead of appearance. Using the surface normals, we explore two different representations to learn structural elements of the ramp object and decode its 3D pose: as a set of key-points and as a set of 3D bounding boxes. Results are demonstrated and validated in a series of robotic transportation tasks, where the different representations are compared in terms of recognition and metric-space accuracy. The proposed learning scheme can also be easily applied to recognize arbitrary man-made functional objects (e.g. containers, tools) with and without known dimensions.
... Depth estimation algorithms also play an important role in semantic segmentation algorithms [1,2]. To estimate depth maps, a wide range of stereo matching techniques have been proposed and implemented, and Scharstein and Szeliski present an in-depth analysis of these techniques [3]. In recent years, the introduction of light field images has made it possible to generate images at different focal lengths, extend the depth of field, and estimate scene depth from a single image capture. ...
Article
Full-text available
Depth estimation for light field images is essential for applications such as light field image compression, reconstruction of perspective views, and 3D reconstruction. Previous depth map estimation approaches do not capture sharp transitions around object boundaries due to occlusions, making many of the current approaches unreliable at depth discontinuities. This is especially the case for light field images, because the pixels do not exhibit photo-consistency in the presence of occlusions. In this paper, we propose an algorithm to estimate the depth map for light field images using depth from defocus. Our approach uses a small patch of pixels in each focal stack image for comparing defocus cues, allowing the algorithm to generate sharper depth boundaries. Then, in contrast to existing approaches that use defocus cues for depth estimation, we use frequency-domain image-similarity analysis to generate the depth map. Processing in the frequency domain reduces the individual pixel errors that occur when directly comparing RGB images, making the algorithm more resilient to noise. The algorithm has been evaluated on both a synthetic image dataset and real-world images in the JPEG dataset. Experimental results demonstrate that our proposed algorithm outperforms state-of-the-art depth estimation techniques for light field images, particularly in the case of noisy images.
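One plausible reading of the frequency-domain similarity check described in the abstract above is to compare the magnitude spectra of corresponding patches, which suppresses per-pixel noise while preserving the dominant spatial frequencies. This is a sketch under that assumption, not the authors' actual implementation; the function names are hypothetical:

```python
import numpy as np

def spectral_distance(patch_a, patch_b):
    """Compare two image patches in the frequency domain.

    Comparing magnitude spectra rather than raw pixel values reduces the
    influence of individual noisy pixels, which mostly perturb single
    intensities while leaving the patch's dominant frequencies intact.
    """
    fa = np.abs(np.fft.fft2(patch_a.astype(np.float64)))
    fb = np.abs(np.fft.fft2(patch_b.astype(np.float64)))
    return np.linalg.norm(fa - fb)

def best_focus_index(stack_patches, ref_patch):
    """Index of the focal-stack slice whose patch best matches the reference,
    i.e. the depth hypothesis with the smallest spectral distance."""
    return int(np.argmin([spectral_distance(p, ref_patch) for p in stack_patches]))
```

In a depth-from-defocus setting, the winning focal-stack index per patch would then be mapped to a depth value.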
... To estimate the error of the disparity map, we use the method of computing the error percentage of bad matching pixels (B) of the disparity map with respect to the ground-truth image available with the Middlebury dataset, as given by the following [7]: B = (1/N) Σ_(x,y) ( |d_C(x,y) - d_T(x,y)| > δ_d ), where d_T are the ground-truth disparities, d_C are the computed disparities, and δ_d denotes the disparity error tolerance, which is taken as 95 in this work according to the scale of the taken images. Also, N represents the window size, which is taken as 9x9 pixels in this method, as this gives the best results. ...
Article
Full-text available
This work presents a proposed approach to extract the depth map of stereoscopic images based on segmentation of the lightness values of pixels 'L' using adaptive K-Means clustering. The proposed approach finds the disparity map of the segmented lightness pixels and refines those segments using morphological filtering and connected-component analysis. Experimental results on the Middlebury dataset show that the proposed approach performs well in terms of depth accuracy and time consumption compared with the classical SAD approach and the SAD-with-GRAD algorithm.
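The bad-matching-pixels statistic referenced above, B = (1/N) Σ (|d_C - d_T| > δ_d) as defined in the Scharstein-Szeliski taxonomy, is straightforward to compute. This sketch assumes NumPy disparity arrays of equal shape; the function name and default tolerance are illustrative:

```python
import numpy as np

def bad_pixel_percentage(d_computed, d_truth, delta_d=1.0):
    """Percentage of bad matching pixels:
    B = (1/N) * sum( |d_C(x,y) - d_T(x,y)| > delta_d ),
    where N is the number of pixels and delta_d the disparity error tolerance."""
    bad = np.abs(d_computed.astype(np.float64) - d_truth.astype(np.float64)) > delta_d
    return 100.0 * bad.mean()
```

With a tolerance of 1 disparity level, a map in which one pixel in four exceeds the tolerance scores B = 25%.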
... Therefore, accurate initial guesses obtained by integer-pixel subset correlation methods are critical to ensure rapid convergence [205] and reduce the computational cost [206]. In stereovision, the matching algorithms can be classified as local [207][208][209], semi-global [210], and global [211] methods. Local matching methods utilize the intensity information of a local subset centered at the pixel to be matched. ...
Article
Full-text available
With advances in scientific foundations and technological implementations, optical metrology has become a versatile problem-solving backbone in manufacturing, fundamental research, and engineering applications, such as quality control, nondestructive testing, experimental mechanics, and biomedicine. In recent years, deep learning, a subfield of machine learning, has emerged as a powerful tool to address problems by learning from data, largely driven by the availability of massive datasets, enhanced computational power, fast data storage, and novel training algorithms for deep neural networks. It is currently attracting increasing interest and gaining extensive attention for its utilization in the field of optical metrology. Unlike the traditional “physics-based” approach, deep-learning-enabled optical metrology is a kind of “data-driven” approach, which has already provided numerous alternative solutions to many challenging problems in this field with better performance. In this review, we present an overview of the current status and the latest progress of deep-learning technologies in the field of optical metrology. We first briefly introduce both traditional image-processing algorithms in optical metrology and the basic concepts of deep learning, followed by a comprehensive review of its applications in various optical metrology tasks, such as fringe denoising, phase retrieval, phase unwrapping, subset correlation, and error compensation. The open challenges faced by the current deep-learning approach in optical metrology are then discussed. Finally, directions for future research are outlined.
... DIM is usually conducted in four steps: cost computation, cost aggregation, disparity computation, and disparity refinement (Scharstein and Szeliski, 2002). After LiDAR outlier removal in Subsection 3.1, the LGSM optimizes the cost volume of LiDAR projection points and homogeneous pixels without changing the traditional DIM pipeline. ...
Preprint
Full-text available
The complementary fusion of light detection and ranging (LiDAR) data and image data is a promising but challenging task for generating high-precision and high-density point clouds. This study proposes an innovative approach called LiDAR-guided stereo matching (LGSM), which considers the spatial consistency represented by continuous disparity or depth changes in the homogeneous regions of an image. The LGSM first detects the homogeneous pixels of each LiDAR projection point based on their color or intensity similarity. Next, we propose a riverbed enhancement function to optimize the cost volume of the LiDAR projection points and their homogeneous pixels to improve the matching robustness. Our formulation expands the constraint scope of sparse LiDAR projection points with the guidance of image information to optimize the cost volume of as many pixels as possible. We applied LGSM to semi-global matching and AD-Census on both simulated and real datasets. When the percentage of LiDAR points in the simulated datasets was 0.16%, the matching accuracy of our method achieved the subpixel level, while that of the original stereo matching algorithm was 3.4 pixels. The experimental results show that LGSM is suitable for indoor, street, aerial, and satellite image datasets and provides good transferability across semi-global matching and AD-Census. Furthermore, the qualitative and quantitative evaluations demonstrate that LGSM is superior to two state-of-the-art cost-volume optimization methods, especially in reducing mismatches in difficult matching areas and refining the boundaries of objects.
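The four-step dense image matching pipeline referenced in the excerpt above (cost computation, cost aggregation, disparity computation, disparity refinement) can be sketched as a minimal local block matcher. The absolute-difference cost, box aggregation via shifted sums, winner-takes-all argmin, and neighbour-averaging refinement below are deliberately simplistic stand-ins for the components surveyed in the taxonomy, and the circular image-shift border handling is an assumption for brevity:

```python
import numpy as np

def stereo_wta(left, right, max_disp, win=2):
    """Minimal four-step local stereo pipeline on rectified grayscale images."""
    h, w = left.shape
    L = left.astype(np.float64)
    R = right.astype(np.float64)
    # 1) matching cost computation: cost[d] = |L(x, y) - R(x - d, y)|
    volume = np.stack([np.abs(L - np.roll(R, d, axis=1)) for d in range(max_disp)])
    # 2) cost aggregation: box filter each cost slice over a (2*win+1)^2 window
    agg = np.zeros_like(volume)
    for dy in range(-win, win + 1):
        for dx in range(-win, win + 1):
            agg += np.roll(np.roll(volume, dy, axis=1), dx, axis=2)
    # 3) disparity computation: winner-takes-all argmin over the disparity axis
    disp = np.argmin(agg, axis=0).astype(np.float64)
    # 4) disparity refinement: crude smoothing over horizontal neighbours
    disp = (np.roll(disp, 1, axis=1) + disp + np.roll(disp, -1, axis=1)) / 3.0
    return disp
```

On a synthetic pair where the right image is the left image shifted by two pixels, this sketch recovers a constant disparity of 2.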
... Traditional stereo matching methods usually consist of four steps: initial matching cost calculation, matching cost aggregation, disparity prediction, and post-processing. These can be categorized into global and local algorithms [3]. Local strategies use fixed or adaptive measurement windows to calculate the preliminary cost. ...
Article
Full-text available
Recent stereo matching methods, especially end-to-end deep stereo matching networks, have achieved remarkable performance in the fields of autonomous driving and depth sensing. However, state-of-the-art stereo algorithms, even within the deep neural network framework, still have difficulty finding correct correspondences in near-range regions and object edge cues. To reinforce the precision of disparity prediction, in the present study we propose a parallax attention stereo matching algorithm based on the improved group-wise correlation stereo network to learn the disparity content from a stereo correspondence, and it supports end-to-end prediction of both the disparity map and the edge map. In particular, we advocate a parallax attention module at the three-dimensional (disparity, height and width) level, whose structure ensures high-precision estimation by improving feature expression in near-range regions. This is critical for computer vision tasks and can be utilized in several existing models to enhance their performance. Moreover, in order to make full use of the edge information learned by the two-dimensional feature extraction network, we propose a novel edge detection branch and a multi-featured integration cost volume. It is demonstrated that, based on our model, the edge detection branch is conducive to improving the accuracy of disparity estimation. Our method achieves better results than previous works on both the Scene Flow and KITTI datasets.
... Finally, it is important to note that whereas the PM model is approximately optimal for estimating the slant and distance of planar textured surfaces (with uncorrelated image noise), it is not optimal under real-world conditions, where many surfaces (except the ground plane) are non-planar and where there are half occlusions (points with no corresponding point in the other eye). More sophisticated computations are required in natural 3D scenes (Scharstein & Szeliski, 2002;Hirschmuller & Scharstein, 2007). The human visual system is likely to be much more sophisticated than the models considered here. ...
Article
Full-text available
Binocular stereo cues are important for discriminating 3D surface orientation, especially at near distances. We devised a single-interval task where observers discriminated the slant of a densely textured planar test surface relative to a textured planar surround reference surface. Although surfaces were rendered with correct perspective, the stimuli were designed so that the binocular cues dominated performance. Slant discrimination performance was measured as a function of the reference slant and the level of uncorrelated white noise added to the test-plane images in the left and right eyes. We compared human performance with an approximate ideal observer (planar matching [PM]) and two subideal observers. The PM observer uses the image in one eye and back projection to predict a test image in the other eye for all possible slants, tilts, and distances. The estimated slant, tilt, and distance are determined by the prediction that most closely matches the measured image in the other eye. The first subideal observer (local planar matching [LPM]) applies PM over local neighborhoods and then pools estimates across the test plane. The second subideal observer (local frontoparallel matching [LFM]) uses only location disparity. We find that the ideal observer (PM) and the first subideal observer (LPM) outperform the second subideal observer (LFM), demonstrating the additional benefit of pattern disparities. We also find that all three model observers can account for human performance if two free parameters are included: a fixed small level of internal estimation noise, and a fixed overall efficiency scalar on slant discriminability.
... Traditional methods conduct stereo matching by adopting the classic four-step pipeline (Scharstein and Szeliski, 2002): matching cost computation for similarity measurement of points from two images, cost aggregation for cost smoothness in the local neighborhood, disparity calculation to acquire initial disparities, and disparity refinement to obtain final results. Representative algorithms include graph cuts (Boykov and Jolly, 2001), semi-global matching (Hirschmueller, 2008), PatchMatch (Bleyer et al., 2011), and so on. ...
Article
Full-text available
Disparity estimation of satellite stereo images is an essential and challenging task in photogrammetry and remote sensing. Recent researches have greatly promoted the development of disparity estimation algorithms by using CNN (Convolutional Neural Networks) based deep learning techniques. However, it is still difficult to handle intractable regions that are mainly caused by occlusions, disparity discontinuities, texture-less areas, and repetitive patterns. Besides, the lack of training datasets for satellite stereo images remains another major issue that blocks the usage of CNN techniques due to the difficulty of obtaining ground-truth disparities. In this paper, we propose an end-to-end disparity learning model, termed hierarchical multi-scale matching network (HMSM-Net), for the disparity estimation of high-resolution satellite stereo images. First, multi-scale cost volumes are constructed by using pyramidal features that capture spatial information of multiple levels, which learn correspondences at multiple scales and enable HMSM-Net to be more robust in intractable regions. Second, stereo matching is executed in a hierarchical coarse-to-fine manner by applying supervision to each scale, which allows a lower scale to act as prior knowledge and guides a higher scale to attain finer matching results. Third, a refinement module that incorporates the intensity and gradient information of the input left image is designed to regress a detailed full-resolution disparity map for local structure preservation. For network training and testing, a dense stereo matching dataset is created and published by using GaoFen-7 satellite stereo images. Finally, the proposed network is evaluated on the Urban Semantic 3D and GaoFen-7 datasets. Experimental results demonstrate that HMSM-Net achieves superior accuracy compared with state-of-the-art methods, and the improvement on intractable regions is noteworthy. 
Additionally, results and comparisons of different methods on the GaoFen-7 dataset show that it can serve as a challenging benchmark for performance assessment of methods applied to disparity estimation of satellite stereo images. The source codes and evaluation dataset are made publicly available at https://github.com/Sheng029/HMSM-Net.
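Semi-global matching, cited among the representative algorithms in the excerpt above, aggregates matching costs along several 1D image paths. The following is a hedged sketch of a single left-to-right path of Hirschmuller's recurrence; a full implementation sums 8 or 16 such paths before the winner-takes-all step, and the penalty values P1 and P2 here are illustrative defaults:

```python
import numpy as np

def sgm_horizontal_path(cost, P1=1.0, P2=8.0):
    """SGM cost aggregation along one (left-to-right) path.

    cost: (H, W, D) matching-cost volume. At each pixel and disparity d,
    the aggregated cost adds the cheapest compatible transition from the
    previous pixel on the path: staying at d is free, changing by one
    disparity level costs P1, larger jumps cost P2; subtracting the
    previous minimum keeps the values bounded.
    """
    h, w, D = cost.shape
    L = np.zeros_like(cost, dtype=np.float64)
    L[:, 0] = cost[:, 0]
    for x in range(1, w):
        prev = L[:, x - 1]                                   # (H, D)
        prev_min = prev.min(axis=1, keepdims=True)
        same = prev                                          # stay at d
        up = np.concatenate([prev[:, 1:], np.full((h, 1), np.inf)], axis=1) + P1
        down = np.concatenate([np.full((h, 1), np.inf), prev[:, :-1]], axis=1) + P1
        jump = prev_min + P2                                 # any larger change
        trans = np.minimum(np.minimum(same, up), np.minimum(down, jump))
        L[:, x] = cost[:, x] + trans - prev_min
    return L
```

Summing such path costs over multiple directions and taking the argmin over the disparity axis yields the semi-global disparity estimate.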
... Some researchers have found in their experiments that, when faced with the energy-minimization problem, the Markov random field is gradually upgraded to a dedicated graph-cut technique based on the Markov random field, the main principle of which is belief-propagation minimization [22][23][24]. We found in our experiments that the visual information within and between scan lines is obtained from Markov random field modeling, so that excellent image segmentation capability can be obtained in the process of virtual image matching [25][26][27][28]. ...
Article
Full-text available
Cross-media visual communication is an extremely complex task. In order to solve the problem of segmenting the visual foreground and background, improve the accuracy of visual communication scene reconstruction, and complete the task of real-time visual communication, we propose an improved generative adversarial network. We take the generative adversarial network as the basis and add a combined codec package to the generator, while configuring the generator and discriminator in a cascade structure, preserving the feature upsampling and downsampling convolutional layers of visual scenes at different layers through correspondence. To classify features from different visual scene layers, we add a new auxiliary classifier based on convolutional neural networks. With the help of the auxiliary classifier, similar visual scenes with different feature layers achieve a more accurate recognition rate. In the experimental part, to better distinguish foreground and background in visual communication, we test performance on foreground and background using separate datasets. The experimental results show that our method achieves good accuracy on both foreground and background in cross-media real-time visual communication. In addition, we validate the efficiency of our method on the Cityscapes, NoW, and Replica datasets, and experimentally demonstrate that our method performs better than traditional machine learning methods and outperforms deep learning methods of the same type.
... Depth can be accurately predicted from stereo matching, which can be broadly divided into binocular stereo and multi-view stereo (MVS). The former requires calibrated setups of rectified stereo pairs, and many traditional [4,25,26,43] and deep learning based methods [8,9,11,30,32,33,40,61,62] have been proposed. Compared with binocular stereo, MVS methods estimate depth from a set of monocular images or a video, where the camera moves and the scene is assumed static. ...
Preprint
Full-text available
In this paper, we present a learning-based approach for multi-view stereo (MVS), i.e., estimate the depth map of a reference frame using posed multi-view images. Our core idea lies in leveraging a "learning-to-optimize" paradigm to iteratively index a plane-sweeping cost volume and regress the depth map via a convolutional Gated Recurrent Unit (GRU). Since the cost volume plays a paramount role in encoding the multi-view geometry, we aim to improve its construction both in pixel- and frame- levels. In the pixel level, we propose to break the symmetry of the Siamese network (which is typically used in MVS to extract image features) by introducing a transformer block to the reference image (but not to the source images). Such an asymmetric volume allows the network to extract global features from the reference image to predict its depth map. In view of the inaccuracy of poses between reference and source images, we propose to incorporate a residual pose network to make corrections to the relative poses, which essentially rectifies the cost volume in the frame-level. We conduct extensive experiments on real-world MVS datasets and show that our method achieves state-of-the-art performance in terms of both within-dataset evaluation and cross-dataset generalization.
... Knowledge-based approaches use prior knowledge to infer vehicle positions depicted in a frame. Stereo-based methods measure vehicle positions and obstacles in images by using the disparity map [15] and Inverse Perspective Mapping (IPM) [16]. Motion-based techniques use the optical flow approach to track vehicles and obstacles. ...
Research
Full-text available
Smart traffic and information systems require the collection of traffic data from the respective sensors for the regulation of traffic. In this regard, surveillance cameras have been installed for the monitoring and control of traffic in the last few years. Several studies have been carried out on video surveillance technologies using image processing techniques for traffic management. Video processing of traffic data obtained through surveillance cameras is an instance of applications for advance warning or data extraction for real-time analysis of vehicles. This paper presents a detailed review of the literature on vehicle detection and classification techniques and also discusses the open challenges to be addressed in this area of research. It also reviews various vehicle datasets used for evaluating the proposed techniques in various studies.
... To account for such relationships, we observe that few-shot segmentation can be reformulated as semantic correspondence, which aims to find pixel-level correspondences across semantically similar images which may contain large intraclass appearance and geometric variations [13,14,43]. Recent semantic correspondence models [50,25,51,53,42,44,34,65,41] follow the classical matching pipeline [54,47] of feature extraction, cost aggregation and flow estimation. The cost aggregation stage, where matching scores are refined to produce more reliable correspondence estimates, is of particular importance and has been the focus of much research [53,42,52,22,34,29,41,6]. ...
Preprint
Full-text available
This paper presents a novel cost aggregation network, called Volumetric Aggregation with Transformers (VAT), for few-shot segmentation. The use of transformers can benefit correlation map aggregation through self-attention over a global receptive field. However, the tokenization of a correlation map for transformer processing can be detrimental, because the discontinuity at token boundaries reduces the local context available near the token edges and decreases inductive bias. To address this problem, we propose a 4D Convolutional Swin Transformer, where a high-dimensional Swin Transformer is preceded by a series of small-kernel convolutions that impart local context to all pixels and introduce convolutional inductive bias. We additionally boost aggregation performance by applying transformers within a pyramidal structure, where aggregation at a coarser level guides aggregation at a finer level. Noise in the transformer output is then filtered in the subsequent decoder with the help of the query's appearance embedding. With this model, a new state-of-the-art is set for all the standard benchmarks in few-shot segmentation. It is shown that VAT attains state-of-the-art performance for semantic correspondence as well, where cost aggregation also plays a central role.
... Estimating depth from images has a long history in computer vision. Several methods use either stereo images [18,19,20] or two or more images taken from different viewing angles [21,22,23]. We try to solve this problem using a single monocular image, without any constraints on the scene of interest. ...
Preprint
Full-text available
Self-supervised deep learning methods for joint depth and ego-motion estimation can yield accurate trajectories without needing ground-truth training data. However, as they typically use photometric losses, their performance can degrade significantly when the assumptions these losses make (e.g. temporal illumination consistency, a static scene, and the absence of noise and occlusions) are violated. This limits their use for e.g. nighttime sequences, which tend to contain many point light sources (including on dynamic objects) and low signal-to-noise ratio (SNR) in darker image regions. In this paper, we show how to use a combination of three techniques to allow the existing photometric losses to work for both day and nighttime images. First, we introduce a per-pixel neural intensity transformation to compensate for the light changes that occur between successive frames. Second, we predict a per-pixel residual flow map that we use to correct the reprojection correspondences induced by the estimated ego-motion and depth from the networks. And third, we denoise the training images to improve the robustness and accuracy of our approach. These changes allow us to train a single model for both day and nighttime images without needing separate encoders or extra feature networks like existing methods. We perform extensive experiments and ablation studies on the challenging Oxford RobotCar dataset to demonstrate the efficacy of our approach for both day and nighttime sequences.
... Stereo matching can be broadly grouped into two approaches: local and global (Scharstein and Szeliski, 2002). Local approaches compute the disparity value by constructing a patch around each pixel and comparing the features of the two pixels in the left and right images. ...
... The approaches involved using geometry to constrain and reproduce the concept of stereopsis mathematically and in real time. Scharstein et al. [136] summarize most of these ideas in their survey. ...
Preprint
The formation of eyes led to the big bang of evolution. The dynamics changed from a primitive organism waiting for food to come into contact with it, to food being sought after by visual sensors. The human eye is one of the most sophisticated developments of evolution, but it still has defects. Over millions of years, humans have evolved a biological perception algorithm capable of driving cars, operating machinery, piloting aircraft, and navigating ships. Automating these capabilities for computers is critical for various applications, including self-driving cars, augmented reality, and architectural surveying. Near-field visual perception in the context of self-driving cars can perceive the environment in a range of 0-10 meters and with 360° coverage around the vehicle. It is a critical decision-making component in the development of safer automated driving. Recent advances in computer vision and deep learning, in conjunction with high-quality sensors such as cameras and LiDARs, have fueled mature visual perception solutions. Until now, far-field perception has been the primary focus. Another significant issue is the limited processing power available for developing real-time applications. Because of this bottleneck, there is frequently a trade-off between performance and run-time efficiency. We concentrate on the following issues in order to address them: 1) Developing near-field perception algorithms with high performance and low computational complexity for various visual perception tasks, such as geometric and semantic tasks, using convolutional neural networks. 2) Using multi-task learning to overcome computational bottlenecks by sharing initial convolutional layers between tasks and developing optimization strategies that balance tasks.
... V ISUAL perception plays an increasingly important role in a number of fields such as robotics, smart vehicles, and augmented/virtual reality (AR/VR). These are broad and complex areas of application that require the solution of a variety of problems including but not limited to image matching [1], [2], camera motion estimation [3], [4], [5], localisation [6], 3D reconstruction [7], [8], [9], [10], and object segmentation [11], [12]. Over the past several decades, the community has achieved great progress in traditional camera based solutions to these problems [13], [14]. ...
Preprint
Event cameras are bio-inspired sensors that perform well in challenging illumination conditions and have high temporal resolution. However, their concept is fundamentally different from traditional frame-based cameras. The pixels of an event camera operate independently and asynchronously. They measure changes of the logarithmic brightness and return them in the highly discretised form of time-stamped events indicating a relative change of a certain quantity since the last event. New models and algorithms are needed to process this kind of measurements. The present work looks at several motion estimation problems with event cameras. The flow of the events is modelled by a general homographic warping in a space-time volume, and the objective is formulated as a maximisation of contrast within the image of warped events. Our core contribution consists of deriving globally optimal solutions to these generally non-convex problems, which removes the dependency on a good initial guess plaguing existing methods. Our methods rely on branch-and-bound optimisation and employ novel and efficient, recursive upper and lower bounds derived for six different contrast estimation functions. The practical validity of our approach is demonstrated by a successful application to three different event camera motion estimation problems.
... Most of the multi-view image-based rendering approaches design the estimation process after being inspired by the classical multi-view stereo scheme [33,38]. If a 3D point lies on a surface (the case in which the point has a significant impact on the rendered image), there is consensus within the feature set f. ...
Preprint
Full-text available
To estimate the volume density and color of a 3D point in the multi-view image-based rendering, a common approach is to inspect the consensus existence among the given source image features, which is one of the informative cues for the estimation procedure. To this end, most of the previous methods utilize equally-weighted aggregation features. However, this could make it hard to check the consensus existence when some outliers, which frequently occur by occlusions, are included in the source image feature set. In this paper, we propose a novel source-view-wise feature aggregation method, which facilitates us to find out the consensus in a robust way by leveraging local structures in the feature set. We first calculate the source-view-wise distance distribution for each source feature for the proposed aggregation. After that, the distance distribution is converted to several similarity distributions with the proposed learnable similarity mapping functions. Finally, for each element in the feature set, the aggregation features are extracted by calculating the weighted means and variances, where the weights are derived from the similarity distributions. In experiments, we validate the proposed method on various benchmark datasets, including synthetic and real image scenes. The experimental results demonstrate that incorporating the proposed features improves the performance by a large margin, resulting in the state-of-the-art performance.
... Vision reconstruction can be utilized for various applications, such as traditional art learning [7]. Scharstein et al. give an algorithm comparison in [8], while Brown et al. provide a classic summary in [9]. Rother et al. lay the foundation for simultaneous 3D reconstruction and camera parameter acquisition [10]. (Fig. 1: The sensor suite for stereo vision with depth (Intel RealSense D415) equipped on a mobile robot and a UAV. Bottom: left and right images of the D415 for stereo matching; the red lines (manually annotated) show that rectification has been carried out so that corresponding pixel points in the two images lie on the same horizontal line.) ...
Article
Full-text available
Dense depth estimation is significant in robotic systems, such as for mapping, localization, and object recognition. For multiple sensors, an active depth sensor can provide accurate but sparse measurements for environments, and a camera pair can provide dense but imprecise stereo reconstruction results. In this paper, a tightly coupled fusion method is proposed for depth sensor and stereo camera to complete dense depth estimation, and advantages of the two type sensors are combined so as to achieve better depth estimation. An adaptive dynamic cross-arm algorithm are developed to integrate sparse depth measurements into camera-dominated semiglobal stereo matching. To obtain the optimal arm length for a measured pixel point, each cross-arm shape is variational and calculated automatically. Public datasets of KITTI, Middlebury, and Scene Flow datasets are used with comparison experiments to test performance of the proposed method, and real-world experiments are further conducted for verification.
... Based on the taxonomy of stereo correspondence algorithms, stereo matching methods can be categorized into two broad groups, namely energy-based and window-based algorithms [6]. The energy-based methods are also called global methods, as their cost or objective functions are defined over the entire image extent [7]. ...
Preprint
A dense depth-map of a scene at an arbitrary view orientation can be estimated from dense view correspondences among multiple lower-dimensional views of the scene. These low-dimensional view correspondences are dependent on the geometrical relationship among the views and the scene. Determining dense view correspondences is difficult in part due to presence of homogeneous regions in the scene and due to presence of occluded regions and illumination differences among the views. We present a new multi-resolution factor graph-based stereo matching algorithm (MR-FGS) that utilizes both intra- and inter-resolution dependencies among the views as well as among the disparity estimates. The proposed framework allows exchange of information among multiple resolutions of the correspondence problem and is useful for handling larger homogeneous regions in a scene. The MR-FGS algorithm was evaluated qualitatively and quantitatively using stereo pairs in the Middlebury stereo benchmark dataset based on commonly used performance measures. When compared to a recently developed factor graph model (FGS), the MR-FGS algorithm provided more accurate disparity estimates without requiring the commonly used post-processing procedure known as the left-right consistency check. The multi-resolution dependency constraint within the factor-graph model significantly improved contrast along depth boundaries in the MR-FGS generated disparity maps.
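The energy-based (global) formulation mentioned in the excerpt above is conventionally written, following the Scharstein and Szeliski taxonomy, as a data term plus a weighted smoothness term over the disparity field d:

```latex
E(d) = E_{\mathrm{data}}(d) + \lambda \, E_{\mathrm{smooth}}(d),
\qquad
E_{\mathrm{data}}(d) = \sum_{(x,y)} C\bigl(x, y, d(x,y)\bigr),
\qquad
E_{\mathrm{smooth}}(d) = \sum_{(x,y)} \rho\bigl(d(x,y) - d(x{+}1,y)\bigr)
                                    + \rho\bigl(d(x,y) - d(x,y{+}1)\bigr)
```

Here C is the (possibly aggregated) matching cost, ρ is a robust penalty on disparity differences between neighbouring pixels, and λ balances data fidelity against smoothness. Window-based (local) methods instead minimize the data term alone at each pixel.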
... Although stereo matching pipelines used in (De Franchis et al., 2014, Qin et al., 2019, d'Angelo et al., 2019 may differ, they all feature the common matching steps as defined in the taxonomy of (Scharstein and Szeliski, 2002): ...
Article
Full-text available
The amount of very high resolution optical satellite images at our disposal is continuously increasing. Besides, associated satellite programs often come with high revisit rates and geometric properties that allow for either opportunistic or by-design 3D stereo reconstruction. Digital Surface Models (DSM) computed from these satellite images offer new possibilities. In the past, the high revisit rate has largely benefited glacier monitoring studies. Now, DSM with increased resolution provided on urban areas can be used for smart city applications as well. However, most of these require 3D modeling of buildings with level of details ranging from 0 to 2. This is where the need for better reconstructed buildings inside DSM arises. Indeed, building edges and corners tend to be smoothed and softened by the stereo matching step of a DSM computation pipeline. This undesired behavior can mostly be linked to the difficult task of optimizing the Disparity Space Image, thus finding good balance between smoothing untextured areas while conserving sharp discontinuities where needed. In this paper, we show how the optimization can benefit from an input building semantic segmentation. We also provide a method to create it from a very high satellite image in epipolar geometry using a convolutional neural network. To help our network generalize well on unseen areas we propose an interactive learning method based on clicked annotations. Eventually, we show that annotations can be automatically created, hence removing the need for an operator and making our solution suitable for operational conditions.
... The approaches involved using geometry to constrain and reproduce the concept of stereopsis mathematically and in real time. Scharstein and Szeliski [136] summarize most of these ideas in their survey. ...
Thesis
Full-text available
Near-field visual perception in the context of self-driving cars can perceive the environment in a range of 0-10 meters and 360° coverage around the vehicle. It is a critical decision-making component in the development of safer automated driving. Recent advances in computer vision and deep learning, in conjunction with high-quality sensors such as cameras and LiDARs, have fueled mature visual perception solutions. Until now, far-field perception has been the primary focus. Another significant issue is the limited processing power available for developing real-time applications. Because of this bottleneck, there is frequently a trade-off between performance and run-time efficiency.
... Depth data naturally conveys geometric information; therefore, understanding depth computation, its data characteristics, and its failure modes is highly pertinent. [27] outlined four steps commonly encountered in classical stereo image pipelines: matching cost computation, cost aggregation, disparity computation/optimization, and disparity refinement. Despite representational advances via deep learning, these steps continue to play a key role [37]. ...
Conference Paper
Full-text available
The estimation of depth cues from a single image has recently emerged as an appealing alternative to depth estimation from stereo image pairs. The easy availability of these dense depth cues naturally triggers research questions, how depth images can be used to infer geometric object and view attributes. Furthermore, the question arises how the quality of the estimated depth data compares between different sensing modalities, especially given the fact that monocular methods rely on a learned correlation between local appearance and depth, without the notion of a metric scale. Further motivated by the ease of synthetic data generation, we propose depth computation on synthetic images as a training step for 3D pose estimation of rigid objects, applying models on real images and thus also demonstrating a reduced synth-to-real gap. To characterize depth data qualities, we present a comparative evaluation involving two monocular and one stereo depth estimation schemes. We furthermore propose a novel and simple two-step depth-ground-truth generation workflow for a quantitative comparison. The presented data generation, evaluation and exemplary pose estimation pipeline are generic and applicable to more complex geometries.
... Stereo matching is one of the most heavily investigated topics in computer vision [34]. Trying to mimic human vision, the stereo system comprises two cameras and a computing device to triangulate the 3D shape by estimating the parallax between the left and right images. ...
Preprint
Full-text available
In stereoscope-based Minimally Invasive Surgeries (MIS), dense stereo matching plays an indispensable role in 3D shape recovery, AR, VR, and navigation tasks. Although numerous Deep Neural Network (DNN) approaches have been proposed, conventional prior-free approaches remain popular in industry because of the lack of open-source annotated data sets and the limitations of task-specific pre-trained DNNs. Among the prior-free stereo matching algorithms, there is no successful real-time algorithm for MIS in a non-GPU environment. This paper proposes the first CPU-level real-time prior-free stereo matching algorithm for general MIS tasks. We achieve an average of 17 Hz on 640*480 surgical images with a single-core CPU (i5-9400). Meanwhile, it achieves slightly better accuracy than the popular ELAS. A patch-based fast disparity searching algorithm is adopted for the rectified stereo images. A coarse-to-fine Bayesian probability and a spatial Gaussian mixture model are proposed to evaluate the patch probability at different scales. An optional probability density function estimation algorithm is adopted to quantify the prediction variance. Extensive experiments demonstrate the proposed method's capability to handle ambiguities introduced by textureless surfaces and the photometric inconsistency from non-Lambertian reflectance and dark illumination. The estimated probability manages to balance the confidences of the patches for stereo images at different scales. The method has similar or higher accuracy and fewer outliers than the baseline ELAS in MIS, while being 4-5 times faster. The code and the synthetic data sets are available at https://github.com/JingweiSong/BDIS-v2.
... In order to establish the matching relationship more accurately, in addition to using colour information at the endpoints, image gradient information and the census transform [28] can be used. The image gradient is defined by the first-order partial derivatives Gx and Gy in the x and y directions. The gradient magnitude of a pixel is then the norm of the gradient vector m, and the phase angle of the gradient vector can be calculated from the relationship between Gx and Gy; the formula is given below. ...
Article
Full-text available
Stereo matching algorithms have been developed for many years but basically focus only on existing datasets and are rarely applied to real scenarios, such as industrial robot scenarios. Traditional stereo matching algorithms have a high error rate, and deep learning algorithms struggle to obtain good results in real scenarios because of their weak generalisation ability and the difficulty of obtaining training data. In order to use stereo matching algorithms for industrial robot guidance, it is better to design a new traditional algorithm with low time complexity, tailored to industrial robot scenarios dominated by planar facets. This paper proposes a new matching method based on subrows of pixels, instead of individual pixels, in order to improve the robustness of matching and reduce running time. First, the pixel strings from the same row of the left and right images are divided into several colour-identical or colour-gradient segments. Then, the colour and length of the left and right pixel segments are used as clues to determine a matching relation and obtain the matching type; all match types can be determined according to non-crossing mapping. From each match type, the corresponding spatial state of the stimulus source can be inferred, so that the disparity of pixels in the pixel segments representing that spatial state can be calculated. This new matching method makes full use of the stimulus-homology constraints and projective geometric constraints of row-aligned images. The method obtains good results in industrial robot scenarios and can be applied to industrial robot guidance.
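The gradient and census-transform cues described in the excerpt above can be sketched as follows. This is a minimal illustration under stated assumptions: the 5x5 census window, the comparison order of neighbours, and the uint64 encoding are arbitrary choices, not the paper's.

```python
import numpy as np

def gradient_features(img):
    """Per-pixel gradient magnitude and phase angle from Gx, Gy."""
    gy, gx = np.gradient(img.astype(float))  # first-order partial derivatives
    magnitude = np.hypot(gx, gy)             # |m| = sqrt(Gx^2 + Gy^2)
    angle = np.arctan2(gy, gx)               # phase angle of the gradient vector
    return magnitude, angle

def census_transform(img, radius=2):
    """Census transform: encode each pixel as a bit string recording, for every
    neighbour in a (2*radius+1)^2 window, whether it is darker than the centre."""
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.uint64)
    padded = np.pad(img, radius, mode="edge")
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue  # skip the centre pixel itself
            neighbour = padded[radius + dy : radius + dy + h,
                               radius + dx : radius + dx + w]
            out = (out << np.uint64(1)) | (neighbour < img).astype(np.uint64)
    return out

def hamming_cost(c1, c2):
    """Matching cost between census codes = Hamming distance of the bit strings."""
    x = np.bitwise_xor(c1, c2)
    return np.array([bin(int(v)).count("1") for v in x.ravel()]).reshape(x.shape)
```

Because the census code only records ordinal relations, the Hamming cost is robust to monotonic illumination changes between the left and right images, which is why it is a popular complement to raw colour differences.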
... Subsequently, the estimated camera positions, orientations, and parameters are used to generate a dense 3D point cloud using a form of stereo matching. Cross-correlation is used to match a pixel in one image with the corresponding pixel along the epipolar line in the next image [42]. This process is repeated for every overlapping pair of images. ...
Article
Full-text available
Background There is a demand for non-destructive systems in plant phenotyping which could precisely measure plant traits for growth monitoring. In this study, the growth of chilli plants (Capsicum annum L.) was monitored in outdoor conditions. A non-destructive solution is proposed for growth monitoring in 3D using a single mobile phone camera based on a structure from motion algorithm. A method to measure leaf length and leaf width when the leaf is curled is also proposed. Various plant traits such as number of leaves, stem height, leaf length, and leaf width were measured from the reconstructed and segmented 3D models at different plant growth stages. Results The accuracy of the proposed system is measured by comparing the values derived from the 3D plant model with manual measurements. The results demonstrate that the proposed system has potential to non-destructively monitor plant growth in outdoor conditions with high precision, when compared to the state-of-the-art systems. Conclusions In conclusion, this study demonstrated that the methods proposed to calculate plant traits can monitor plant growth in outdoor conditions.
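The cross-correlation search along an epipolar line mentioned in the excerpt above can be sketched with normalized cross-correlation (NCC) on rectified images, where the epipolar line of a pixel is simply its row. The patch radius and exhaustive scan below are illustrative assumptions, not the cited pipeline's exact settings.

```python
import numpy as np

def ncc(patch_a, patch_b):
    """Normalized cross-correlation between two equally sized patches."""
    a = patch_a - patch_a.mean()
    b = patch_b - patch_b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def match_along_epipolar(left, right, y, x, radius=3):
    """Find the column in row y of `right` whose patch best correlates with the
    patch centred at (y, x) in `left` (rectified images: epipolar line = row y)."""
    h, w = left.shape
    ref = left[y - radius : y + radius + 1, x - radius : x + radius + 1]
    best_x, best_score = -1, -np.inf
    for xr in range(radius, w - radius):  # scan every valid column in the row
        cand = right[y - radius : y + radius + 1, xr - radius : xr + radius + 1]
        score = ncc(ref, cand)
        if score > best_score:
            best_x, best_score = xr, score
    return best_x, best_score
```

Mean subtraction and normalization make NCC invariant to local brightness offset and gain, which matters when the two views were captured under different exposure.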
Article
Full-text available
Aiming at the problems of high cost and low accuracy of scene details during depth map generation in 3D reconstruction, we propose an unsupervised monocular image depth prediction algorithm based on Fourier domain analysis. Generally speaking, small-scale images can better display depth details, while large-scale images can more reliably display the depth distribution of the entire image. In order to take advantage of these complementary properties, we crop the input image with different cropping ratios to generate multiple disparity map candidates, and then use Fourier frequency domain analysis to fuse the disparity map candidates into left and right disparity maps. At the same time, we propose a loss function based on MSSIM to compensate for the difference between left and right views and realize unsupervised training of the monocular image depth prediction model. Experimental results show that our method performs well on the KITTI dataset.
Article
We propose a deep reinforcement learning‐based solution for the 3D reconstruction of objects of complex topologies from a single RGB image. We use a template‐based approach. However, unlike previous template‐based methods, which are limited to the reconstruction of 3D objects of fixed topology, our approach learns simultaneously the geometry and topology of the target 3D shape in the input image. To this end, we propose a neural network that learns to deform a template to fit the geometry of the target object. Our key contribution is a novel reinforcement learning framework that enables the network to also learn how to adjust, using pruning operations, the topology of the template to best fit the topology of the target object. We train the network in a supervised manner using a loss function that enforces smoothness and penalizes long edges in order to ensure high visual plausibility of the reconstructed 3D meshes. We evaluate the proposed approach on standard benchmarks such as ShapeNet, and in‐the‐wild using unseen real‐world images. We show that the proposed approach outperforms the state‐of‐the‐art in terms of the visual quality of the reconstructed 3D meshes, and also generalizes well to out‐of‐category images.
Article
Full-text available
Owing to the limitations of onboard equipment, sparse stereo vision is becoming a suitable choice for deployment on micro air vehicles (MAVs) and small robots. However, for point-feature-based sparse stereo, most current stereo algorithms ignore the similarity between feature points, so it is hard to achieve high accuracy. In addition, the problem of clustered feature distribution still affects the performance of point-feature-based algorithms in applications. To make up for these deficiencies, the authors propose an improved features from accelerated segment test (FAST) feature detector to suppress point detection in complex texture regions. Most importantly, the authors present a novel census transform (CT)-based algorithm that contains two encoders, 'texture orientation' and 'texture gradient', to obtain a more efficient census bit string for each feature point. Instead of randomly selecting pixels to calculate the bit string, the texture characteristics of the census windows where the feature points are located are taken into account. Compared with the original CT, the processing speed of the method is improved, and the average error is reduced by 18.05%. The evaluation results show that the presented improved point-feature-based sparse stereo algorithm has great value in engineering applications.
Preprint
Full-text available
Deep-learning-based approaches to depth estimation are rapidly advancing, offering superior performance over existing methods. To estimate depth in real-world scenarios, depth estimation models require robustness to various noise environments. In this work, a Pyramid Frequency Network (PFN) with a Spatial Attention Residual Refinement Module (SARRM) is proposed to deal with the weak robustness of existing deep-learning methods. To reconstruct depth maps with accurate details, the SARRM constructs a residual fusion method with an attention mechanism to refine blurred depth. A frequency division strategy is designed, and the frequency pyramid network is developed to extract features from multiple frequency bands. With the frequency strategy, PFN achieves better visual accuracy than state-of-the-art methods in both indoor and outdoor scenes on the Make3D, KITTI depth, and NYUv2 datasets. Additional experiments on the noisy NYUv2 dataset demonstrate that PFN is more reliable than existing deep-learning methods in high-noise scenes.
Article
Full-text available
To support the ongoing size reduction in integrated circuits, the need for accurate depth measurements of on-chip structures becomes increasingly important. Unfortunately, present metrology tools do not offer a practical solution. In the semiconductor industry, critical dimension scanning electron microscopes (CD-SEMs) are predominantly used for 2D imaging at a local scale. The main objective of this work is to investigate whether sufficient 3D information is present in a single SEM image for accurate surface reconstruction of the device topology. In this work, we present a method that is able to produce depth maps from synthetic and experimental SEM images. We demonstrate that the proposed neural network architecture, together with a tailored training procedure, leads to accurate depth predictions. The training procedure includes a weakly supervised domain adaptation step, which is further referred to as pixel-wise fine-tuning. This step employs scatterometry data to address the ground-truth scarcity problem. We have tested this method first on a synthetic contact hole dataset, where a mean relative error smaller than 6.2% is achieved at realistic noise levels. Additionally, it is shown that this method is well suited for other important semiconductor metrics, such as top critical dimension (CD), bottom CD and sidewall angle. To the extent of our knowledge, we are the first to achieve accurate depth estimation results on real experimental data, by combining data from SEM and scatterometry measurements. An experiment on a dense line space dataset yields a mean relative error smaller than 1%.
Article
Full-text available
The volume detection of medical mice feed is crucial to understanding the food intake requirements of mice at different growth stages and to grasping their growth, development, and health status. To address the problem of calculating the volume of bulk feed for mice, a method for detecting the bulk volume of mouse feed based on binocular stereo vision is proposed. Firstly, the three-dimensional coordinates of points on the feed's surface are calculated using binocular stereo three-dimensional reconstruction. These dense points form a point cloud, and a projection method is used to calculate the volume of the point cloud, yielding the volume of the mouse feed. We use the stereo matching dataset provided by the Middlebury evaluation platform for experimental verification. The results show that our method effectively improves the matching quality of stereo matching and makes the obtained three-dimensional coordinates of the feed's surface more accurate. The point cloud is then denoised and Delaunay-triangulated, and the volumes of the tetrahedra obtained after the triangulation are calculated and summed to obtain the total volume. We used wood blocks of different sizes instead of feed for multiple volume calculations, and the average error between the calculated volume and the real volume was 7.12%. The experimental results show that the volume of the remaining feed of mice can be calculated by binocular stereo vision.
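The volume-by-tetrahedra step described in the abstract above can be sketched as follows: each tetrahedron's volume is |det([b-a, c-a, d-a])| / 6, and the cloud's volume is the sum over all tetrahedra of the triangulation. This is a generic sketch, not the paper's implementation; the triangulation itself is assumed given (in practice it could come from `scipy.spatial.Delaunay(points).simplices`).

```python
import numpy as np

def tetra_volume(a, b, c, d):
    """Volume of a tetrahedron from its four vertices: |det([b-a, c-a, d-a])| / 6."""
    return abs(np.linalg.det(np.stack([b - a, c - a, d - a]))) / 6.0

def point_cloud_volume(points, simplices):
    """Total volume covered by a tetrahedralisation of a 3D point cloud.

    points:    (N, 3) array of 3D point coordinates.
    simplices: iterable of index quadruples, one per tetrahedron.
    """
    points = np.asarray(points, dtype=float)
    return sum(tetra_volume(*points[list(s)]) for s in simplices)
```

As a sanity check, the unit cube split into six tetrahedra along its main diagonal (the Kuhn triangulation) should sum to a volume of exactly 1.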