
Abstract

Convolutional Neural Networks (CNNs) need large amounts of data with ground-truth annotation, a requirement that has limited the development and rapid deployment of CNNs for many computer vision tasks. We propose a novel framework for estimating depth from monocular images, together with a corresponding confidence, in a self-supervised manner. A fully differentiable patch-based cost function is proposed using the Zero-Mean Normalized Cross Correlation (ZNCC), with multi-scale patches as the matching strategy. This approach greatly increases the accuracy and robustness of depth learning. In addition, the proposed patch-based cost function provides a confidence between 0 and 1, which is then used to supervise the training of a parallel network for confidence map learning and estimation. Evaluation on the KITTI dataset shows that our method outperforms the state of the art.
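As a concrete illustration of the matching strategy described above, here is a minimal NumPy sketch of a ZNCC patch cost averaged over several patch sizes and mapped to a [0, 1] confidence; the function names, patch sizes, and the averaging scheme are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def zncc(patch_a, patch_b, eps=1e-8):
    """Zero-Mean Normalized Cross Correlation between two same-size patches.
    Returns a score in [-1, 1]; values near 1 indicate a good match."""
    a = patch_a - patch_a.mean()
    b = patch_b - patch_b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum()) + eps
    return float((a * b).sum() / denom)

def multiscale_confidence(img_a, img_b, y, x, sizes=(3, 5, 9)):
    """Average ZNCC over several patch sizes around pixel (y, x) and map the
    score from [-1, 1] to a [0, 1] confidence. Assumes (y, x) lies far
    enough from the image border for the largest patch."""
    scores = []
    for s in sizes:
        r = s // 2
        pa = img_a[y - r:y + r + 1, x - r:x + r + 1]
        pb = img_b[y - r:y + r + 1, x - r:x + r + 1]
        scores.append(zncc(pa, pb))
    return 0.5 * (np.mean(scores) + 1.0)
```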
... Watson et al. [22] developed a monocular depth estimation network incorporating a multi-frame framework, markedly enhancing depth estimation accuracy. Further research has introduced numerous improvements to self-supervised monocular depth estimation, including real-time processing [23], multi-scale appearance losses [24], and confidence estimation [25]. In brief, monocular depth estimation techniques applied to natural scenes are becoming increasingly mature. ...
... where $D_i^{GT}$ denotes the ground-truth depth value, $\hat{D}_i$ denotes the depth value estimated by the network at pixel $i$, and $N$ represents the total number of pixels in the image; the threshold $thr$ takes the same value as in previous studies, $1.25$. ...
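For reference, the excerpt describes the standard threshold accuracy metric, $\delta = \frac{1}{N}\sum_i \mathbb{1}\left[\max\left(D_i^{GT}/\hat{D}_i,\ \hat{D}_i/D_i^{GT}\right) < thr\right]$, usually reported at $thr = 1.25$, $1.25^2$, and $1.25^3$. A minimal NumPy sketch, assuming strictly positive depth arrays:

```python
import numpy as np

def threshold_accuracy(d_gt, d_pred, thr=1.25):
    """Fraction of pixels with max(d_gt/d_pred, d_pred/d_gt) < thr.
    Both arrays are assumed strictly positive and the same shape."""
    ratio = np.maximum(d_gt / d_pred, d_pred / d_gt)
    return float((ratio < thr).mean())
```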
Article
Background: Bronchoscopy is an essential measure for conducting lung biopsies in clinical practice. It is crucial for advancing the intelligence of bronchoscopy to acquire depth information from bronchoscopic image sequences.
Methods: A self-supervised multi-frame monocular depth estimation approach for bronchoscopy is constructed. Networks are trained by minimising the photometric reprojection error between the target frame and the reconstructed target frame. The adaptive dual attention module and the details emphasis module are introduced to better capture edge contours and internal details. In addition, the approach is evaluated on a self-made dataset and compared against other established methods.
Results: Experimental results demonstrate that the proposed method outperforms other self-supervised monocular depth estimation approaches in both quantitative measurement and qualitative analysis.
Conclusion: Our monocular depth estimation approach for bronchoscopy achieves superior performance in terms of error and accuracy, and passes physical model validations, which can facilitate further research into intelligent bronchoscopic procedures.
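The photometric reprojection error mentioned here is not spelled out in the abstract; the sketch below shows one common form (an SSIM plus L1 mix, as popularized by self-supervised depth methods), with the weighting alpha = 0.85 and the 3x3 averaging windows being assumptions rather than the paper's exact choices:

```python
import torch
import torch.nn.functional as F

def photometric_loss(target, reconstructed, alpha=0.85):
    """Per-pixel photometric error between the target frame and the target
    frame reconstructed by warping a source frame with the predicted depth.
    Inputs are (B, C, H, W) tensors with values in [0, 1]."""
    l1 = (target - reconstructed).abs().mean(1, keepdim=True)

    # Local statistics via 3x3 average pooling for a simple SSIM term.
    mu_x = F.avg_pool2d(target, 3, 1, 1)
    mu_y = F.avg_pool2d(reconstructed, 3, 1, 1)
    sigma_x = F.avg_pool2d(target ** 2, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(reconstructed ** 2, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(target * reconstructed, 3, 1, 1) - mu_x * mu_y
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    ssim = ssim.mean(1, keepdim=True).clamp(0, 1)

    return alpha * (1 - ssim) / 2 + (1 - alpha) * l1
```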
... Poggi et al. [24] similarly defined the uncertainty as the absolute error between two outputs generated using the same method and generated an uncertainty map. Additionally, this study applied previously proposed uncertainty estimation methods [25][26][27][28][29][30][31] to depth estimation and analyzed their effectiveness. Eldesokey et al. [32] and Su et al. [33] proposed uncertainty estimation networks to enhance, respectively, depth completion for sparse depth maps and depth estimation from multi-view stereo inputs. ...
Article
Full-text available
Simultaneous localization and mapping, a critical technology for enabling the autonomous driving of vehicles and mobile robots, increasingly incorporates multi-sensor configurations. Inertial measurement units (IMUs), known for their ability to measure acceleration and angular velocity, are widely utilized for motion estimation due to their cost efficiency. However, the inherent noise in IMU measurements necessitates the integration of additional sensors to facilitate spatial understanding for mapping. Visual–inertial odometry (VIO) is a prominent approach that combines cameras with IMUs, offering high spatial resolution while maintaining cost-effectiveness. In this paper, we introduce our uncertainty-aware depth network (UD-Net), which is designed to estimate both depth and uncertainty maps. We propose a novel loss function for the training of UD-Net, and unreliable depth values are filtered out to improve VIO performance based on the uncertainty maps. Experiments were conducted on the KITTI dataset and our custom dataset acquired from various driving scenarios. Experimental results demonstrated that the proposed VIO algorithm based on UD-Net outperforms previous methods by a significant margin.
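The abstract does not spell out UD-Net's loss; one widely used formulation for jointly learning depth and uncertainty is a Laplacian negative log-likelihood, sketched below together with an uncertainty-based filtering step. Both the loss form and the threshold max_b are illustrative assumptions, not necessarily the paper's:

```python
import torch

def uncertainty_depth_loss(d_pred, log_b, d_gt):
    """Laplacian negative log-likelihood: |d_pred - d_gt| / b + log(b),
    where b = exp(log_b) is the predicted per-pixel uncertainty scale."""
    b = torch.exp(log_b)
    return ((d_pred - d_gt).abs() / b + log_b).mean()

def filter_unreliable(d_pred, log_b, max_b=0.5):
    """Zero out depth values whose predicted uncertainty exceeds an assumed
    threshold before passing them to the VIO back end; returns the filtered
    depth map and the validity mask."""
    mask = torch.exp(log_b) < max_b
    return torch.where(mask, d_pred, torch.zeros_like(d_pred)), mask
```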
... have reviewed the potential of DNNs in 3D microscopy [19]. Chen et al. [22] proposed a 3D convolutional DNN and validated the algorithm for medical image segmentation. Yet, DNN models are known to perform best when used in a supervised ML setting, which would require manual data annotation. ...
Article
Full-text available
Three-dimensional information is crucial to our understanding of biological phenomena. The vast majority of biological microscopy specimens are inherently three-dimensional. However, conventional light microscopy is largely geared towards 2D images, while 3D microscopy and image reconstruction remain feasible only with specialised equipment and techniques. Inspired by the working principles of one such technique, confocal microscopy, we propose a novel approach to 3D widefield microscopy reconstruction through semantic segmentation of in-focus and out-of-focus pixels. For this, we explore a number of rule-based algorithms commonly used for software-based autofocusing and apply them to a dataset of widefield focal stacks. We propose a computation scheme allowing the calculation of lateral focus score maps of the slices of each stack using these algorithms. Furthermore, we identify algorithms preferable for obtaining such maps. Finally, to ensure the practicality of our approach, we propose a surrogate model based on a deep neural network, capable of segmenting in-focus pixels from the out-of-focus background in a fast and reliable fashion. The deep-neural-network-based approach allows a major speedup in data processing, making it usable for online data processing.
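As an example of the rule-based focus measures the authors evaluate, the following sketch computes a lateral focus score map per slice with one classic measure (local Laplacian energy) and labels each pixel by the slice where its score peaks; the specific measure and window size are assumptions for illustration:

```python
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def focus_score_map(slice_2d, window=9):
    """Lateral focus score map for one slice of a widefield focal stack:
    local energy of the Laplacian, a classic rule-based focus measure."""
    lap = laplace(slice_2d.astype(np.float64))
    return uniform_filter(lap ** 2, size=window)

def in_focus_labels(stack):
    """Label each pixel with the index of the stack slice where its focus
    score peaks; a pixel is 'in focus' in the slice maximizing the score."""
    scores = np.stack([focus_score_map(s) for s in stack])
    return scores.argmax(axis=0)
```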
Article
Thin-film measurements for display panels are essential for display research because they provide insight into improving the quality of display panels. To ensure the high-quality fabrication of display panels, display thin films are dried inside the vacuum chamber of a drying device. For this reason, with conventional thin-film measurement techniques, it is difficult to observe in real time the surface profile changes of thin films inside the vacuum chamber. In this work, we present an approach to predict three-dimensional (3D) surface profiles from microscopic images of thin films captured from an aerial perspective using a U-Net-based prediction model. The U-Net-based prediction model can extract complex spatial features and correlations from input images by inferring three-dimensional shape structures. Results from the proposed approach show that various surface profiles of organic thin films can be predicted with a low error rate of approximately 1.3%. Furthermore, the approach offers remote real-time monitoring of the surface profiles of thin films using only aerial microscope images, thus facilitating potential advancements in numerous fields of printed electronics as well as thin films for displays.
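The paper's exact U-Net configuration is not given in the abstract; below is a minimal PyTorch sketch of a U-Net-style image-to-height-map regressor in the same spirit, with channel widths and depth chosen purely for illustration (input height and width are assumed even):

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    """Two 3x3 convolutions with ReLU, the basic U-Net building block."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    """Minimal U-Net-style regressor mapping a grayscale microscope image
    to a per-pixel surface height map; sizes are illustrative only."""
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2 = block(1, 32), block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec = block(64, 32)          # skip connection doubles channels
        self.head = nn.Conv2d(32, 1, 1)   # surface height per pixel

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        d = self.dec(torch.cat([self.up(e2), e1], dim=1))
        return self.head(d)
```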
Article
Full-text available
We present RGB-D-Fusion, a multi-modal conditional denoising diffusion probabilistic model to generate high-resolution depth maps from low-resolution monocular RGB images of humanoid subjects. RGB-D-Fusion first generates a low-resolution depth map using an image-conditioned denoising diffusion probabilistic model and then upsamples the depth map using a second denoising diffusion probabilistic model conditioned on a low-resolution RGB-D image. We further introduce a novel augmentation technique, depth noise augmentation, to increase the robustness of our super-resolution model.
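Depth noise augmentation is only named in the abstract; a plausible minimal sketch is to perturb the low-resolution depth conditioning map during super-resolution training, with the noise scale and dropout probability below being assumed parameters rather than the paper's values:

```python
import torch

def depth_noise_augmentation(depth, sigma=0.01, p_dropout=0.02):
    """Perturb a low-resolution depth conditioning map during training of
    the super-resolution diffusion model: additive Gaussian noise plus
    random per-pixel dropout, to make the model robust to imperfect
    first-stage depth predictions."""
    noisy = depth + sigma * torch.randn_like(depth)
    keep = (torch.rand_like(depth) > p_dropout).float()
    return noisy * keep
```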
Article
In the current research field of light field depth estimation, occlusion in complex scenes and the large volume of data to be processed are problems every researcher must face. For complex occlusion scenes, this paper proposes a depth estimation method based on the fusion of adaptive defocus cues and constrained angular entropy cues, which is more robust to occlusion. At the same time, compressed sensing theory is used to compress and reconstruct the light field image, addressing the large amount of data involved in light field image acquisition, transmission, and processing. Experimental results show that the proposed method handles depth estimation in occluded scenes well and recovers correct depth information. Light field images reconstructed via compressed sensing not only yield good depth estimation results but also effectively reduce the data volume.
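To make the angular entropy cue concrete: for each spatial pixel and candidate depth, one gathers the refocused angular samples and scores their dispersion; low entropy suggests an unoccluded, correct depth. A minimal NumPy sketch, assuming intensities normalized to [0, 1] and an assumed bin count (the constrained variant in the paper additionally restricts which angular samples are included):

```python
import numpy as np

def angular_entropy(samples, bins=16):
    """Entropy of the angular intensity samples gathered at one spatial
    pixel for one candidate depth; lower entropy suggests the views agree,
    i.e. an unoccluded, correct depth hypothesis."""
    hist, _ = np.histogram(samples, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```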
Conference Paper
Full-text available
Given the recent advances in depth prediction from Convolutional Neural Networks (CNNs), this paper investigates how predicted depth maps from a deep neural network can be deployed for accurate and dense monocular reconstruction. We propose a method where CNN-predicted dense depth maps are naturally fused together with depth measurements obtained from direct monocular SLAM. Our fusion scheme privileges depth prediction in image locations where monocular SLAM approaches tend to fail, e.g. along low-textured regions, and vice versa. We demonstrate the use of depth prediction for estimating the absolute scale of the reconstruction, hence overcoming one of the major limitations of monocular SLAM. Finally, we propose a framework to efficiently fuse semantic labels, obtained from a single frame, with dense SLAM, yielding semantically coherent scene reconstruction from a single view. Evaluation results on two benchmark datasets show the robustness and accuracy of our approach.
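The fusion scheme described above can be illustrated by a simple per-pixel blend that trusts monocular SLAM depth where image texture is strong and falls back to the CNN prediction elsewhere; the linear weighting and the texture threshold below are illustrative assumptions rather than the paper's exact rule:

```python
import numpy as np

def fuse_depths(d_slam, d_cnn, texture, t_min=0.05):
    """Blend SLAM and CNN depth maps per pixel: trust SLAM where local
    image gradients (texture) are strong, use the CNN prediction in
    low-textured regions or where SLAM produced no valid depth (NaN)."""
    w = np.clip(texture / t_min, 0.0, 1.0)  # 0 = textureless, 1 = textured
    valid = np.isfinite(d_slam)
    return np.where(valid, w * d_slam + (1 - w) * d_cnn, d_cnn)
```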
Conference Paper
Full-text available
This paper addresses the problem of depth estimation from a single still image. Inspired by recent works on multi-scale convolutional neural networks (CNN), we propose a deep model which fuses complementary information derived from multiple CNN side outputs. Different from previous methods, the integration is obtained by means of continuous Conditional Random Fields (CRFs). In particular, we propose two different variations, one based on a cascade of multiple CRFs, the other on a unified graphical model. By designing a novel CNN implementation of mean-field updates for continuous CRFs, we show that both proposed models can be regarded as sequential deep networks and that training can be performed end-to-end. Through extensive experimental evaluation we demonstrate the effectiveness of the proposed approach and establish new state of the art results on publicly available datasets.
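For intuition about mean-field updates in a continuous CRF: with quadratic unary terms tying each depth to a CNN side output z_i and quadratic pairwise terms w_ij (d_i - d_j)^2, the update has a closed form. The dense-matrix sketch below is a simplified assumption of that scheme, not the paper's CNN implementation:

```python
import numpy as np

def mean_field_update(mu, unary, weights, n_iters=5):
    """Mean-field iterations for a Gaussian (continuous) CRF over depths:
    unary[i] is the side-output depth z_i, weights[i, j] >= 0 couples
    pixels i and j, and mu is refined toward the fixed point
        mu_i = (z_i + sum_j w_ij * mu_j) / (1 + sum_j w_ij).
    A small dense sketch; real implementations use sparse neighborhoods."""
    w_sum = weights.sum(axis=1)
    for _ in range(n_iters):
        mu = (unary + weights @ mu) / (1.0 + w_sum)
    return mu
```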
Conference Paper
Learning based methods have shown very promising results for the task of depth estimation in single images. However, most existing approaches treat depth prediction as a supervised regression problem and as a result, require vast quantities of corresponding ground truth depth data for training. Just recording quality depth data in a range of environments is a challenging problem. In this paper, we innovate beyond existing approaches, replacing the use of explicit depth data during training with easier-to-obtain binocular stereo footage. We propose a novel training objective that enables our convolutional neural network to learn to perform single image depth estimation, despite the absence of ground truth depth data. By exploiting epipolar geometry constraints, we generate disparity images by training our networks with an image reconstruction loss. We show that solving for image reconstruction alone results in poor quality depth images. To overcome this problem, we propose a novel training loss that enforces consistency between the disparities produced relative to both the left and right images, leading to improved performance and robustness compared to existing approaches. Our method produces state of the art results for monocular depth estimation on the KITTI driving dataset, even outperforming supervised methods that have been trained with ground truth depth.
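The left-right consistency idea can be sketched as follows: the left disparity map should agree with the right disparity map warped into the left view using those same disparities. This simplified PyTorch version (the sign convention and the grid normalization are assumptions) conveys the core of the loss:

```python
import torch
import torch.nn.functional as F

def lr_consistency_loss(disp_l, disp_r):
    """Left-right disparity consistency: sample the right disparity map at
    locations shifted by the left disparities and penalize disagreement
    with the left disparity map. Disparities are in pixels; tensors are
    (B, 1, H, W). A simplified sketch of the idea."""
    b, _, h, w = disp_l.shape
    xs = torch.linspace(-1, 1, w, device=disp_l.device).view(1, 1, w).expand(b, h, w)
    ys = torch.linspace(-1, 1, h, device=disp_l.device).view(1, h, 1).expand(b, h, w)
    # Shift x sampling locations by the predicted disparity (normalized).
    x_shifted = xs - 2 * disp_l.squeeze(1) / w
    grid = torch.stack([x_shifted, ys], dim=3)
    disp_r_to_l = F.grid_sample(disp_r, grid, align_corners=True)
    return (disp_l - disp_r_to_l).abs().mean()
```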
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
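The architecture described here (five convolutional layers plus three fully connected layers, about 60 million parameters) survives today as torchvision's AlexNet, which differs slightly from the original paper's model; a quick check:

```python
import torch
from torchvision.models import alexnet

model = alexnet(num_classes=1000)   # five conv layers + three FC layers
x = torch.randn(1, 3, 224, 224)     # one ImageNet-sized RGB image
logits = model(x)                   # (1, 1000) class scores
print(sum(p.numel() for p in model.parameters()))  # roughly 61M parameters
```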
Chapter
Minimally invasive surgery (MIS) utilizes camera systems and video displays to allow for complex procedures to be performed through small access points. Most commonly, the term is applied to abdominal surgery, but it is also applicable to other areas such as neuro, vascular, and orthopedic surgery. This chapter focuses on MIS in the abdomen, since most work in three-dimensional (3D) imaging in the operating room has been done in this area.