Conference Paper

U-Net: Convolutional Networks for Biomedical Image Segmentation

Authors:
Olaf Ronneberger, Philipp Fischer, Thomas Brox

Abstract

There is broad consensus that successful training of deep networks requires many thousands of annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .
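The contracting/expanding design summarized in the abstract can be sketched compactly. The block below is a minimal, illustrative PyTorch version with assumed channel widths and padded 3x3 convolutions; the published network instead uses unpadded convolutions and crops the contracting-path feature maps before concatenation, so treat this as a simplified sketch rather than a faithful reimplementation.

```python
# Minimal U-Net sketch (PyTorch). Channel widths and padded convolutions are
# simplifications of the published architecture, which uses unpadded 3x3
# convolutions and crop-and-copy skip connections.
import torch
import torch.nn as nn

def double_conv(c_in, c_out):
    # Two 3x3 conv + ReLU layers, the basic block of both paths.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class MiniUNet(nn.Module):
    def __init__(self, in_ch=1, n_classes=2, widths=(64, 128, 256, 512, 1024)):
        super().__init__()
        self.downs = nn.ModuleList()
        c = in_ch
        for w in widths:                                  # contracting path
            self.downs.append(double_conv(c, w))
            c = w
        self.pool = nn.MaxPool2d(2)
        self.ups, self.dec = nn.ModuleList(), nn.ModuleList()
        rev = widths[::-1]
        for w_hi, w_lo in zip(rev[:-1], rev[1:]):         # expanding path
            self.ups.append(nn.ConvTranspose2d(w_hi, w_lo, 2, stride=2))
            self.dec.append(double_conv(w_lo * 2, w_lo))
        self.head = nn.Conv2d(widths[0], n_classes, 1)    # per-pixel class scores

    def forward(self, x):
        skips = []
        for i, block in enumerate(self.downs):
            x = block(x)
            if i < len(self.downs) - 1:                   # keep features for skips
                skips.append(x)
                x = self.pool(x)
        for up, dec, skip in zip(self.ups, self.dec, reversed(skips)):
            x = dec(torch.cat([skip, up(x)], dim=1))      # upsample, concat, convolve
        return self.head(x)

if __name__ == "__main__":
    logits = MiniUNet()(torch.randn(1, 1, 512, 512))
    print(logits.shape)                                   # torch.Size([1, 2, 512, 512])
```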

... Both the segmentation and the classification use Convolutional Neural Networks (CNNs): the segmentation task uses U-Net (Ronneberger et al., 2015), and the classification is based on ResNet (He et al., 2016) followed by a Multi-Layer Perceptron (MLP; Alpaydin, 2014). Interestingly, our work demonstrates that ensembles of classifiers are effective in differentiating reliable from probably incorrect classifications, based on the consistency of classifier predictions in the ensemble. ...
... The first block, dealing with the sunspot segmentation, is addressed by a CNN in the form of Unet architecture (Ronneberger et al., 2015) composed of an auto-encoder with skip connections. The encoding part of the auto-encoder has 5 levels and 4 downsampling steps, and reversely the decoding part has 5 levels and 4 upsampling steps. ...
... The U-Net architecture (Ronneberger et al., 2015) is adopted for our CNN segmentation model since it has proved its accuracy in delineating high-resolution segmentation masks in various fields such as medical imaging (Isensee et al., 2021; Ronneberger et al., 2015) and natural image analysis (Siam et al., 2018; Sugirtha & Sridevi, 2022). ...
Article
Full-text available
We propose a fully automated system to detect, aggregate, and classify sunspot groups according to the McIntosh scheme using ground‐based white light (WL) observations from the USET facility located at the Royal Observatory of Belgium. The sunspot detection uses a Convolutional Neural Network (CNN), trained from segmentation maps obtained with an unsupervised method based on mathematical morphology and image thresholding. Given the sunspot mask, a mean‐shift algorithm is used to aggregate individual sunspots into sunspot groups. This algorithm accounts for the area of each sunspot as well as for prior knowledge regarding the shape of sunspot group. A sunspot group, defined by its bounding box and location on the Sun, is finally fed into a CNN multitask classifier. The latter predicts the three components Z, p, and c in the McIntosh classification scheme. The tasks are organized hierarchically to mimic the dependency of the second (p) and third (c) components on the first (Z). The resulting CNN‐based segmentation is more accurate than classical unsupervised methods, with an enhancement up to 16% of F1 score in detection of the smallest sunspots, and it is robust to the presence of clouds. The automated clustering method was able to separate groups with an accuracy of 80%, when compared to hand‐made USET sunspot group catalog. The CNN‐based sunspot classifier shows comparable performances to methods using continuum as well as magnetogram images recorded by instruments on space mission. We also show that an ensemble of classifiers allows differentiating reliable and potentially incorrect predictions.
... The mutual conversion between S-domain and W-domain images is achieved through two generators. Two discriminators are used to discern between generated images and real data. Both generators, G_A and G_B, employ the U-Net [22] architecture with eight convolutional layers for the encoder and decoder, as shown in Figure 4a. The encoder consists of a convolutional layer, six Relu-Conv-LayerNorm (RCL) blocks, and one Relu-Conv (RC) block. ...
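A sketch of the Relu-Conv-LayerNorm (RCL) block and the eight-convolution encoder described in the excerpt is given below. Kernel size, stride, channel widths, and the use of GroupNorm(1, C) as a layer-norm substitute for convolutional feature maps are assumptions; the excerpt does not specify these details.

```python
# Sketch of a Relu-Conv-LayerNorm (RCL) block and an eight-convolution encoder
# following the excerpt. Kernel size, stride, and channel widths are assumptions.
import torch
import torch.nn as nn

class RCLBlock(nn.Module):
    """ReLU -> strided Conv -> layer normalization. GroupNorm(1, C) normalizes
    over channels and spatial positions, acting as a LayerNorm for feature maps."""
    def __init__(self, c_in, c_out, stride=2):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReLU(inplace=True),
            nn.Conv2d(c_in, c_out, kernel_size=4, stride=stride, padding=1),
            nn.GroupNorm(1, c_out),
        )

    def forward(self, x):
        return self.block(x)

def build_encoder(c_in=3, base=64):
    # One plain convolution, six RCL blocks, one ReLU-Conv (RC) block:
    # eight convolutional layers in total, as described in the excerpt.
    layers = [nn.Conv2d(c_in, base, 4, stride=2, padding=1)]
    c = base
    for _ in range(6):
        c_next = min(c * 2, 512)
        layers.append(RCLBlock(c, c_next))
        c = c_next
    layers += [nn.ReLU(inplace=True), nn.Conv2d(c, c, 4, stride=2, padding=1)]
    return nn.Sequential(*layers)

if __name__ == "__main__":
    features = build_encoder()(torch.randn(1, 3, 256, 256))
    print(features.shape)    # bottleneck features after eight stride-2 convolutions
```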
Article
Full-text available
The cornea is an important refractive structure in the human eye. The corneal segmentation technique provides valuable information for clinical diagnoses, such as corneal thickness. Non-contact anterior segment optical coherence tomography (AS-OCT) is a prevalent ophthalmic imaging technique that can visualize the anterior and posterior surfaces of the cornea. Nonetheless, during the imaging process, saturation artifacts are commonly generated due to the tangent of the corneal surface at that point, which is normal to the incident light source. This stripe-shaped saturation artifact covers the corneal surface, causing blurring of the corneal edge, reducing the accuracy of corneal segmentation. To settle this matter, an inpainting method that introduces structural similarity and frequency loss is proposed to remove the saturation artifact in AS-OCT images. Specifically, the structural similarity loss reconstructs the corneal structure and restores corneal textural details. The frequency loss combines the spatial domain with the frequency domain to ensure the overall consistency of the image in both domains. Furthermore, the performance of the proposed method in corneal segmentation tasks is evaluated, and the results indicate a significant benefit for subsequent clinical analysis.
... To determine the effect produced by this feedback mechanism in the network training and the overall generalizability of the network, we use an architecture that does not utilize this feature for comparison. As the reference method, we use the U-Net architecture, which is one of the most common CNN-based neural networks used in medical image segmentation [23] and denoising [24]. U-Net's deep architecture and skip connections enable it to learn complex correlations in images. ...
... The U-Net framework was selected as the first approach for noise reduction [37]. U-Net is a commonly used layered encoder-decoder neural network for segmentation [23] and denoising tasks [24]. In the U-Net, the information propagates through the encoder and decoder layers, allowing the encoder to store important information from the image while the decoder can effectively use this information to remove noise. ...
... In the U-Net, the information propagates through the encoder and decoder layers, allowing the encoder to store important information from the image while the decoder can effectively use this information to remove noise. In addition, with the help of skip connections between the encoder and decoder layers, more spatial information can be preserved, which may help the recovery of finer details during noise removal [23]. ...
Article
Full-text available
Purpose: The use of iterative and deep learning reconstruction methods, which would allow effective noise reduction, is limited in cone-beam computed tomography (CBCT). As a consequence, the visibility of soft tissues is limited with CBCT. The study aimed to improve this issue through time-efficient deep learning enhancement (DLE) methods. Methods: Two DLE networks, UNIT and U-Net, were trained with simulated CBCT data. The performance of the networks was tested with three different test data sets. The quantitative evaluation measured the structural similarity index measure (SSIM) and the peak signal-to-noise ratio (PSNR) of the DLE reconstructions with respect to the ground truth iterative reconstruction method. In the second assessment, a dentomaxillofacial radiologist assessed the resolution of hard tissue structures, visibility of soft tissues, and overall image quality of real patient data using the Likert scale. Finally, the technical image quality was determined using modulation transfer function, noise power spectrum, and noise magnitude analyses. Results: The study demonstrated that deep learning CBCT denoising is feasible and time efficient. The DLE methods, trained with simulated CBCT data, generalized well, and DLE provided quantitatively (SSIM/PSNR) and visually similar noise-reduction as conventional IR, but with faster processing time. The DLE methods improved soft tissue visibility compared to the conventional Feldkamp-Davis-Kress (FDK) algorithm through noise reduction. However, in hard tissue quantification tasks, the radiologist preferred the FDK over the DLE methods. Conclusion: Post-reconstruction DLE allowed feasible reconstruction times while yielding improvements in soft tissue visibility in each dataset.
... Recent works in IR use deep learning to segment and localise the pupil and the iris in a periocular image [11], [31]- [33]. The Criss-Cross Attention Network (CCNet), developed by Mishra et al. [33], is an iris semantic segmentation network based on U-Net [32]. This lightweight network was trained by Fang et al. [24] to predict a binary mask of the iris from NIR images. ...
... It was developed as a reduced version of U-Net [32], called UNet_xxs, for the purpose of iris semantic segmentation under limited computational power. Iteratively, we removed layers from CCNet's version of U-Net [24], [33], then trained and tested the model, until a small network with acceptable performance was obtained. ...
Article
Full-text available
Iris Recognition (IR) is one of the market’s most reliable and accurate biometric systems. Today, it is challenging to build NIR-capturing devices while keeping hardware costs low. Commercial NIR sensors are protected from modification. The process of building a new device is not trivial because it is required to start from scratch: capturing images with sufficient quality, calibrating operational distances, and building lightweight software such as eye/iris detectors and segmentation sub-systems. In light of such challenges, this work aims to develop and implement iris recognition software in an embedded system and calibrate NIR in a contactless binocular setup. We evaluate and contrast speed versus performance obtained with two embedded computers and infrared cameras. Further, a lightweight segmenter sub-system called “Unet_xxs” is proposed, which can be used for iris semantic segmentation under restricted memory resources. The evaluations reveal that Unet_xxs reduces the number of parameters by 77% and doubles the speed of state-of-the-art segmentation models, with an EER drop smaller than 1% in Iris Recognition, at 8.06 fps and an IoU of 0.8382.
... Image Semantic Segmentation: In recent years, with the rapid development of deep learning technology and its wide application in various fields, image semantic segmentation based on deep learning (ISSDL) has also received a lot of attention from researchers across the world [13][14][15][16][17]. ISSDL can be divided into two major categories [18][19][20]: region-based ISS (ISSR) and pixel-based ISS (ISSP). ...
... The encoder-decoder architecture has also been proposed to address the balance between performance and efficiency in semantic segmentation, as well as the high computational complexity and memory consumption incurred on high-resolution feature maps. Ronneberger et al. [14] proposed the classical U-Net network for semantic segmentation of biomedical images, which uses downsampling operations to gradually reduce the resolution of the feature maps in the encoding phase, and upsampling operations to gradually restore image detail and resolution in the decoding phase. U-Net++ [9] was further developed as a more powerful architecture for medical image segmentation. ...
Article
Full-text available
Achieving accurate segmentation of brain tumors in Magnetic Resonance Imaging (MRI) is important for clinical diagnosis and accurate treatment, and the efficient extraction and analysis of MRI multimodal feature information is the key to achieving accurate segmentation. In this paper, we propose a multimodal information fusion method for brain tumor segmentation, aimed at achieving full utilization of multimodal information for accurate segmentation in MRI. In our method, the semantic information processing module (SIPM) and Multimodal Feature Reasoning Module (MFRM) are included: (1) SIPM is introduced to achieve free multiscale feature enhancement and extraction; (2) MFRM is constructed to process both the backbone network feature information layer and semantic feature information layer. Using extensive experiments, the proposed method is validated. The experimental results based on BraTS2018 and BraTS2019 datasets show that the method has unique advantages over existing brain tumor segmentation methods.
... Researchers in this field have conducted in-depth investigations on single-temporal remote sensing images and have achieved remarkable results [6,7]. Wang et al. [8] integrated the Atrous Spatial Pyramid Pooling (ASPP) module, which encodes image-level features, into the U-Net [9] network, significantly improving the segmentation accuracy of multiscale features in the imagery. However, mainstream segmentation networks are unable to fully recover spatial information discarded in the feature extraction stage, which exacerbates the segmentation inaccuracies caused by fuzzy land feature boundaries in the imagery. ...
... To adapt these networks for semantic segmentation, it is necessary to compensate for the loss of spatial resolution caused by down-sampling. Primarily in the decoding phase, this is achieved by integrating mid- to high-resolution feature maps generated at various stages of the encoder using skip connections, as seen in typical networks like FCN [29] and U-Net [9]. FCN, as the first network to perform end-to-end segmentation using fully convolutional layers, revolutionized research in semantic segmentation. ...
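The Atrous Spatial Pyramid Pooling (ASPP) module mentioned in the first excerpt can be sketched as follows. This is a minimal DeepLab-style illustration with commonly used dilation rates, output width, and an image-level pooling branch; these choices are assumptions, not the exact module integrated by Wang et al. [8].

```python
# Minimal DeepLab-style ASPP sketch (PyTorch). Dilation rates, channel width,
# and the image-level pooling branch are common defaults, not the module of [8].
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, c_in, c_out=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(c_in, c_out, 1, bias=False)] +               # 1x1 branch
            [nn.Conv2d(c_in, c_out, 3, padding=r, dilation=r, bias=False)
             for r in rates]                                         # atrous branches
        )
        self.image_pool = nn.Sequential(                             # image-level features
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c_in, c_out, 1, bias=False)
        )
        self.project = nn.Conv2d(c_out * (len(rates) + 2), c_out, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w), mode="bilinear",
                               align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))

if __name__ == "__main__":
    y = ASPP(512)(torch.randn(1, 512, 32, 32))
    print(y.shape)   # torch.Size([1, 256, 32, 32])
```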
Article
Full-text available
Remote sensing image semantic segmentation plays a crucial role in various fields, such as environmental monitoring, urban planning, and agricultural land classification. However, most current research primarily focuses on utilizing the spatial and spectral information of single-temporal remote sensing images, neglecting the valuable temporal information present in historical image sequences. In fact, historical images often contain valuable phenological variations in land features, which exhibit diverse patterns and can significantly benefit from semantic segmentation tasks. This paper introduces a semantic segmentation framework for satellite image time series (SITS) based on dilated convolution and a Transformer encoder. The framework includes spatial encoding and temporal encoding. Spatial encoding, utilizing dilated convolutions exclusively, mitigates the loss of spatial accuracy and the need for up-sampling, while allowing for the extraction of rich multi-scale features through a combination of different dilation rates and dense connections. Temporal encoding leverages a Transformer encoder to extract temporal features for each pixel in the image. To better capture the annual periodic patterns of phenological phenomena in land features, position encoding is calculated based on the image’s acquisition date within the year. To assess the performance of this framework, comparative and ablation experiments were conducted using the PASTIS dataset. The experiments indicate that this framework achieves highly competitive performance with relatively low optimization parameters, resulting in an improvement of 8 percentage points in the mean Intersection over Union (mIoU).
... V-Net can be used for 3D medical image segmentation. U-Net relies on strong data augmentation, so it can learn effectively from very few labelled medical images [8]. ...
... This paper uses U-Net and its variant U-Net++ to segment the pectoral muscle, in order to improve the accuracy of pectoral muscle segmentation in CT images. U-Net is a type of convolutional neural network composed of an encoder and a decoder [8]. The encoder extracts features and reduces the spatial resolution of the input image by means of convolution and pooling, while the decoder restores resolution by upsampling the low-resolution (shallow) feature maps. ...
Chapter
Chronic obstructive pulmonary disease (COPD) is a common respiratory disease that seriously endangers human health and is also one of the important causes of death. The death rate of COPD in China is the highest in the world, and under-diagnosis of the disease is a serious problem. The gold standard for the diagnosis of COPD is lung function examination, and clinical studies have shown that CT and other imaging methods can be included in the auxiliary diagnosis of COPD. CT images can be used to assess pectoral muscle area, which is associated with COPD severity. Patients with lower pectoral muscle area often have more severe expiratory airflow obstruction and other problems. Therefore, a key step in this research is to accurately segment the pectoral muscle in CT images. Deep-learning-based medical image segmentation methods can extract richer information from the data, so they have gradually become the preferred approach for medical image segmentation. In this paper, a pectoral muscle segmentation algorithm based on U-Net and its variant U-Net++ is proposed, which is of great significance for evaluating the severity of disease in patients with COPD. The network is composed of symmetrical encoders and decoders, which can effectively learn from very little labelled data by using appropriate data augmentation methods, and is therefore very suitable for medical image segmentation. The experimental results on the data set provided by Jiangsu Province Hospital show that the average Dice coefficient of the proposed algorithm is more than 94% and the average accuracy rate is 91%. The algorithm can accurately segment the pectoral muscle in CT images and has good segmentation performance.
... Res-UNet uses a UNet encoder-decoder backbone, in combination with residual connections, atrous convolutions, pyramid scene parsing pooling, and multi-tasking inference [19]. To achieve consistent training as the depth of the network increases, the building blocks of the UNet architecture were replaced with modified residual blocks of convolutional layers [20]. For better understanding across scales, multiple parallel atrous convolutions with different dilation rates are employed within each residual building block. ...
Article
Full-text available
Objectives The aim of this study was to investigate the generalization performance of deep learning segmentation models on a large cohort intravascular ultrasound (IVUS) image dataset over the lumen and external elastic membrane (EEM), and to assess the consistency and accuracy of automated IVUS quantitative measurement parameters. Methods A total of 11,070 IVUS images from 113 patients and pullbacks were collected and annotated by cardiologists to train and test deep learning segmentation models. A comparison of five state of the art medical image segmentation models was performed by evaluating the segmentation of the lumen and EEM. Dice similarity coefficient (DSC), intersection over union (IoU) and Hausdorff distance (HD) were calculated for the overall and for subsets of different IVUS image categories. Further, the agreement between the IVUS quantitative measurement parameters calculated by automatic segmentation and those calculated by manual segmentation was evaluated. Finally, the segmentation performance of our model was also compared with previous studies. Results CENet achieved the best performance in DSC (0.958 for lumen, 0.921 for EEM) and IoU (0.975 for lumen, 0.951 for EEM) among all models, while Res-UNet was the best performer in HD (0.219 for lumen, 0.178 for EEM). The mean intraclass correlation coefficient (ICC) and Bland–Altman plot demonstrated the extremely strong agreement (0.855, 95% CI 0.822–0.887) between model's automatic prediction and manual measurements. Conclusions Deep learning models based on large cohort image datasets were capable of achieving state of the art (SOTA) results in lumen and EEM segmentation. It can be used for IVUS clinical evaluation and achieve excellent agreement with clinicians on quantitative parameter measurements.
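The Dice similarity coefficient (DSC) and intersection over union (IoU) reported above, and in several other studies on this page, can be computed from binary masks as in the short sketch below; the smoothing constant eps is an assumption added to keep empty masks well defined.

```python
# Dice similarity coefficient (DSC) and intersection-over-union (IoU) for binary
# masks. `eps` is a small constant to avoid division by zero on empty masks.
import numpy as np

def dice_and_iou(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7):
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dice = (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
    iou = (intersection + eps) / (union + eps)
    return dice, iou

if __name__ == "__main__":
    a = np.zeros((64, 64), dtype=bool); a[16:48, 16:48] = True
    b = np.zeros((64, 64), dtype=bool); b[20:52, 20:52] = True
    print(dice_and_iou(a, b))   # overlapping squares: dice ~ 0.77, iou ~ 0.62
```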
... The experimental setup used in the vision-based experiments (Fig. 8) is briefly described as follows, with additional details being provided in our previous study [30]. First, a DNN (a multi-task UNet [30], [31]) with 25.50 million parameters and an input image size of 228 × 228 pixels was used in this study for environment perception. The adopted DNN can extract features of RGB images and then perform heading angle regression, road type classification, lane line segmentation, and traffic object detection simultaneously for the ego vehicle at a speed of approximately 40 FPS (frames per second). ...
Preprint
Full-text available
The accurate prediction of smooth steering inputs is crucial for autonomous vehicle applications because control actions with jitter might cause the vehicle system to become unstable. To address this problem in automobile lane-keeping control without the use of additional smoothing algorithms, we developed a soft-constrained iterative linear-quadratic regulator (soft-CILQR) algorithm by integrating CILQR algorithm and a model predictive control (MPC) constraint relaxation method. We incorporated slack variables into the state and control barrier functions of the soft-CILQR solver to soften the constraints in the optimization process so that stabilizing control inputs can be calculated in a relatively simple manner. Two types of automotive lane-keeping experiments were conducted with a linear system dynamics model to test the performance of the proposed soft-CILQR algorithm and to compare its performance with that of the CILQR algorithm: numerical simulations and experiments involving challenging vision-based maneuvers. In the numerical simulations, the soft-CILQR and CILQR solvers managed to drive the system toward the reference state asymptotically; however, the soft-CILQR solver obtained smooth steering input trajectories more easily than did the CILQR solver under conditions involving additive disturbances. In the experiments with visual inputs, the soft-CILQR controller outperformed the CILQR controller in terms of tracking accuracy and steering smoothness during the driving of an ego vehicle on TORCS.
... The output is then fed into a U-Net for classification of the four classes of segmentation. U-Net [35] is used as a transfer learning paradigm for classification. We freeze all the layers of U-Net and use the 4-class softmax output. ...
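The transfer-learning pattern described in the excerpt, freezing every layer of a pretrained segmentation network and reusing its softmax output, can be sketched as follows. The backbone used here (torchvision's FCN-ResNet50) is only a stand-in for the U-Net of [35], and the four-class head and input size are assumptions.

```python
# Sketch of the transfer-learning pattern from the excerpt: freeze a pretrained
# segmentation network and reuse its softmax output. The backbone (torchvision's
# FCN-ResNet50) is a stand-in, not the model used in [35].
import torch
import torch.nn as nn
from torchvision.models.segmentation import fcn_resnet50

model = fcn_resnet50(num_classes=4)          # pretrained weights would be loaded here
for p in model.parameters():                 # freeze every layer of the backbone
    p.requires_grad = False
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))["out"]   # [1, 4, 224, 224]
    probs = nn.functional.softmax(logits, dim=1)          # per-pixel 4-class softmax
print(probs.shape)
```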
Article
Full-text available
The automatic segmentation of brain tumours is a critical task in patient disease management. It can help specialists easily identify the location, size, and type of tumour to make the best decisions regarding the patients' treatment process. Recently, deep learning methods with attention mechanism helped increase the performance of segmentation models. The proposed method consists of two main parts: the first part leverages a deep neural network architecture for biggest tumour detection (BTD) and in the second part, ResNet152V2 makes it possible to segment the image with the attention block and the extraction of local and global features. The custom attention block is used to consider the most important parts in the slices, emphasizing on related information for segmentation. The results show that the proposed method achieves average Dice scores of 0.81, 0.87 and 0.91 for enhancing core, tumour core and whole tumour on BraTS2020 dataset, respectively. Compared with other segmentation approaches, this method achieves better performance on tumour core and whole tumour. Further comparisons on BraTS2018 and BraTS2017 validation datasets show that this method outperforms other models based on Dice score and Hausdorff criterion.
... Convolutional neural networks have recently gained popularity in automated medical image segmentation. The researchers, inspired by the traditional U-Net architecture [11], which works well for binary segmentation, are continuing to improve it to achieve higher OD and OC segmentation accuracy (Table 2). Here, ROI extraction accompanied by resizing is one of the main steps in image preprocessing. ...
Article
Full-text available
The pathological changes in the eye fundus image, especially around the Optic Disc (OD) and Optic Cup (OC), may indicate eye diseases such as glaucoma. Therefore, accurate OD and OC segmentation is essential. The variety in images caused by different eye fundus cameras increases the complexity of OD and OC segmentation for existing deep learning (DL) networks. In most research cases, experiments were conducted on individual data sets only and the results were obtained for that specific data sample. Our future goal is to develop a DL method that segments OD and OC in any kind of eye fundus image, but the application of the mixed training data strategy is in the initiation stage and the image preprocessing is not discussed. Therefore, the aim of this paper is to evaluate the image preprocessing impact on OD and OC segmentation in different eye fundus images aligned by size. We adopted a mixed training data strategy by combining images of the DRISHTI-GS, REFUGE, and RIM-ONE datasets, and applied image resizing incorporating various interpolation methods, namely bilinear, nearest neighbor, and bicubic, for image resolution alignment. The impact of image preprocessing on OD and OC segmentation was evaluated using three convolutional neural networks: Attention U-Net, Residual Attention U-Net (RAUNET), and U-Net++. The experimental results show that the most accurate segmentation is achieved by resizing images to a size of 512 x 512 px and applying bicubic interpolation. The highest Dice of 0.979 for OD and 0.877 for OC are achieved on the DRISHTI-GS test dataset, 0.973 for OD and 0.874 for OC on the REFUGE test dataset, and 0.977 for OD and 0.855 for OC on the RIM-ONE test dataset. ANOVA and Levene’s tests with statistically significant evidence at α = 0.05 show that the chosen size in image resizing has an impact on the OD and OC segmentation results, while the interpolation method influences only OC segmentation.
... Spatial information is preserved through identity connections at each scale before upscaling/downscaling operations. We implemented a U-Net-based architecture [46] with ResNet-34 as our encoder and added a self-attention layer [47], similar to [48], to preserve the global dependencies during the reconstruction of the target image. The loss function of the generator was the mean absolute error (L1). ...
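The self-attention layer added to the U-Net generator in the excerpt is commonly a SAGAN-style 2D attention block; a minimal sketch, together with the L1 objective mentioned in the excerpt, is shown below. The toy generator and channel sizes are placeholders, not the ResNet-34-encoder U-Net of [46]-[48].

```python
# Sketch of a SAGAN-style 2D self-attention layer plus an L1 training step.
# The stand-in generator and channel sizes are assumptions, not the model of [46]-[48].
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.q = nn.Conv2d(c, c // 8, 1)
        self.k = nn.Conv2d(c, c // 8, 1)
        self.v = nn.Conv2d(c, c, 1)
        self.gamma = nn.Parameter(torch.zeros(1))    # learned residual weight

    def forward(self, x):
        n, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)     # [N, HW, C/8]
        k = self.k(x).flatten(2)                     # [N, C/8, HW]
        v = self.v(x).flatten(2)                     # [N, C, HW]
        attn = F.softmax(q @ k, dim=-1)              # [N, HW, HW] global dependencies
        out = (v @ attn.transpose(1, 2)).view(n, c, h, w)
        return self.gamma * out + x                  # residual connection

# Toy encoder -> self-attention -> decoder generator (not ResNet-34 based).
generator = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
    SelfAttention2d(32),
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
)

x, target = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
loss = F.l1_loss(generator(x), target)               # L1 loss as in the excerpt
loss.backward()
print(float(loss))
```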
Article
Full-text available
Satellite sensors like Landsat 8 OLI (L8) and Sentinel-2 MSI (S2) provide valuable multispectral Earth observations that differ in spatial resolution and spectral bands, limiting synergistic use. L8 has a 30 m resolution and a lower revisit frequency, while S2 offers up to a 10 m resolution and more spectral bands, such as red edge bands. Translating observations from L8 to S2 can increase data availability by combining their images to leverage the unique strengths of each product. In this study, a conditional generative adversarial network (CGAN) is developed to perform sensor-specific domain translation focused on green, near-infrared (NIR), and red edge bands. The models were trained on the pairs of co-located L8-S2 imagery from multiple locations. The CGAN aims to downscale 30 m L8 bands to 10 m S2-like green and 20 m S2-like NIR and red edge bands. Two translation methodologies are employed—direct single-step translation from L8 to S2 and indirect multistep translation. The direct approach involves predicting the S2-like bands in a single step from L8 bands. The multistep approach uses two steps—the initial model predicts the corresponding S2-like band that is available in L8, and then the final model predicts the unavailable S2-like red edge bands from the S2-like band predicted in the first step. Quantitative evaluation reveals that both approaches result in lower spectral distortion and higher spatial correlation compared to native L8 bands. Qualitative analysis supports the superior fidelity and robustness achieved through multistep translation. By translating L8 bands to higher spatial and spectral S2-like imagery, this work increases data availability for improved earth monitoring. The results validate CGANs for cross-sensor domain adaptation and provide a reusable computational framework for satellite image translation.
... Recently, numerous FCN-based models have been designed to accomplish semantic segmentation. For instance, [13] proposed a symmetrical encoder-decoder structure based on FCN, termed U-net, to reconstruct segmentation step by step. Yu and Koltun [14] proposed the dilated convolution to enlarge the receptive field without resolution loss, thereby improving segmentation by using contextual information. ...
Article
Full-text available
Current studies in few-shot semantic segmentation mostly utilize meta-learning frameworks to obtain models that can be generalized to new categories. However, these models trained on base classes with sufficient annotated samples are biased towards these base classes, which results in semantic confusion and ambiguity between base classes and new classes. A strategy is to use an additional base learner to recognize the objects of base classes and then refine the prediction results output by the meta learner. In this way, the interaction between these two learners and the way of combining results from the two learners are important. This paper proposes a new model, namely Distilling Base and Meta (DBAM) network by using self-attention mechanism and contrastive learning to enhance the few-shot segmentation performance. First, the self-attention-based ensemble module (SEM) is proposed to produce a more accurate adjustment factor for improving the fusion of two predictions of the two learners. Second, the prototype feature optimization module (PFOM) is proposed to provide an interaction between the two learners, which enhances the ability to distinguish the base classes from the target class by introducing contrastive learning loss. Extensive experiments have demonstrated that our method improves on the PASCAL-5 i under 1-shot and 5-shot settings, respectively.
... As shown in Figure 4, each model will fuse adjacent data and upsample data from the lower left model. Every two encoding models plus a decoding model can be considered as a small UNet network [36]. In addition, skip connections are used in the network when exceeding two decoding models to connect coarse-grained and fine-grained information, which can help the network model learn more useful knowledge. ...
Article
Full-text available
The automatic detection of defects (cortical fibers) in pickled mustard tubers (Chinese Zhacai) remains a challenge. Moreover, few papers have discussed detection based on the segmentation of the physical characteristics of this food. In this study, we designate cortical fibers in pickled mustard as the target class, while considering the background and the edible portion of pickled mustard as other classes. We attempt to realize an automatic defect-detection system to accurately and rapidly detect cortical fibers in pickled mustard based on multiple images combined with a UNet4+ segmentation model. A multispectral sensor (MS) covering nine wavebands with a resolution of 870 × 750 pixels and an imaging speed over two frames per second and a high-definition (HD), 4096 × 3000 pixel resolution imaging system were applied to obtain MS and HD images of 200 pickled mustard tuber samples. An improved imaging fusion method was applied to fuse the MS with HD images. After image fusion and other preprocessing methods, each image contained a target; 150 images were randomly selected as the training data and 50 images as the test data. Furthermore, a segmentation model called UNet4+ was developed to detect the cortical fibers in the pickled mustard tubers. Finally, the UNet4+ model was tested on three types of datasets (MS, HD, and fusion images), and the detection results were compared based on Recall, Precision, and Dice values. Our study indicates that the model can successfully detect cortical fibers within about a 30 ± 3 ms timeframe for each type of image. Among the three types of images, the fusion images achieved the highest mean average Dice value of 73.91% for the cortical fibers. At the same time, we compared the UNet4+ model with the UNet++ and UNet3+ models using the same fusion data; the results show that our model achieved better prediction performance for the Dice values, i.e., 9.72% and 27.41% higher than those of the UNet++ and UNet3+ models, respectively.
... In this paper, a U-Net network topology [18] is used for the wall segmentation part. The U-shaped connection of this topology offers unique benefits compared to other frameworks. ...
Article
Full-text available
Recognition and extraction of elements from house plans present significant challenges in the construction, decoration and interior design industries. To address this issue, this paper proposes a wall segmentation system for house plans that integrates deep learning and traditional methods. The system comprises several components, such as image preprocessing, main region extraction, wall segmentation and optimisation of wall smoothing. The study combined the rapidity of the traditional method with the robustness of deep learning to enable the extraction of walls from varied image styles and perform smoothing optimisation. The paper demonstrates that the proposed segmentation technique delivers an 89% mean intersection over union, a 94% detection rate and a 96% recognition accuracy. The research surpasses current findings in the same field. Additionally, when combined with the current house map dataset, the system presents a semantic categorisation dataset featuring 6000 images depicting a range of styles, in addition to a recognition dataset including 4000 images.
... We proposed LFMA-Net to predict the class of points using the U-shaped network framework proposed in Ref. [24]. This section demonstrates our network's details regarding the feature fusion module and MAP module. ...
Article
Full-text available
Semantic segmentation from a three‐dimensional point cloud is vital in autonomous driving, computer vision, and augmented reality. However, current semantic segmentation does not effectively use the point cloud's local geometric features and contextual information, which are essential for improving segmentation accuracy. A semantic segmentation network that uses local feature fusion and a multilayer attention mechanism is proposed to address these challenges. Specifically, the authors designed a local feature fusion module to encode the geometric and feature information separately, which fully leverages the point cloud's feature perception and geometric structure representation. Furthermore, the authors designed a multilayer attention pooling module consisting of local attention pooling and cascade attention pooling to extract contextual information. Local attention pooling is used to learn local neighbourhood information, and cascade attention pooling captures contextual information from deeper local neighbourhoods. Finally, an enhanced feature representation of important information is obtained by aggregating the features from the two deep attention pooling methods. Extensive experiments on the large‐scale point‐cloud datasets Stanford 3D large‐scale indoor spaces and SemanticKITTI indicate that the authors' network shows clear advantages over existing representative methods regarding local geometric feature description and global contextual relationships.
Poster
Full-text available
Phytoliths constitute microscopic plant biominerals of high importance to geosciences and archaeology. Despite the valuable advances in phytolith analysis, typical phytolith classification is performed manually, which is usually time-consuming and may inherit human observer biases. Thus, an emerging challenge is the automatic classification of phytoliths, which may enhance data homogeneity among researchers and facilitate reliable comparisons. The application of deep learning (DL) algorithms to phytolith analysis offers an opportunity to classify morphotypes with a higher unbiased precision, continuous refinement as more data become available, and even the potential to reveal inherent group dynamics. Herein, we implement a “fully convolutional network” (FCN) architecture to classify phytoliths extracted from wheats (Triticum spp.) using the dry method. Photomicrographs of phytoliths are acquired using optical microscopy, and morphotypes, morphologically unaltered at the highest possible level, are identified based on the standard literature. The photomicrographs are further manually annotated forming four classes of morphotypes linked to different anatomical plant parts (i.e. leaves, stem, and inflorescence). The resulting annotated dataset includes the classes of (a) Stoma, (b) Rondel, (c) Papillate, and (d) Elongate dendritic and is allocated to training, validation and testing data groups, feeding a U-Net neural network. The performance of the developed network architecture is assessed during training, by calculating the area of overlap between the predicted segmentation and the “ground truth”, in order to overcome the potential issue of unbalanced distribution of the classes. The results demonstrate that the model classifies and localizes the above classes of morphotypes in the predicted images with satisfactory accuracy. Although additional training samples and plant species datasets are required to optimise the results, the present dataset extracted from modern plant material is promising for building up the capacity of phytolith classification within unfamiliar datasets from natural sediments and archaeological contexts.
Article
Full-text available
Protein misfolding and aggregation play central roles in the pathogenesis of various neurodegenerative diseases (NDDs), including Huntington’s disease, which is caused by a genetic mutation in exon 1 of the Huntingtin protein (Httex1). The fluorescent labels commonly used to visualize and monitor the dynamics of protein expression have been shown to alter the biophysical properties of proteins and the final ultrastructure, composition, and toxic properties of the formed aggregates. To overcome this limitation, we present a method for label-free identification of NDD-associated aggregates (LINA). Our approach utilizes deep learning to detect unlabeled and unaltered Httex1 aggregates in living cells from transmitted-light images, without the need for fluorescent labeling. Our models are robust across imaging conditions and on aggregates formed by different constructs of Httex1. LINA enables the dynamic identification of label-free aggregates and measurement of their dry mass and area changes during their growth process, offering high speed, specificity, and simplicity to analyze protein aggregation dynamics and obtain high-fidelity information.
Conference Paper
The presented work proposes an effective approach for extracting abstract characteristics from image data using autoencoder-based models. Since simple autoencoders do not deliver the desired result in building a feature map between data samples, variations and domain-specific adjustments might improve performance. To provide the model with more informative and representative samples, an augmentation technique with position and size invariance has been applied to a small subset. To evaluate the efficiency, we employ simple autoencoder and U-Net models that take both data features and their relationships into consideration. The suggested autoencoder models are assessed on a collection of benchmark datasets, and the experimental findings demonstrate that, in comparison to other autoencoder variants, taking data relationships into account can lead to more robust features that achieve lower reconstruction loss and, in turn, a lower error rate in subsequent classification.
Article
Background High‐resolution magnetic resonance imaging (MRI) with excellent soft‐tissue contrast is a valuable tool utilized for diagnosis and prognosis. However, MRI sequences with long acquisition time are susceptible to motion artifacts, which can adversely affect the accuracy of post‐processing algorithms. Purpose This study proposes a novel retrospective motion correction method named “motion artifact reduction using conditional diffusion probabilistic model” (MAR‐CDPM). The MAR‐CDPM aimed to remove motion artifacts from multicenter three‐dimensional contrast‐enhanced T1 magnetization‐prepared rapid acquisition gradient echo (3D ceT1 MPRAGE) brain dataset with different brain tumor types. Materials and methods This study employed two publicly accessible MRI datasets: one containing 3D ceT1 MPRAGE and 2D T2‐fluid attenuated inversion recovery (FLAIR) images from 230 patients with diverse brain tumors, and the other comprising 3D T1‐weighted (T1W) MRI images of 148 healthy volunteers, which included real motion artifacts. The former was used to train and evaluate the model using the in silico data, and the latter was used to evaluate the model performance to remove real motion artifacts. A motion simulation was performed in k ‐space domain to generate an in silico dataset with minor, moderate, and heavy distortion levels. The diffusion process of the MAR‐CDPM was then implemented in k ‐space to convert structure data into Gaussian noise by gradually increasing motion artifact levels. A conditional network with a Unet backbone was trained to reverse the diffusion process to convert the distorted images to structured data. The MAR‐CDPM was trained in two scenarios: one conditioning on the time step t of the diffusion process, and the other conditioning on both t and T2‐FLAIR images. The MAR‐CDPM was quantitatively and qualitatively compared with supervised Unet, Unet conditioned on T2‐FLAIR, CycleGAN, Pix2pix, and Pix2pix conditioned on T2‐FLAIR models. To quantify the spatial distortions and the level of remaining motion artifacts after applying the models, quantitative metrics were reported including normalized mean squared error (NMSE), structural similarity index (SSIM), multiscale structural similarity index (MS‐SSIM), peak signal‐to‐noise ratio (PSNR), visual information fidelity (VIF), and multiscale gradient magnitude similarity deviation (MS‐GMSD). Tukey's Honestly Significant Difference multiple comparison test was employed to quantify the difference between the models where p ‐value <0.05 was considered statistically significant. Results Qualitatively, MAR‐CDPM outperformed these methods in preserving soft‐tissue contrast and different brain regions. It also successfully preserved tumor boundaries for heavy motion artifacts, like the supervised method. Our MAR‐CDPM recovered motion‐free in silico images with the highest PSNR and VIF for all distortion levels where the differences were statistically significant ( p ‐values <0.05). In addition, our method conditioned on t and T2‐FLAIR outperformed ( p ‐values <0.05) other methods to remove motion artifacts from the in silico dataset in terms of NMSE, MS‐SSIM, SSIM, and MS‐GMSD. Moreover, our method conditioned on only t outperformed generative models ( p ‐values <0.05) and had comparable performances compared with the supervised model ( p ‐values >0.05) to remove real motion artifacts. Conclusions The MAR‐CDPM could successfully remove motion artifacts from 3D ceT1 MPRAGE. 
It is particularly beneficial for elderly patients who may experience involuntary movements during high-resolution MRI with long acquisition times.
Article
Antimicrobial resistance (AMR) is a global crisis, responsible for ≈700 000 annual deaths, as reported by the World Health Organization. To counteract this growing threat to public health, innovative solutions for early detection and characterization of drug‐resistant bacterial strains are imperative. Surface‐enhanced Raman spectroscopy (SERS) combined with artificial intelligence (AI) technology presents a promising avenue to address this challenge. This review provides a concise overview of the latest advancements in SERS and AI, showcasing their transformative potential in the context of AMR. It explores the diverse methodologies proposed, highlighting their advantages and limitations. Additionally, the review underscores the significance of SERS in tandem use with machine learning (ML) and deep learning (DL) in combating AMR and emphasizes the importance of ongoing research and development efforts in this critical field. Future developments for this technology could transform the way antimicrobial resistance (AMR) is addressed and pave the way for novel approaches to the protection of public health worldwide.
Chapter
The diagnosis and treatment of different retinal diseases depend heavily on the ability to segment retinal blood vessels. Deep learning approaches have been used extensively in recent years to segment retinal blood vessels. The encoder–decoder architecture, attention mechanism, dilated convolutions, and capsule networks are the four main elements that make up the EDADCN architecture, which is used to segment retinal blood vessels. The segmentation map is created by using the encoder–decoder architecture to extract high-level features from the input image. Dilated convolutions capture multi-scale information for segmenting structures of various sizes and shapes, while the attention mechanism allows selective focus on important areas of the image. Capsule networks are used to manage the various blood vessel sizes and shapes. The evaluation findings show that the proposed architecture outperforms cutting-edge techniques for blood vessel segmentation. The suggested design achieves high segmentation accuracy, sensitivity, and specificity, which are crucial for accurate detection and treatment of retinal diseases. Additional strategies such as transfer learning and ensemble methods could be incorporated to further improve the efficiency of the architecture. The suggested deep learning architecture enables automated blood vessel segmentation from retinal images, allowing for the rapid detection and treatment of retinal diseases.
Article
Full-text available
Breast cancer is a heterogeneous disease with variable survival outcomes. Pathologists grade the microscopic appearance of breast tissue using the Nottingham criteria, which are qualitative and do not account for noncancerous elements within the tumor microenvironment. Here we present the Histomic Prognostic Signature (HiPS), a comprehensive, interpretable scoring of the survival risk incurred by breast tumor microenvironment morphology. HiPS uses deep learning to accurately map cellular and tissue structures to measure epithelial, stromal, immune, and spatial interaction features. It was developed using a population-level cohort from the Cancer Prevention Study-II and validated using data from three independent cohorts, including the Prostate, Lung, Colorectal, and Ovarian Cancer trial, Cancer Prevention Study-3, and The Cancer Genome Atlas. HiPS consistently outperformed pathologists in predicting survival outcomes, independent of tumor–node–metastasis stage and pertinent variables. This was largely driven by stromal and immune features. In conclusion, HiPS is a robustly validated biomarker to support pathologists and improve patient prognosis.
Chapter
Accurate semantic segmentation of surgical instruments from images captured by the laparoscopic system plays a crucial role in ensuring the reliability of vision-based Robot-Assisted Minimally Invasive Surgery. Despite numerous notable advancements in semantic segmentation, the achieved segmentation accuracy still falls short of meeting the requirements for surgical safety. To enhance the accuracy further, we propose several modifications to a conventional medical image segmentation network, including a modified Feature Pyramid Module. Within this modified module, Patch-Embedding with varying rates and Self-Attention Blocks are employed to mitigate the loss of feature information while simultaneously expanding the receptive field. As for the network architecture, all feature maps extracted by the encoder are seamlessly integrated into the proposed modified Feature Pyramid Module via element-wise connections. The resulting output from this module is then transmitted to the decoder blocks at each stage. Considering these hybrid properties, the proposed method is called Hybrid U-Net. Subsequently, multiple experiments were conducted on two available medical datasets and the experimental results reveal that our proposed method outperforms the recent methods in terms of accuracy on both medical datasets.
Chapter
Airway segmentation is a prerequisite for diagnosing and screening pulmonary diseases. While computer aided algorithms have achieved great success in various medical image segmentation tasks, it remains a challenge in keeping the continuity of airway branches due to the special tubular shape. Some existing airway-specific segmentation models introduce topological representations such as neighbor connectivity and centerline overlapping into deep models and some other methods proposed customized network modules or training strategies based on the characteristics of airways. In this paper, we propose a large-kernel attention block to enlarge the receptive field as well as maintain the details of thin branches. We reformulate the segmentation problem into pixel-wise segmentation and connectivity prediction with a differentiable connectivity modeling technique, and also propose a self-correction loss to minimize the difference between these two tasks. In addition, the binary ground truth is transformed into distances from the boundary, and distance regression is used as additional supervision. Our proposed model has been evaluated on two public datasets, and the results show that our model outperforms other benchmark methods.
Chapter
With the driving force of powerful convolutional neural networks, image inpainting has made tremendous progress. Recently, transformer has demonstrated its effectiveness in various vision tasks, mainly due to its capacity to model long-term relationships. However, when it comes to image inpainting tasks, the transformer tends to fall short in terms of modeling local information, and interference from damaged regions can pose challenges. To tackle these issues, we introduce a novel Semantic U-shaped Transformer (SUT) in this work. The SUT is designed with spectral transformer blocks in its shallow layers, effectively capturing local information. Conversely, deeper layers utilize BRA transformer blocks to model global information. A key feature of the SUT is its attention mechanism, which employs bi-level routing attention. This approach significantly reduces the interference of damaged regions on overall information, making the SUT more suitable for image inpainting tasks. Experiments on several datasets indicate that the performance of the proposed method outperforms the current state-of-the-art (SOTA) inpainting approaches. In general, the PSNR of our method is on average 0.93 dB higher than SOTA, and the SSIM is higher by 0.026.
Chapter
Monocular depth estimation is a critical task in computer vision, and self-supervised deep learning methods have achieved remarkable results in recent years. However, these models often struggle on camera generalization, i.e. at sequences captured by unseen cameras. To address this challenge, we present a new public custom dataset created using the CARLA simulator [4], consisting of three video sequences recorded by five different cameras with varying focal distances. This dataset has been created due to the absence of public datasets containing identical sequences captured by different cameras. Additionally, it is proposed in this paper the use of adversarial training to improve the models’ robustness to intrinsic camera parameter changes, enabling accurate depth estimation regardless of the recording camera. The results of our proposed architecture are compared with a baseline model, hence being evaluated the effectiveness of adversarial training and demonstrating its potential benefits both on our synthetic dataset and on the KITTI benchmark [8] as the reference dataset to evaluate depth estimation.
Chapter
Psoriasis is a dermatological lesion that manifests in several regions of the body. Its late diagnosis can generate the aggravation of the disease itself, as well as of the comorbidities associated with it. The proposed work presents a computational system for image classification in smartphones, through deep convolutional neural networks, to assist the process of diagnosis of psoriasis. The dataset and the classification algorithms used revealed that the classification of psoriasis lesions was most accurate with unsegmented and unprocessed images, indicating that deep learning networks are able to do a good feature selection. Smaller models have a lower accuracy, although they are more adequate for environments with power and memory restrictions, such as smartphones.
Chapter
We study the problem of predicting hierarchical image segmentations using supervised deep learning. While deep learning methods are now widely used as contour detectors, the lack of image datasets with hierarchical annotations has prevented researchers from explicitly training models to predict hierarchical contours. Image segmentation has been widely studied, but it is limited by only proposing a segmentation at a single scale. Hierarchical image segmentation solves this problem by proposing segmentation at multiple scales, capturing objects and structures at different levels of detail. However, this area of research appears to be less explored and therefore no hierarchical image segmentation dataset exists. In this paper, we provide a hierarchical adaptation of the Pascal-Part dataset [2], and use it to train a neural network for hierarchical image segmentation prediction. We demonstrate the efficiency of the proposed method through three benchmarks: the precision-recall and F-score benchmarks for boundary location, the level recovery fraction for assessing hierarchy quality, and the false discovery fraction. We show that our method successfully learns hierarchical boundaries in the correct order, and achieves better performance than the state-of-the-art model trained on single-scale segmentations.
Chapter
The model performance on cross-domain pulmonary nodule detection usually degrades because of the significant shift in data distributions and the scarcity of annotated medical data in the test scenarios. Current approaches to cross-domain object detection assume that training data from the source domain are freely available; however, such an assumption is implausible in the medical field, as the data are confidential and cannot be shared due to privacy concerns. Thus, this paper introduces source data-free cross-domain pulmonary nodule detection. In this setting, only a pre-trained model from the source domain and a few annotated samples from the target domain are available. We introduce a novel method to tackle this issue, adapting the feature extraction module for the target domain through minimizing the proposed General Entropy (GE). Specifically, we optimize the batch normalization (BN) layers of the model by GE minimization. Thus, the dataset-level statistics of the target domain are utilized for optimization and inference. Furthermore, we tune the detection head of the model using annotated target samples to mitigate the rater difference and improve the accuracy. Extensive experiments on three different pulmonary nodule datasets show the efficacy of our method for source data-absent cross-domain pulmonary nodule detection.
Chapter
Novel view synthesis (NVS) aims to synthesize photo-realistic images of a scene from existing source images; the synthesized images should be as close as possible to the true scene content. We present Deep Normalized Stable View Synthesis (DNSVS), an NVS method for large-scale scenes based on the Stable View Synthesis (SVS) pipeline. SVS combines neural networks with a 3D scene representation obtained from structure-from-motion and multi-view stereo: the view rays corresponding to each surface point of the scene representation, together with the source-view feature vectors, yield the value of each pixel in the target view. However, SVS weakens geometric information in the refinement stage, producing blur and artifacts in the novel views. To address this, DNSVS leverages the depth map to enhance the rendering process via a normalization approach. The proposed method is evaluated on the Tanks and Temples dataset as well as the FVS dataset. The average Learned Perceptual Image Patch Similarity (LPIPS) of our results improves on state-of-the-art NVS methods by 0.12%, indicating the superiority of our method.
Article
Full-text available
Current mainstream image semantic segmentation networks often suffer from mis-segmentation, segmentation discontinuity, and high model complexity, which limit their use in real-time processing scenarios. This work establishes a lightweight neural network for semantic segmentation to address these issues. The network uses a dual-branch strategy to tackle the low accuracy of semantic boundary segmentation. The semantic branch follows the DeepLabv3+ model structure; dilated convolutions with different dilation rates in the encoder expand the receptive field of the convolution operations and enhance the ability to capture local features. The boundary refinement branch extracts second-order differential features of the input image with the Laplace operator and gradually refines them through a feature refinement extraction module to obtain higher-level semantic features. A convolutional block attention module filters the features along both the channel and spatial dimensions, and the result is fused with the semantic branch to constrain the segmentation boundaries. On this basis, a multi-channel attention fusion module is proposed to aggregate features from different stages: low-resolution features are first up-sampled and then fused with high-resolution features to enhance the spatial information of the high-level features. The proposed network's effectiveness is demonstrated through extensive experiments on the MaSTr1325, MID, CamVid, and PASCAL VOC2012 datasets, with mIoU of 98.1, 73.1, and 81.1% and speeds of 111.40, 100.36, and 111.43 fps on a single NVIDIA RTX 3070 GPU, respectively.
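To make the boundary branch concrete, the sketch below shows how a fixed Laplace operator can be applied as a depthwise convolution to obtain second-order (edge) features from the input image. The module name and channel handling are illustrative assumptions; the cited network's refinement and attention modules are not reproduced.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LaplacianEdge(nn.Module):
    """Fixed 3x3 Laplacian applied per channel (hypothetical front end of a boundary branch)."""
    def __init__(self, channels=3):
        super().__init__()
        k = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]])
        # One non-trainable kernel per input channel (depthwise convolution).
        self.register_buffer("kernel", k.expand(channels, 1, 3, 3).clone())
        self.channels = channels

    def forward(self, x):
        return F.conv2d(x, self.kernel, padding=1, groups=self.channels)

edges = LaplacianEdge()(torch.rand(1, 3, 256, 256))   # second-order features for refinement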
Article
Full-text available
Recognition algorithms based on convolutional networks (CNNs) typically use the output of the last layer as the feature representation. However, the information in this layer may be too coarse to allow precise localization, while earlier layers may be precise in localization but do not capture semantics. To get the best of both worlds, we define the hypercolumn at a pixel as the vector of activations of all CNN units above that pixel. Using hypercolumns as pixel descriptors, we show results on three fine-grained localization tasks: simultaneous detection and segmentation [21], where we improve the state of the art from 49.7 [21] mean AP^r to 59.0; keypoint localization, where we obtain a 3.3-point boost over [19]; and part labeling, where we show a 6.6-point gain over a strong baseline.
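The hypercolumn idea reduces, in practice, to upsampling activations from several layers to a common resolution and stacking them per pixel. The sketch below illustrates this on a torchvision VGG16 backbone; the backbone and the tapped layer indices are assumptions for illustration, not the layers used in the cited paper.

import torch
import torch.nn.functional as F
from torchvision.models import vgg16

model = vgg16(weights=None).features.eval()
tap_layers = {3, 8, 15, 22}        # ReLU outputs to keep (illustrative choice)

def hypercolumns(image):           # image: (1, 3, H, W)
    h, w = image.shape[-2:]
    feats, x = [], image
    for i, layer in enumerate(model):
        x = layer(x)
        if i in tap_layers:
            # Upsample every tapped activation map to the input resolution.
            feats.append(F.interpolate(x, size=(h, w), mode="bilinear",
                                       align_corners=False))
    return torch.cat(feats, dim=1)  # (1, sum of tapped channels, H, W): one descriptor per pixel

desc = hypercolumns(torch.rand(1, 3, 224, 224))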
Article
Full-text available
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs through CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approximately 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms, easing development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
Conference Paper
Full-text available
Contextual information plays an important role in solving vision problems such as image segmentation. However, extracting contextual information and using it effectively remains a difficult problem. To address this challenge, we propose a multi-resolution contextual framework, called the cascaded hierarchical model (CHM), which learns contextual information in a hierarchical framework for image segmentation. At each level of the hierarchy, a classifier is trained on downsampled input images and the outputs of previous levels. Our model then incorporates the resulting multi-resolution contextual information into a classifier to segment the input image at its original resolution. We repeat this procedure by cascading the hierarchical framework to improve segmentation accuracy. Because multiple classifiers are learned in the CHM, a fast and accurate classifier is required to make training tractable; the classifier also needs to be robust against overfitting, given the large number of parameters learned during training. We introduce a novel classification scheme, called logistic disjunctive normal networks (LDNN), which consists of one adaptive layer of feature detectors implemented by logistic sigmoid functions, followed by two fixed layers of logical units that compute conjunctions and disjunctions, respectively. We demonstrate that LDNN outperforms state-of-the-art classifiers and can be used in the CHM to improve object segmentation performance.
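As a rough illustration of the LDNN structure described above, the sketch below implements one trainable layer of logistic sigmoid units grouped into fixed conjunctions (products) whose outputs are combined by a fixed disjunction (noisy-OR). The group count and group size are illustrative assumptions, not values from the cited paper.

import torch
import torch.nn as nn

class LDNN(nn.Module):
    def __init__(self, in_dim, n_groups=8, group_size=4):
        super().__init__()
        self.linear = nn.Linear(in_dim, n_groups * group_size)   # the single adaptive layer
        self.n_groups, self.group_size = n_groups, group_size

    def forward(self, x):
        g = torch.sigmoid(self.linear(x))                 # logistic sigmoid feature detectors
        g = g.view(-1, self.n_groups, self.group_size)
        conj = g.prod(dim=2)                              # fixed conjunctions within each group
        disj = 1.0 - (1.0 - conj).prod(dim=1)             # fixed disjunction across groups
        return disj                                       # per-sample foreground probability

scores = LDNN(in_dim=64)(torch.rand(16, 64))              # shape (16,)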
Article
Full-text available
Motivation: Automatic tracking of cells in multidimensional timelapse fluorescence microscopy is an important task in many biomedical applications. A novel framework for objective evaluation of cell tracking algorithms has been established under the auspices of the IEEE International Symposium on Biomedical Imaging 2013 Cell Tracking Challenge. In this paper, we present the logistics, datasets, methods and results of the challenge and lay down the principles for future uses of this benchmark. Results: The main contributions of the challenge include the creation of a comprehensive video dataset repository and the definition of objective measures for comparison and ranking of the algorithms. With this benchmark, six algorithms covering a variety of segmentation and tracking paradigms have been compared and ranked based on their performance on both synthetic and real datasets. Given the diversity of the datasets, we do not declare a single winner of the challenge. Instead, we present and discuss the results for each individual dataset separately. Availability and implementation: The challenge website (http://www.codesolorzano.com/celltrackingchallenge) provides access to the training and competition datasets, along with the ground truth of the training videos. It also provides access to Windows and Linux executable files of the evaluation software and most of the algorithms that competed in the challenge. Contact: codesolorzano@unav.es Supplementary information: Supplementary data, including video samples and algorithm descriptions are available at Bioinformatics online.
Article
Full-text available
The analysis of microcircuitry (the connectivity at the level of individual neuronal processes and synapses), which is indispensable for our understanding of brain function, is based on serial transmission electron microscopy (TEM) or one of its modern variants. Due to technical limitations, most previous studies that used serial TEM recorded relatively small stacks of individual neurons. As a result, our knowledge of microcircuitry in any nervous system is very limited. We applied the software package TrakEM2 to reconstruct neuronal microcircuitry from TEM sections of a small brain, the early larval brain of Drosophila melanogaster. TrakEM2 enables us to embed the analysis of the TEM image volumes at the microcircuit level into a light microscopically derived neuro-anatomical framework, by registering confocal stacks containing sparsely labeled neural structures with the TEM image volume. We imaged two sets of serial TEM sections of the Drosophila first instar larval brain neuropile and one ventral nerve cord segment, and here report our first results pertaining to Drosophila brain microcircuitry. Terminal neurites fall into a small number of generic classes termed globular, varicose, axiform, and dendritiform. Globular and varicose neurites have large diameter segments that carry almost exclusively presynaptic sites. Dendritiform neurites are thin, highly branched processes that are almost exclusively postsynaptic. Due to the high branching density of dendritiform fibers and the fact that synapses are polyadic, neurites are highly interconnected even within small neuropile volumes. We describe the network motifs most frequently encountered in the Drosophila neuropile. Our study introduces an approach towards a comprehensive anatomical reconstruction of neuronal microcircuitry and delivers microcircuitry comparisons between vertebrate and insect neuropile.
Conference Paper
Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66%). To our knowledge, our result is the first to surpass human-level performance (5.1%, Russakovsky et al.) on this visual recognition challenge.
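For readers who want the two ingredients in code, here is a minimal illustration of PReLU (PReLU(x) = x for x > 0 and a*x otherwise, with a learned slope) together with the rectifier-aware ("He") weight initialization, using PyTorch's built-in helpers; the layer sizes are arbitrary examples.

import torch
import torch.nn as nn

layer = nn.Conv2d(64, 128, kernel_size=3, padding=1)
# Variance-scaled initialization derived for rectifier nonlinearities.
nn.init.kaiming_normal_(layer.weight, a=0.25, nonlinearity='leaky_relu')
nn.init.zeros_(layer.bias)

act = nn.PReLU(num_parameters=128, init=0.25)    # one learnable negative slope per channel
out = act(layer(torch.rand(1, 64, 32, 32)))      # (1, 128, 32, 32)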
Article
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.
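The following sketch shows the core recipe in miniature: per-class score maps produced by 1x1 convolutions, coarse deep scores upsampled and summed with scores from a shallower layer (the "skip" idea), then upsampled to input resolution. The backbone split (torchvision VGG16 features) and layer indices are illustrative assumptions, not the exact architecture of the cited paper.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class TinyFCN(nn.Module):
    def __init__(self, n_classes=21):
        super().__init__()
        feats = vgg16(weights=None).features
        self.stage1 = feats[:24]                  # up to pool4 (1/16 resolution)
        self.stage2 = feats[24:]                  # up to pool5 (1/32 resolution)
        self.score1 = nn.Conv2d(512, n_classes, 1)
        self.score2 = nn.Conv2d(512, n_classes, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        # Fuse coarse semantic scores with finer scores from the shallower stage.
        s = F.interpolate(self.score2(f2), size=f1.shape[-2:], mode="bilinear",
                          align_corners=False) + self.score1(f1)
        return F.interpolate(s, size=(h, w), mode="bilinear", align_corners=False)

logits = TinyFCN()(torch.rand(1, 3, 320, 480))    # (1, 21, 320, 480)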
Article
Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66%). To our knowledge, our result is the first to surpass human-level performance (5.1%, Russakovsky et al.) on this visual recognition challenge.
We address a central problem of neuroanatomy, namely, the automatic segmentation of neuronal structures depicted in stacks of electron microscopy (EM) images. This is necessary to efficiently map 3D brain structure and connectivity. To segment biological neuron membranes, we use a special type of deep artificial neural network as a pixel classifier. The label of each pixel (membrane or non-membrane) is predicted from raw pixel values in a square window centered on it. The input layer maps each window pixel to a neuron. It is followed by a succession of convolutional and max-pooling layers which preserve 2D information and extract features with increasing levels of abstraction. The output layer produces a calibrated probability for each class. The classifier is trained by plain gradient descent on a 512 × 512 × 30 stack with known ground truth, and tested on a stack of the same size (ground truth unknown to the authors) by the organizers of the ISBI 2012 EM Segmentation Challenge. Even without problem-specific post-processing, our approach outperforms competing techniques by a large margin in all three considered metrics, i.e. rand error, warping error and pixel error. For pixel error, our approach is the only one outperforming a second human observer.
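A minimal sketch of the sliding-window setup described above: a square window centered on each pixel is fed to a small CNN that outputs a membrane probability. The window size, the placeholder network, and the mirror padding are illustrative assumptions; in practice the patches would be classified in batches rather than all at once.

import torch
import torch.nn as nn
import torch.nn.functional as F

win = 65                                            # window side length (illustrative)
net = nn.Sequential(                                # placeholder pixel classifier
    nn.Conv2d(1, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.LazyLinear(2))

def classify_pixels(image):                         # image: (H, W) grayscale tensor
    pad = win // 2
    padded = F.pad(image[None, None], (pad,) * 4, mode="reflect")   # mirror the border
    patches = padded.unfold(2, win, 1).unfold(3, win, 1)            # (1, 1, H, W, win, win)
    h, w = image.shape
    patches = patches.reshape(h * w, 1, win, win)
    with torch.no_grad():
        probs = net(patches).softmax(dim=1)[:, 1]                   # membrane probability
    return probs.view(h, w)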
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Article
The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. This approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification.
A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, T. Brox: Discriminative unsupervised feature learning with convolutional neural networks
D. C. Ciresan, L. M. Gambardella, A. Giusti, J. Schmidhuber: Deep neural networks segment neuronal membranes in electron microscopy images