Conference Paper

Deep Semantic Segmentation Models in Computer Vision


The interest in Deep Learning (DL) has seen an exponential growth in the last ten years, producing a significant increase in both theoretical and applicative studies. On the one hand, the versatility and the ability to tackle complex tasks have led to the rapid and widespread diffusion of DL technologies. On the other hand, the dizzying increase in the availability of biomedical data has made classical analyses, carried out by human experts, progressively more unlikely. Contextually, the need for efficient and reliable automatic tools to support clinicians, at least in the most demanding tasks, has become increasingly pressing. In this survey, we will introduce a broad overview of DL models and their applications to biomedical data processing, specifically to medical image analysis, sequence processing (RNA and proteins) and graph modeling of molecular data interactions. First, the fundamental key concepts of DL architectures will be introduced, with particular reference to neural networks for structured data, convolutional neural networks, generative adversarial models, and siamese architectures. Subsequently, their applicability for the analysis of different types of biomedical data will be shown, in areas ranging from diagnostics to the understanding of the characteristics underlying the process of transcription and translation of our genetic code, up to the discovery of new drugs. Finally, the prospects and future expectations of DL applications to biomedical data will be discussed.
In recent years, the Ribosome profiling technique (Ribo–seq) has emerged as a powerful method for globally monitoring the translation process in vivo at single nucleotide resolution. Based on deep sequencing of mRNA fragments, Ribo–seq allows to obtain profiles that reflect the time spent by ribosomes in translating each part of an open reading frame. Unfortunately, the profiles produced by this method can vary significantly in different experimental setups, being characterized by a poor reproducibility. To address this problem, we have employed a statistical method for the identification of highly reproducible Ribo–seq profiles, which was tested on a set of E. coli genes. State-of-the-art artificial neural network models have been used to validate the quality of the produced sequences. Moreover, new insights into the dynamics of ribosome translation have been provided through a statistical analysis on the obtained sequences.
In this paper, we use Generative Adversarial Networks (GANs) to synthesize high-quality retinal images along with the corresponding semantic label-maps, instead of real images during training of a segmentation network. Different from other previous proposals, we employ a two-step approach: first, a progressively growing GAN is trained to generate the semantic label-maps, which describes the blood vessel structure (i.e., the vasculature); second, an image-to-image translation approach is used to obtain realistic retinal images from the generated vasculature. The adoption of a two-stage process simplifies the generation task, so that the network training requires fewer images with consequent lower memory usage. Moreover, learning is effective, and with only a handful of training samples, our approach generates realistic high-resolution images, which can be successfully used to enlarge small available datasets. Comparable results were obtained by employing only synthetic images in place of real data during training. The practical viability of the proposed approach was demonstrated on two well-established benchmark sets for retinal vessel segmentation—both containing a very small number of training samples—obtaining better performance with respect to state-of-the-art techniques.
Multi-organ segmentation of X-ray images is of fundamental importance for computer aided diagnosis systems. However, the most advanced semantic segmentation methods rely on deep learning and require a huge amount of labeled images, which are rarely available due to both the high cost of human resources and the time required for labeling. In this paper, we present a novel multi-stage generation algorithm based on Generative Adversarial Networks (GANs) that can produce synthetic images along with their semantic labels and can be used for data augmentation. The main feature of the method is that, unlike other approaches, generation occurs in several stages, which simplifies the procedure and allows it to be used on very small datasets. The method was evaluated on the segmentation of chest radiographic images, showing promising results. The multi-stage approach achieves state-of-the-art and, when very few images are used to train the GANs, outperforms the corresponding single-stage approach.
The semantic image segmentation task consists of classifying each pixel of an image into an instance, where each instance corresponds to a class. This task is a part of the concept of scene understanding or better explaining the global context of an image. In the medical image analysis domain, image segmentation can be used for image-guided interventions, radiotherapy, or improved radiological diagnostics. In this review, we categorize the leading deep learning-based medical and non-medical image segmentation solutions into six main groups of deep architectural, data synthesis-based, loss function-based, sequenced models, weakly supervised, and multi-task methods and provide a comprehensive review of the contributions in each of these groups. Further, for each group, we analyze each variant of these groups and discuss the limitations of the current approaches and present potential future research directions for semantic image segmentation.
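As a concrete illustration of the pixel-wise classification described above, here is a minimal plain-Python sketch (the class names and score values are invented for illustration): the predicted label map is simply the per-pixel argmax over class scores.

```python
# Hypothetical class list for a medical-imaging example.
CLASSES = ["background", "vessel", "lesion"]

def label_map(scores):
    """scores[y][x] is a list of per-class scores for that pixel;
    returns the predicted class index for every pixel."""
    return [[max(range(len(px)), key=px.__getitem__) for px in row]
            for row in scores]

# A 2x2 "image" with 3 class scores per pixel.
scores = [[[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]],
          [[0.1, 0.2, 0.7], [0.8, 0.1, 0.1]]]
print(label_map(scores))  # → [[0, 1], [2, 0]]
```

Real systems produce these scores with a deep network and operate on tensors, but the final labeling step is exactly this argmax.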
Numerous methods that automatically identify subjects depicted in sketches as described by eyewitnesses have been implemented, but their performance often degrades when using real-world forensic sketches and extended galleries that mimic law enforcement mug-shot galleries. Moreover, little work has been done to apply deep learning for face photo-sketch recognition despite its success in numerous application domains including traditional face recognition. This is primarily due to the limited number of sketch images available, which are insufficient to robustly train large networks. This paper aims to tackle these issues with the following contributions: (i) a state-of-the-art model pre-trained for face photo recognition is tuned for face photo-sketch recognition by applying transfer learning, (ii) a 3D morphable model is used to synthesise new images and artificially expand the training data, allowing the network to prevent over-fitting and learn better features, (iii) multiple synthetic sketches are also used in the testing stage to improve performance, and (iv) fusion of the proposed method with a state-of-the-art algorithm is shown to further boost performance. An extensive evaluation of several popular and state-of-the-art algorithms is also performed using publicly available datasets, thereby serving as a benchmark for future algorithms. Compared to a leading method, the proposed framework is shown to reduce the error rate by 80.7% for viewed sketches and lowers the mean retrieval rank by 32.5% for real-world forensic sketches.
INTRODUCTION The ability of multilayer back-propagation networks to learn complex, high-dimensional, nonlinear mappings from large collections of examples makes them obvious candidates for image recognition or speech recognition tasks (see PATTERN RECOGNITION AND NEURAL NETWORKS). In the traditional model of pattern recognition, a hand-designed feature extractor gathers relevant information from the input and eliminates irrelevant variabilities. A trainable classifier then categorizes the resulting feature vectors (or strings of symbols) into classes. In this scheme, standard, fully-connected multilayer networks can be used as classifiers. A potentially more interesting scheme is to eliminate the feature extractor, feeding the network with "raw" inputs (e.g. normalized images), and to rely on backpropagation to turn the first few layers into an appropriate feature extractor. While this can be done with an ordinary fully connected feed-forward network with some success for tasks
Conference Paper
There is growing evidence that the use of stringent and dichotomic diagnostic categories in many medical disciplines (particularly 'brain sciences' such as neurology and psychiatry) is an oversimplification. Although clear diagnostic boundaries remain useful for patients, families, and their access to dedicated NHS and health care services, the traditional dichotomic categories are not helpful to describe the complexity and large heterogeneity of symptoms across many overlapping clinical phenotypes. With the advent of 'big' multimodal neuroimaging databases, data-driven stratification of the wide spectrum of healthy human physiology or disease based on neuroimages theoretically becomes possible. However, this conceptual framework is hampered by severe computational constraints. In this paper we present a novel, deep learning based encode-decode architecture which leverages several parameter-efficiency techniques to generate latent deep embeddings that compress the information contained in a full 3D neuroimaging volume by a factor of 1000, while still retaining anatomical detail and hence rendering the subsequent stratification problem tractable. We train our architecture on 1003 brain scans derived from the Human Connectome Project and demonstrate the faithfulness of the obtained reconstructions. Further, we employ a data-driven clustering technique, driven by a grid search in hyperparameter space, to identify six different strata within the 1003 healthy community-dwelling individuals, which turn out to correspond to highly significant group differences in both physiological and cognitive data. This indicates that the well-known relationships between such variables and brain structure can be probed in an unsupervised manner through our novel architecture and pipeline.
This opens the door to a variety of previously inaccessible applications in the realm of data-driven stratification of large cohorts based on neuroimaging data. Clinical Relevance - With our approach, each person can be described and classified within a multi-dimensional space of data, where they are uniquely classified according to their individual anatomy, physiology, and disease-related anatomical and physiological alterations.
Emergence of large datasets and resilience of convolutional models have enabled successful training of very large semantic segmentation models. However, high capacity implies high computational complexity and therefore hinders real-time operation. We therefore study compact architectures which aim at high accuracy in spite of modest capacity. We propose a novel semantic segmentation approach based on shared pyramidal representation and fusion of heterogeneous features along the upsampling path. The proposed pyramidal fusion approach is especially effective for dense inference in images with large scale variance due to strong regularization effects induced by feature sharing across the resolution pyramid. Interpretation of the decision process suggests that our approach succeeds by acting as a large ensemble of relatively simple models, as well as due to large receptive range and strong gradient flow towards early layers. Our best model achieves 76.4% mIoU on Cityscapes test and runs in real time on low-power embedded devices.
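The mIoU figure quoted above is the standard evaluation metric for semantic segmentation; a minimal plain-Python sketch of how it is computed (function and variable names are ours, not the paper's):

```python
def miou(pred, gt, num_classes):
    """Mean intersection-over-union between flat predicted and
    ground-truth label lists."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:                      # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)

pred = [0, 0, 1, 1, 2, 2]
gt   = [0, 1, 1, 1, 2, 0]
print(miou(pred, gt, 3))  # → 0.5  (IoUs: 1/3, 2/3, 1/2)
```

Benchmark suites compute the same quantity over all pixels of the whole test set, typically accumulating the per-class intersections and unions in a confusion matrix first.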
Providing pixel–level supervisions for scene text segmentation is inherently difficult and costly, so that only few small datasets are available for this task. To face the scarcity of training data, previous approaches based on Convolutional Neural Networks (CNNs) rely on the use of a synthetic dataset for pre–training. However, synthetic data cannot reproduce the complexity and variability of natural images. In this work, we propose to use a weakly supervised learning approach to reduce the domain–shift between synthetic and real data. Leveraging the bounding–box supervision of the COCO–Text and the MLT datasets, we generate weak pixel–level supervisions of real images. In particular, the COCO–Text–Segmentation (COCO_TS) and the MLT–Segmentation (MLT_S) datasets are created and released. These two datasets are used to train a CNN, the Segmentation Multiscale Attention Network (SMANet), which is specifically designed to face some peculiarities of the scene text segmentation task. The SMANet is trained end–to–end on the proposed datasets, and the experiments show that COCO_TS and MLT_S are a valid alternative to synthetic images, allowing to use only a fraction of the training samples, with a significant improvement in performance.
Over the last several years, the field of natural language processing has been propelled forward by an explosion in the use of deep learning models. This article provides a brief introduction to the field and a quick overview of deep learning architectures and methods. It then sifts through the plethora of recent studies and summarizes a large assortment of relevant contributions. Analyzed research areas include several core linguistic processing issues in addition to many applications of computational linguistics. A discussion of the current state of the art is then provided along with recommendations for future research in the field.
Alkaptonuria (AKU) is an ultra-rare autosomal recessive disorder (MIM 203500) caused by a complex set of mutations in the homogentisate 1,2-dioxygenase gene and the consequent accumulation of homogentisic acid (HGA), causing significant protein oxidation. A secondary form of amyloidosis was identified in AKU and related to high circulating serum amyloid A (SAA) levels, which are linked with inflammation and oxidative stress and might contribute to disease progression and patients' poor quality of life. Recently, we reported that inflammatory markers (SAA and chitotriosidase) and oxidative stress markers (protein thiolation index) might be disease activity markers in AKU. Thanks to an international network, we collected genotypic, phenotypic, and clinical data from more than 200 patients with AKU. These data are currently stored in our AKU database, named ApreciseKUre. In this work, we developed an algorithm able to make predictions about the oxidative status trend of each patient with AKU based on 5 predictors, namely circulating HGA, body mass index, total cholesterol, SAA, and chitotriosidase. Our general aim is to integrate the data of apparently heterogeneous patients with AKU by using specific bioinformatics tools, in order to identify pivotal mechanisms involved in AKU for a preventive, predictive, and personalized medicine approach. -Cicaloni, V., Spiga, O., Dimitri, G. M., Maiocchi, R., Millucci, L., Giustarini, D., Bernardini, G., Bernini, A., Marzocchi, B., Braconi, D., Santucci, A. Interactive alkaptonuria database: investigating clinical data to improve patient care in a rare disease.
Semantic segmentation, also called scene labeling, refers to the process of assigning a semantic label (e.g. car, people, and road) to each pixel of an image. It is an essential data processing step for robots and other unmanned systems to understand the surrounding scene. Despite decades of efforts, semantic segmentation is still a very challenging task due to large variations in natural scenes. In this paper, we provide a systematic review of recent advances in this field. In particular, three categories of methods are reviewed and compared, including those based on hand-engineered features, learned features and weakly supervised learning. In addition, we describe a number of popular datasets aiming for facilitating the development of new segmentation algorithms. In order to demonstrate the advantages and disadvantages of different semantic segmentation models, we conduct a series of comparisons between them. Deep discussions about the comparisons are also provided. Finally, this review is concluded by discussing future directions and challenges in this important field of research.
Learning machines for pattern recognition, such as neural networks or support vector machines, are usually conceived to process real–valued vectors with predefined dimensionality even if, in many real–world applications, relevant information is inherently organized into entities and relationships between them. Instead, Graph Neural Networks (GNNs) can directly process structured data, guaranteeing universal approximation of many practically useful functions on graphs. GNNs, that do not strictly meet the definition of deep architectures, are based on the unfolding mechanism during learning, that, in practice, yields networks that have the same depth of the data structures they process. However, GNNs may be hindered by the long–term dependency problem, i.e. the difficulty in taking into account information coming from peripheral nodes within graphs — due to the local nature of the procedures for updating the state and the weights. To overcome this limitation, GNNs may be cascaded to form layered architectures, called Layered GNNs (LGNNs). Each GNN in the cascade is trained based on the original graph “enriched” with the information computed by the previous layer, to implement a sort of incremental learning framework, able to take into account progressively further information. The applicability of LGNNs will be illustrated both with respect to a classical problem in graph–theory and to pattern recognition problems in bioinformatics.
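A toy illustration of the local update mechanism and the long-term dependency problem described above (plain Python, with an invented graph): each propagation round only mixes a node's state with its immediate neighbours', so information from a peripheral node needs as many rounds as its graph distance to arrive — which is what stacking layers in an LGNN-style cascade is meant to ease.

```python
def propagate(adj, state):
    """One local round: each node averages its own state with its neighbours'."""
    return {v: (state[v] + sum(state[u] for u in adj[v])) / (1 + len(adj[v]))
            for v in adj}

# Path graph a-b-c-d: node 'd' is at distance 3 from 'a'.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
state = {"a": 1.0, "b": 0.0, "c": 0.0, "d": 0.0}

for _ in range(3):            # three stacked rounds ≈ three layers
    state = propagate(adj, state)

# After only 2 rounds state["d"] would still be 0.0; the third round
# finally carries a's information to the periphery.
print(state["d"] > 0)  # → True
```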
Conference Paper
Deep Convolutional Neural Networks (DCNNs) have recently shown state of the art performance in high level vision tasks, such as image classification and object detection. This work brings together methods from DCNNs and probabilistic graphical models for addressing the task of pixel-level classification (also called "semantic image segmentation"). We show that responses at the final layer of DCNNs are not sufficiently localized for accurate object segmentation. This is due to the very invariance properties that make DCNNs good for high level tasks. We overcome this poor localization property of deep networks by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF). Qualitatively, our "DeepLab" system is able to localize segment boundaries at a level of accuracy which is beyond previous methods. Quantitatively, our method sets the new state-of-the-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 71.6% IOU accuracy in the test set. We show how these results can be obtained efficiently: Careful network re-purposing and a novel application of the 'hole' algorithm from the wavelet community allow dense computation of neural net responses at 8 frames per second on a modern GPU.
Technical Report
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Conference Paper
The semantic image segmentation task presents a trade-off between test time accuracy and training time annotation cost. Detailed per-pixel annotations enable training accurate models but are very time-consuming to obtain; image-level class labels are an order of magnitude cheaper but result in less accurate models. We take a natural step from image-level annotation towards stronger supervision: we ask annotators to point to an object if one exists. We incorporate this point supervision along with a novel objectness potential in the training loss function of a CNN model. Experimental results on the PASCAL VOC 2012 benchmark reveal that the combined effect of point-level supervision and objectness potential yields an improvement of 12.9% mIOU over image-level supervision. Further, we demonstrate that models trained with point-level supervision are more accurate than models trained with image-level, squiggle-level or full supervision given a fixed annotation budget.
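The point-supervision idea can be sketched as a partial cross-entropy evaluated only at the clicked pixels (a simplified plain-Python illustration; it omits the paper's objectness potential, and all names and values here are ours):

```python
import math

def point_supervised_loss(probs, points):
    """Cross-entropy averaged over only the handful of annotated pixels.
    probs[y][x] is a per-class probability list; points maps (y, x) -> class."""
    return -sum(math.log(probs[y][x][c])
                for (y, x), c in points.items()) / len(points)

probs = [[[0.8, 0.2], [0.4, 0.6]],
         [[0.3, 0.7], [0.9, 0.1]]]
points = {(0, 0): 0, (1, 0): 1}   # two clicked pixels with their classes
print(round(point_supervised_loss(probs, points), 3))  # → 0.29
```

Unannotated pixels contribute no gradient here; in the paper they are instead handled by the image-level and objectness terms of the full loss.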
Conference Paper
There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at .
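The contracting/expanding idea can be caricatured on a 1-D signal (a toy plain-Python sketch, not the actual U-Net: downsampling captures context, upsampling restores resolution, and the skip connection re-injects the fine detail lost by pooling):

```python
def down(x):                      # contracting step: 2x max-pool
    return [max(x[i], x[i + 1]) for i in range(0, len(x), 2)]

def up(x):                        # expanding step: nearest-neighbour upsample
    return [v for v in x for _ in range(2)]

def unet_like(x):
    skip = x                      # skip connection preserves fine detail
    coarse = down(x)              # context captured at low resolution
    upsampled = up(coarse)
    # merge expanded coarse features with the skipped fine features
    return [(a + b) / 2 for a, b in zip(upsampled, skip)]

signal = [0.0, 1.0, 0.2, 0.8]
print(unet_like(signal))  # → [0.5, 1.0, 0.5, 0.8]
```

The real architecture learns convolutional features at every level and concatenates (rather than averages) the skip and upsampled paths, but the symmetric down/up topology is the same.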
In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or 'atrous convolution', as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed "DeepLab" system sets the new state-of-the-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU in the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.
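A 1-D sketch of atrous convolution (illustrative, plain Python): spacing the same weights `rate` samples apart enlarges the field of view without adding parameters or per-output computation.

```python
def atrous_conv1d(x, w, rate):
    """1-D convolution 'with holes': taps are spaced `rate` apart, so the
    field of view is (len(w) - 1) * rate + 1 with the same weights."""
    reach = (len(w) - 1) * rate
    return [sum(w[k] * x[i + k * rate] for k in range(len(w)))
            for i in range(len(x) - reach)]

x = [1, 2, 3, 4, 5, 6, 7]
w = [1, 0, -1]                    # the same 3 weights in both cases
print(atrous_conv1d(x, w, 1))     # → [-2, -2, -2, -2, -2]  (field of view 3)
print(atrous_conv1d(x, w, 2))     # → [-4, -4, -4]          (field of view 5)
```

ASPP applies several such convolutions in parallel at different rates and merges their responses, which is how multiple scales are probed with one feature map.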
In this paper, a new system that improves the image obtained by an array of ultrasonic sensors using Electrical Resistance Tomography (ERT) is presented. One of its target applications can be in automatic exploration of soft tissues, where different organs and eventual anomalies exhibit simultaneously different electrical conductivities and different acoustic impedances. The exclusive usage of the ERT technique usually leads to some significant uncertainties around the regions' boundaries and usually generates images with relatively low resolutions. The proposed method shows that by properly combining this technique with an ultrasonic-based method, which can provide good localization of some edge points, the accuracy of the shape of individual cells can be improved if these edge points are used as constraints during the inversion procedure. The performance of the proposed reconstruction method was assessed by conducting extensive tests on some simulated phantoms which mimic soft tissues. The obtained results clearly show the outperformance of this method over single modalities techniques that use either ultrasound or ERT imaging.
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
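The "dropout" regularizer mentioned above can be sketched in a few lines of plain Python (this is the now-common inverted-dropout variant, which rescales at training time so nothing changes at test time; a sketch of the idea, not necessarily the paper's exact formulation):

```python
import random

def dropout(activations, p, training=True):
    """Zero each unit with probability p during training and rescale the
    survivors by 1/(1-p) so the expected activation is unchanged."""
    if not training:
        return list(activations)      # identity at test time
    keep = 1.0 - p
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(0)
out = dropout([1.0, 1.0, 1.0, 1.0], p=0.5)
print(out)                            # survivors scaled to 2.0, the rest zeroed
print(dropout([1.0, 2.0], p=0.5, training=False))  # → [1.0, 2.0]
```

Because different units are dropped on every forward pass, the network behaves like an implicit ensemble of thinned sub-networks, which is the intuition behind its regularizing effect.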
Deep learning in bioinformatics
  • Seonwoo Min
  • Byunghan Lee
  • Sungroh Yoon
Seonwoo Min, Byunghan Lee, and Sungroh Yoon. Deep learning in bioinformatics. Briefings in bioinformatics, 18(5):851-869, 2017.
  • Zhengxia Zou
  • Zhenwei Shi
  • Yuhong Guo
  • Jieping Ye
Zhengxia Zou, Zhenwei Shi, Yuhong Guo, and Jieping Ye. Object detection in 20 years: A survey. arXiv preprint arXiv:1905.05055, 2019.
  • Robert M Haralick
  • Linda G Shapiro
Robert M Haralick and Linda G Shapiro. Computer and robot vision, volume 1. Addison-Wesley, Reading, MA, 1992.
Learning deconvolution network for semantic segmentation
  • Hyeonwoo Noh
  • Seunghoon Hong
  • Bohyung Han
Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.
Rethinking atrous convolution for semantic image segmentation
  • Liang-Chieh Chen
  • George Papandreou
  • Florian Schroff
  • Hartwig Adam
Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
Segnet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling
  • Vijay Badrinarayanan
  • Ankur Handa
  • Roberto Cipolla
Vijay Badrinarayanan, Ankur Handa, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv preprint arXiv:1505.07293, 2015.
On efficient real-time semantic segmentation: A survey
  • Christopher J Holder
  • Muhammad Shafique
Christopher J Holder and Muhammad Shafique. On efficient real-time semantic segmentation: A survey. arXiv preprint arXiv:2206.08605, 2022.
Constrained convolutional neural networks for weakly supervised segmentation
  • Deepak Pathak
  • Philipp Krahenbuhl
  • Trevor Darrell
Deepak Pathak, Philipp Krahenbuhl, and Trevor Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1796-1804, 2015.