Conference Paper

A Differentiable Gaussian Prototype Layer for Explainable Fruit Segmentation

Chapter
Prototypical part networks predict not only the class of an image but also explain why it was chosen. In some cases, however, the detected features do not relate to the depicted objects. This is especially relevant in prototypical part networks, as prototypes are meant to code for high-level concepts such as semantic parts of objects. This raises the question of how the inference of such networks can be improved. Here we suggest enabling the user to give hints and interactively correct the model’s reasoning. We show that even correct classifications can rely on unreasonable or spurious prototypes that result from confounding variables in a dataset. Hence, we propose simple yet effective interaction schemes for inference adjustment that enable the user to interactively revise the prototypes chosen by the model. With the suggested mode of training, spurious prototypes can be removed or altered to become sensitive to object features. Interactive prototype revision allows users who are naïve to machine learning to adjust the logic of reasoning and change the way prototypical part networks make a decision.
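The deselection mechanism can be pictured with a small sketch: in a ProtoPNet-style model, zeroing the last-layer weights of a spurious prototype removes it from every class logit. This is a minimal illustration of the idea, not the authors' implementation; the layer handle and index are hypothetical.

    import torch

    def deselect_prototype(last_layer: torch.nn.Linear, proto_idx: int) -> None:
        # Each column of the last layer corresponds to one prototype's
        # similarity score; zeroing it silences that prototype entirely.
        with torch.no_grad():
            last_layer.weight[:, proto_idx] = 0.0

    # usage (hypothetical): deselect_prototype(model.last_layer, proto_idx=17)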
Chapter
Full-text available
Existing prototype-based models address the black-box nature of deep learning. However, they are sub-optimal, as they often assume separate prototypes for each class, require multi-step optimization, make decisions based on prototype absence (a so-called negative reasoning process), and derive vague prototypes. To address those shortcomings, we introduce ProtoPool, an interpretable prototype-based model with positive reasoning and three main novelties. Firstly, we reuse prototypes across classes, which significantly decreases their number. Secondly, we allow automatic, fully differentiable assignment of prototypes to classes, which substantially simplifies the training process. Finally, we propose a new focal similarity function that contrasts the prototype from the background and consequently concentrates on more salient visual features. We show that ProtoPool obtains state-of-the-art accuracy on the CUB-200-2011 and the Stanford Cars datasets, substantially reducing the number of prototypes. We provide a theoretical analysis of the method and a user study to show that our prototypes capture more salient features than those obtained with competitive methods. We made the code available at https://github.com/gmum/ProtoPool.
Keywords: Deep learning, Interpretability, Case-based reasoning
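The focal similarity idea can be sketched as follows: contrast the peak prototype activation over the feature map with its mean, so prototypes that respond everywhere (i.e., to background) score low. A minimal PyTorch sketch, with the (batch, prototypes, H, W) similarity shape assumed:

    import torch

    def focal_similarity(sim: torch.Tensor) -> torch.Tensor:
        # sim: (batch, n_prototypes, H, W) spatial similarity maps
        peak = sim.amax(dim=(2, 3))   # strongest local match per prototype
        mean = sim.mean(dim=(2, 3))   # diffuse background response
        return peak - mean            # high only for salient, localized matches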
Article
Full-text available
Power distribution grids are typically installed outdoors and are exposed to environmental conditions. When contamination accumulates in the structures of the network, there may be shutdowns caused by electrical arcs. To improve the reliability of the network, visual inspections of the electrical power system can be carried out; these inspections can be automated using computer vision techniques based on deep neural networks. Based on this need, this paper proposes the Semi-ProtoPNet deep learning model to classify defective structures in power distribution networks. The Semi-ProtoPNet deep neural network does not perform convex optimization of its last dense layer, in order to maintain the impact of the negative reasoning process on image classification. The negative reasoning process rejects the incorrect classes of an input image; for this reason, it is possible to carry out the analysis with a low number of images that have different backgrounds, which is one of the challenges of this type of analysis. Semi-ProtoPNet achieves an accuracy of 97.22%, outperforming VGG-13, VGG-16, VGG-19, ResNet-34, ResNet-50, ResNet-152, DenseNet-121, DenseNet-161, and DenseNet-201, as well as models of the same class such as ProtoPNet, NP-ProtoPNet, Gen-ProtoPNet, and Ps-ProtoPNet.
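The key design choice, keeping the last dense layer out of the optimization so that negative class connections survive, can be sketched as follows. The +1/-0.5 weighting follows the ProtoPNet initialization convention; treat the exact constants and names as assumptions, not the authors' code:

    import torch

    def build_fixed_last_layer(n_prototypes, n_classes, protos_per_class,
                               neg_weight=-0.5):
        # Positive weight for a class's own prototypes, negative weight for
        # all others; frozen, so negative reasoning is never optimized away.
        layer = torch.nn.Linear(n_prototypes, n_classes, bias=False)
        w = torch.full((n_classes, n_prototypes), neg_weight)
        for c in range(n_classes):
            w[c, c * protos_per_class:(c + 1) * protos_per_class] = 1.0
        layer.weight = torch.nn.Parameter(w, requires_grad=False)
        return layer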
Conference Paper
Full-text available
Vision transformers (ViTs), which have demonstrated state-of-the-art performance in image classification, can also visualize global interpretations through attention-based contributions. However, the complexity of the model makes it difficult to interpret the decision-making process, and the ambiguity of the attention maps can cause incorrect correlations between image patches. In this study, we propose a new ViT neural tree decoder (ViT-NeT). A ViT acts as a backbone, and to overcome its limitations, the output contextual image patches are fed to the proposed NeT. The NeT aims to accurately classify fine-grained objects with similar inter-class correlations and different intra-class correlations. In addition, it describes the decision-making process through a tree structure and prototypes, enabling a visual interpretation of the results. The proposed ViT-NeT is designed not only to improve classification performance but also to provide a human-friendly interpretation, which is effective in resolving the trade-off between performance and interpretability. We compared the performance of ViT-NeT with other state-of-the-art methods using widely used fine-grained visual categorization benchmark datasets and experimentally showed that the proposed method is superior in terms of classification performance and interpretability. The code and models are publicly available at https://github.com/jumpsnack/ViT-NeT.
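The routing principle of a neural tree decoder can be illustrated with a toy soft decision tree: each internal node emits a probability of going right, and each leaf's class distribution is weighted by the probability of the path reaching it. This is a generic depth-2 illustration of soft routing, not the actual ViT-NeT decoder:

    import torch

    def soft_tree(x, node_w, leaf_dist):
        # x: (batch, D); node_w: (3, D) for the root and its two children;
        # leaf_dist: (4, n_classes) class distributions at the four leaves
        p = torch.sigmoid(x @ node_w.T)                     # (batch, 3) p(right)
        path = torch.stack([(1 - p[:, 0]) * (1 - p[:, 1]),  # leaf 0: left, left
                            (1 - p[:, 0]) * p[:, 1],        # leaf 1: left, right
                            p[:, 0] * (1 - p[:, 2]),        # leaf 2: right, left
                            p[:, 0] * p[:, 2]], dim=1)      # leaf 3: right, right
        return path @ leaf_dist                             # (batch, n_classes)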
Article
Full-text available
We present an approach for efficiently training Gaussian Mixture Models (GMMs) by Stochastic Gradient Descent (SGD) with non-stationary, high-dimensional streaming data. Our training scheme does not require data-driven parameter initialization (e.g., k-means) and can thus be trained from a random initial state. Furthermore, the approach allows mini-batch sizes as low as 1, which are typical for streaming-data settings. Major problems in such settings are undesirable local optima during early training phases and numerical instabilities due to high data dimensionalities. We introduce an adaptive annealing procedure to address the first problem, whereas numerical instabilities are eliminated by an exponential-free approximation to the standard GMM log-likelihood. Experiments on a variety of visual and non-visual benchmarks show that our SGD approach can be trained completely without, for instance, k-means based centroid initialization. It also compares favorably to stochastic EM (sEM), an online variant of Expectation-Maximization (EM), which it outperforms by a large margin for very high-dimensional data.
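The exponential-free approximation described above replaces the log-sum-exp over components with the single largest component term, max_k [log pi_k + log N_k(x)], which avoids overflow and underflow in high dimensions while remaining differentiable for SGD. A sketch with diagonal covariances (shapes and parametrization are assumptions based on the description):

    import math
    import torch

    def max_component_log_likelihood(x, log_pi, mu, log_var):
        # x: (batch, D); log_pi: (K,); mu, log_var: (K, D)
        diff = x.unsqueeze(1) - mu                       # (batch, K, D)
        log_prob = -0.5 * ((diff ** 2) / log_var.exp() + log_var
                           + math.log(2 * math.pi)).sum(dim=2)
        return (log_pi + log_prob).amax(dim=1)           # (batch,)

    # SGD step: loss = -max_component_log_likelihood(batch, log_pi, mu, log_var).mean()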
Article
Full-text available
Interpretation of the reasoning process behind a prediction made by a deep learning model is always desired. However, when the predictions of a deep learning model directly impact people’s lives, interpretation becomes a necessity. In this paper, we introduce a deep learning model: the negative-positive prototypical part network (NP-ProtoPNet). This model attempts to imitate human reasoning for image recognition by comparing the parts of a test image with the corresponding parts of images from known classes. We demonstrate our model on a dataset of chest X-ray images of Covid-19 patients, pneumonia patients and normal people. The accuracy and precision that our model achieves are on par with the best performing non-interpretable deep learning models.
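The part-to-part comparison that drives such models can be sketched compactly: each prototype, a learned feature-space part, is compared against every spatial patch of the test image's feature map, and the best match supplies the class evidence. A minimal sketch using the ProtoPNet-style log activation (names and the epsilon are assumptions):

    import torch

    def patch_similarity(fmap, prototype, eps=1e-4):
        # fmap: (C, H, W) feature map; prototype: (C, 1, 1) learned part
        d2 = ((fmap - prototype) ** 2).sum(dim=0)   # (H, W) squared distances
        sim = torch.log((d2 + 1) / (d2 + eps))      # large when a patch is close
        return sim.max()                            # best-matching patch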
Conference Paper
Full-text available
Semantic segmentation assigns a class label to each image pixel. This dense prediction problem requires large amounts of manually annotated data, which is often unavailable. Few-shot learning aims to learn the pattern of a new category with only a few annotated examples. In this paper, we generalize the few-shot semantic segmentation problem from 1-way (one class) to N-way (N classes). Inspired by few-shot classification, we propose a generalized framework for few-shot semantic segmentation with an alternative training scheme. The framework is based on prototype learning and metric learning. Our approach outperforms the baselines by a large margin and shows comparable performance for 1-way few-shot semantic segmentation on the PASCAL VOC 2012 dataset.
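A common realization of this prototype-plus-metric scheme, consistent with the description though not necessarily the authors' exact code, builds each class prototype by masked average pooling over support features and labels query pixels by cosine similarity:

    import torch
    import torch.nn.functional as F

    def masked_average_prototype(feat, mask):
        # feat: (C, H, W) support features; mask: (H, W) binary class mask
        m = mask.unsqueeze(0)
        return (feat * m).sum(dim=(1, 2)) / m.sum().clamp(min=1e-6)  # (C,)

    def segment_query(query_feat, prototypes):
        # query_feat: (C, H, W); prototypes: (N, C), one per class
        q = F.normalize(query_feat.flatten(1), dim=0)    # (C, H*W)
        p = F.normalize(prototypes, dim=1)               # (N, C)
        scores = p @ q                                   # cosine similarities
        return scores.argmax(dim=0).view(query_feat.shape[1:])  # (H, W) labels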
Article
Full-text available
Superpixels are becoming increasingly popular for use in computer vision applications. However, there are few algorithms that output a desired number of regular, compact superpixels with a low computational overhead. We introduce a novel algorithm that clusters pixels in the combined five-dimensional color and image plane space to efficiently generate compact, nearly uniform superpixels. The simplicity of our approach makes it extremely easy to use (a lone parameter specifies the number of superpixels), and the efficiency of the algorithm makes it very practical. Experiments show that our approach produces superpixels at a lower computational cost while achieving a segmentation quality equal to or greater than four state-of-the-art methods, as measured by boundary recall and under-segmentation error. We also demonstrate the benefits of our superpixel approach in contrast to existing methods for two tasks in which superpixels have already been shown to increase performance over pixel-based methods.
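The clustering distance in the combined space weighs CIELAB color proximity against spatial proximity normalized by the sampling interval S (derived from the requested number of superpixels) and a compactness weight m. A sketch of that distance:

    import numpy as np

    def slic_distance(pix, center, S, m=10.0):
        # pix, center: arrays [l, a, b, x, y]
        d_color = np.linalg.norm(pix[:3] - center[:3])   # CIELAB distance
        d_space = np.linalg.norm(pix[3:] - center[3:])   # image-plane distance
        return np.sqrt(d_color**2 + (d_space / S)**2 * m**2)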
Article
Current machine learning models have shown high efficiency in solving a wide variety of real-world problems. However, their black-box character poses a major challenge for the comprehensibility and traceability of the underlying decision-making strategies. As a remedy, numerous post-hoc and self-explanation methods have been developed to interpret the models’ behavior. In addition, those methods enable the identification of artifacts that are inherent in the training data and can be erroneously learned by the model as class-relevant features. In this work, we provide a detailed case study of a representative state-of-the-art self-explaining network, ProtoPNet, in the presence of a spectrum of artifacts. Accordingly, we identify the main drawbacks of ProtoPNet, especially its coarse and spatially imprecise explanations. We address these limitations by introducing Prototypical Relevance Propagation (PRP), a novel method for generating more precise model-aware explanations. Furthermore, in order to obtain a clean, artifact-free dataset, we propose to use multi-view clustering strategies for segregating the artifact images using the PRP explanations, thereby suppressing potential artifact learning in the models. The code will be made available on GitHub upon acceptance.
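The segregation step can be pictured with a simple stand-in: cluster images by their (precomputed) explanation maps so that images whose relevance concentrates on the same artifact fall into one cluster for removal. Plain KMeans stands in here for the paper's multi-view clustering strategies; the function and shapes are illustrative:

    import numpy as np
    from sklearn.cluster import KMeans

    def segregate_by_explanations(prp_maps, n_clusters=2):
        # prp_maps: (n_images, H, W) relevance maps from the explanation method
        flat = prp_maps.reshape(len(prp_maps), -1)
        return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(flat)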
Article
Vision models are interpretable when they classify objects on the basis of features that a person can directly understand. Recently, methods relying on visual feature prototypes have been developed for this purpose. However, in contrast to how humans categorize objects, these approaches have not yet made use of any taxonomical organization of class labels. With such an approach, for instance, we may see why a chimpanzee is classified as a chimpanzee, but not why it was considered to be a primate or even an animal. In this work we introduce a model that uses hierarchically organized prototypes to classify objects at every level in a predefined taxonomy. Hence, we may find distinct explanations for the prediction an image receives at each level of the taxonomy. The hierarchical prototypes enable the model to perform another important task: interpretably classifying images from previously unseen classes at the level of the taxonomy to which they correctly relate, e.g. classifying a hand gun as a weapon, when the only weapons in the training data are rifles. With a subset of ImageNet, we test our model against its counterpart black-box model on two tasks: 1) classification of data from familiar classes, and 2) classification of data from previously unseen classes at the appropriate level in the taxonomy. We find that our model performs approximately as well as its counterpart black-box model while allowing for each classification to be interpreted.
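The per-level prediction with a fallback for unfamiliar classes can be sketched as follows: walk the taxonomy from coarse to fine and stop reporting once the model's confidence drops, so a novel weapon is still labeled "weapon" even if no fine-grained class fits. The threshold and structure are illustrative assumptions, not the paper's exact mechanism:

    import torch

    def predict_with_taxonomy(level_logits, tau=0.9):
        # level_logits: list of logit tensors, ordered coarse -> fine
        labels = []
        for logits in level_logits:
            probs = torch.softmax(logits, dim=0)
            conf, idx = probs.max(dim=0)
            if conf < tau:       # not confident at this granularity: stop
                break
            labels.append(idx.item())
        return labels            # deepest levels the model is confident about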
Article
Despite the recent progress in Graph Neural Networks (GNNs), it remains challenging to explain the predictions made by GNNs. Existing explanation methods mainly focus on post-hoc explanations where another explanatory model is employed to provide explanations for a trained GNN. The fact that post-hoc methods fail to reveal the original reasoning process of GNNs raises the need of building GNNs with built-in interpretability. In this work, we propose Prototype Graph Neural Network (ProtGNN), which combines prototype learning with GNNs and provides a new perspective on the explanations of GNNs. In ProtGNN, the explanations are naturally derived from the case-based reasoning process and are actually used during classification. The prediction of ProtGNN is obtained by comparing the inputs to a few learned prototypes in the latent space. Furthermore, for better interpretability and higher efficiency, a novel conditional subgraph sampling module is incorporated to indicate which part of the input graph is most similar to each prototype in ProtGNN+. Finally, we evaluate our method on a wide range of datasets and perform concrete case studies. Extensive results show that ProtGNN and ProtGNN+ can provide inherent interpretability while achieving accuracy on par with the non-interpretable counterparts.
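The case-based prediction step can be sketched in a few lines: embed the graph, score its distance to each learned prototype with a ProtoPNet-style log similarity, and feed the similarity vector to a linear classifier. All names are illustrative, not the authors' code:

    import torch

    def prototype_logits(graph_emb, prototypes, last_layer, eps=1e-4):
        # graph_emb: (batch, D); prototypes: (K, D)
        d2 = torch.cdist(graph_emb, prototypes) ** 2   # squared distances
        sim = torch.log((d2 + 1.0) / (d2 + eps))       # large when close
        return last_layer(sim)                         # (batch, n_classes)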
Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes. Most approaches only exploit the temporal dimension to address the association problem, while relying on single frame predictions for the segmentation mask itself. We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information for online multiple object tracking and segmentation. PCAN first distills a space-time memory into a set of prototypes and then employs cross-attention to retrieve rich information from the past frames. To segment each object, PCAN adopts a prototypical appearance module to learn a set of contrastive foreground and background prototypes, which are then propagated over time. Extensive experiments demonstrate that PCAN outperforms current video instance tracking and segmentation competition winners on both Youtube-VIS and BDD100K datasets, and shows efficacy to both one-stage and two-stage segmentation frameworks. Code and video resources are available at http://vis.xyz/pub/pcan.
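The efficiency argument can be made concrete with a rough sketch: attending over a handful of distilled prototypes instead of every past-frame pixel keeps the cross-attention cost small. The distillation itself (e.g., clustering the space-time memory) is omitted, and shapes and names are assumptions:

    import torch
    import torch.nn.functional as F

    def prototype_cross_attention(queries, proto_keys, proto_values):
        # queries: (N, D) current-frame features; proto_keys/values: (K, D)
        scale = proto_keys.shape[1] ** 0.5
        attn = F.softmax(queries @ proto_keys.T / scale, dim=1)  # (N, K)
        return attn @ proto_values                               # (N, D) context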
Article
When neural networks are employed for high-stakes decision-making, it is desirable that they provide explanations for their predictions so that we can understand the features that contributed to the decision. At the same time, it is important to flag potential outliers for in-depth verification by domain experts. In this work we propose to unify two differing aspects of explainability with outlier detection. We argue for a broader adoption of prototype-based student networks capable of providing an example-based explanation for their prediction and, at the same time, identifying regions of similarity between the predicted sample and the examples. The examples are real prototypical cases sampled from the training set via our novel iterative prototype replacement algorithm. Furthermore, we propose to use the prototype similarity scores for identifying outliers. We compare the classification performance, explanation quality, and outlier detection of our proposed network with other baselines. We show that our prototype-based networks, which go beyond similarity kernels, deliver meaningful explanations and promising outlier detection results without compromising classification accuracy.
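The outlier use of the similarity scores can be sketched directly: a sample whose best similarity to any prototype is low resembles none of the prototypical training cases and is flagged for expert review. The percentile calibration is an assumption, not necessarily the paper's exact recipe:

    import torch

    def flag_outliers(similarities, train_similarities, percentile=1.0):
        # similarities: (batch, n_prototypes); higher means more similar
        thresh = torch.quantile(train_similarities.amax(dim=1),
                                percentile / 100.0)
        return similarities.amax(dim=1) < thresh   # True = flag as outlier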
Article
Agricultural applications such as yield prediction, precision agriculture and automated harvesting need systems able to infer the crop state from low-cost sensing devices. Proximal sensing using affordable cameras combined with computer vision has emerged as a promising alternative, strengthened after the advent of convolutional neural networks (CNNs) as an alternative for challenging pattern recognition problems in natural images. Considering fruit growing monitoring and automation, a fundamental problem is the detection, segmentation and counting of individual fruits in orchards. Here we show that for wine grapes, a crop presenting large variability in shape, color, size and compactness, grape clusters can be successfully detected, segmented and tracked using state-of-the-art CNNs. On a test set containing 408 grape clusters from images taken in a trellis-system based vineyard, we reached an F1-score of up to 0.91 for instance segmentation, a fine separation of each cluster from other structures in the image that allows a more accurate assessment of fruit size and shape. We also show how clusters can be identified and tracked along video sequences recording orchard rows. We further present a public dataset containing grape clusters properly annotated in 300 images and a novel annotation methodology for the segmentation of complex objects in natural images. The presented pipeline for annotation, training, evaluation and tracking of agricultural patterns in images can be replicated for different crops and production systems, and can be employed in the development of sensing components for several agricultural and environmental applications.
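For orientation, an off-the-shelf instance-segmentation CNN of the kind evaluated here can be run in a few lines with torchvision's Mask R-CNN; this illustrates the inference interface only, not the authors' training pipeline or weights:

    import torch
    from torchvision.models.detection import maskrcnn_resnet50_fpn

    model = maskrcnn_resnet50_fpn(num_classes=2).eval()  # background + cluster
    image = torch.rand(3, 800, 800)                      # placeholder RGB tensor
    with torch.no_grad():
        out = model([image])[0]    # dict with boxes, labels, scores, masks
    keep = out["scores"] > 0.5     # confidence threshold is an assumption
    masks = out["masks"][keep]     # (N, 1, H, W) soft instance masks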
Article
In this work, we present a new dataset to advance the state-of-the-art in fruit detection, segmentation, and counting in orchard environments. While there has been significant recent interest in solving these problems, the lack of a unified dataset has made it difficult to compare results. We hope to enable direct comparisons by providing a large variety of high-resolution images acquired in orchards, together with human annotations of the fruit on trees. The fruits are labeled using polygonal masks for each object instance to aid in precise object detection, localization, and segmentation. Additionally, we provide data for patch-based counting of clustered fruits. Our dataset contains over 41,000 annotated object instances in 1000 images. We present a detailed overview of the dataset together with baseline performance analysis for bounding box detection, segmentation, and fruit counting, as well as representative results for yield estimation. We make this dataset publicly available and host a CodaLab challenge to encourage a comparison of results on a common dataset. To download the data and learn more about the MinneApple dataset, please see the project website: http://rsn.cs.umn.edu/index.php/MinneApple. Up-to-date information is available online.
Article
This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
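The operation that lets Fast R-CNN classify many proposals from one shared feature map is RoI pooling; torchvision ships an equivalent operator, shown here as an illustration rather than the original Caffe implementation:

    import torch
    from torchvision.ops import roi_pool

    feat = torch.randn(1, 256, 50, 50)          # shared conv feature map
    # proposals as (batch_index, x1, y1, x2, y2) in input-image coordinates
    rois = torch.tensor([[0, 0.0, 0.0, 320.0, 320.0],
                         [0, 64.0, 64.0, 400.0, 256.0]])
    # spatial_scale maps image to feature-map coordinates (e.g., 1/16 for VGG16)
    pooled = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
    print(pooled.shape)                          # torch.Size([2, 256, 7, 7])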
Article
Semi-Supervised Classification (SSC), which makes use of both labeled and unlabeled data to determine classification borders in feature space, has great advantages in extracting classification information from mass data. In this paper, a novel SSC method based on Gaussian Mixture Models (GMMs) is proposed, in which each class’s feature space is described by one GMM. Experiments show the proposed method can achieve high classification accuracy with a small amount of labeled data. To reach the same accuracy, supervised classification methods such as Support Vector Machines or Object-Oriented Classification must be provided with much more labeled data.
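The generative decision rule can be sketched with scikit-learn: fit one mixture per class on its labeled samples and assign new points to the class with the highest log-likelihood. The semi-supervised step that also exploits unlabeled data is omitted here; names and component counts are illustrative:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_class_gmms(X, y, n_components=3):
        # one GMM per class, fit on that class's labeled samples only
        return {c: GaussianMixture(n_components).fit(X[y == c])
                for c in np.unique(y)}

    def predict(gmms, X):
        classes = sorted(gmms)
        ll = np.stack([gmms[c].score_samples(X) for c in classes], axis=1)
        return np.array(classes)[ll.argmax(axis=1)]   # max-likelihood class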
But that’s not why: Inference adjustment by interactive prototype deselection
  • Michael Gerstenberger
  • Sebastian Lapuschkin
  • Peter Eisert
  • Sebastian Bosse
This looks like that: deep learning for interpretable image recognition
  • Chaofan Chen
  • Oscar Li
  • Daniel Tao
  • Alina Barnett
  • Cynthia Rudin
  • Jonathan K Su
Semi-ProtoPNet deep neural network for the classification of defective power grid distribution structures
  • Stefano Frizzo Stefenon
  • Gurmail Singh
  • Kin-Choong Yow
  • Alessandro Cimatti
MinneApple: A benchmark dataset for apple detection and segmentation
  • Nicolai Häni
  • Pravakar Roy
  • Volkan Isler