Few-shot Learning with Deep Triplet Networks
for Brain Imaging Modality Recognition
Santi Puch, Irina Sánchez, and Matt Rowe
QMENTA, Boston, MA, United States
{santi,irina,matt}@qmenta.com
Abstract. Image modality recognition is essential for efficient imaging workflows in current clinical environments, where multiple imaging modalities are used to better comprehend complex diseases. Emerging biomarkers from novel, rare modalities are being developed to aid in such understanding; however, the availability of these images is often limited. This scenario raises the necessity of recognising new imaging modalities without them being collected and annotated in large amounts. In this work, we present a few-shot learning model for limited training examples based on Deep Triplet Networks. We show that the proposed model is more accurate in distinguishing different modalities than a traditional Convolutional Neural Network classifier when limited samples are available. Furthermore, we evaluate the performance of both classifiers when presented with noisy samples and provide an initial inspection of how the proposed model can incorporate measures of uncertainty to be more robust against out-of-sample examples.
Keywords: Brain imaging, Modality recognition, Few-shot learning,
Triplet loss, Uncertainty, Noise
1 Introduction
In recent decades, many useful imaging biomarkers have emerged from multiple
imaging modalities such as CT, PET, SPECT and MRI (and its many sub-
modalities) to assist with differential diagnosis, disease monitoring and measur-
ing the efficacy of pharmaceutical treatments. Diagnostic workflows and clini-
cal trials have therefore become dependent on the simultaneous use of multiple
modalities to augment the clinical understanding of complex diseases. This di-
versity of imaging modalities creates complexity for image archival systems such
as PACS, VNAs and cloud-based solutions, and the institutions or businesses
that use them.
Classification of modalities and sub-modalities is important for efficient imaging workflows, and it is a particularly difficult problem in MRI, as the many distinct sub-modalities are not differentiated in a simple and consistent manner by image header information. For example, the field indicating the use of contrast enhancing agents is often accidentally omitted or improperly populated in DICOM headers, meaning the use of a contrast enhancing agent can only be determined from the features of the image itself. In molecular imaging, an increasing variety of radioligands are being developed for monitoring different disease processes, each having distinct patterns of uptake or deposition. A human expert can easily distinguish them by their distinct visual features; however, scanner-, vendor- and center-specific idiosyncrasies in sequence implementation result in inconsistencies in DICOM header information that make automatic classification from DICOM headers alone highly challenging.
Due to the importance of the visual features of the images to classify, the
problem lends itself to Convolutional Neural Networks (CNNs), which have
proved to be highly successful at achieving near human-level performance at
classifying images based on visual features [2]. A challenge to using CNNs for
this kind of application is that they require large volumes of annotated data,
which can be difficult to obtain for novel imaging biomarkers or rare modalities.
For example, in a clinical trial utilising a novel imaging biomarker, it might be
difficult to collect more than a handful of examples of the associated imaging
sequence at startup. However, during the course of the trial, thousands of images
may be acquired, requiring specific expertise to properly classify each sequence.
Few-shot learning techniques offer a solution to creating robust classifiers from
a limited amount of training data.
In this paper, we propose a few-shot learning model based on Deep Triplet
Networks, capable of capturing the most relevant imaging features that enable
the differentiation between modalities even if the amount of training examples
is limited.
2 Methods
2.1 Data
We collect a brain imaging dataset that consists of 7 MRI sequences (T1, T2, post-contrast T1, T2-FLAIR, PD, PASL and MRA), CT and FDG-PET imaging, sourced from several public datasets that include brain scans from healthy and diseased individuals. We consider two categories for these modalities: base modalities (T1, T2, CT and FDG-PET), which are the most abundant and have the most distinctive imaging traits, and few-shot modalities (T1-post, T2-FLAIR, PD, PASL and MRA).

To train and evaluate the models, we extract 2D slices by sampling a normal distribution centered around the middle slice of the brain along the sagittal, coronal and axial axes, as sketched below. We sample 30874 slices of T1, 231759 of T2, 18541 of CT, 15432 of FDG-PET, 8017 of T1-post, 9828 of T2-FLAIR, 8370 of PD, 5321 of PASL and 8462 of MRA images. We use 70% for training, 10% for evaluation and 20% for test.
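The paper does not include code for this sampling step; the following is a minimal sketch of it in NumPy. The width of the sampling distribution is not reported, so `std_frac` is an assumed placeholder parameter.

```python
import numpy as np

def sample_slices(volume, n_per_axis=5, std_frac=0.15, rng=None):
    """Sample 2D slices along each axis, drawing slice indices from a
    normal distribution centred on the middle slice.

    std_frac (std as a fraction of the axis length) is an assumption;
    the paper does not report the width of the sampling distribution.
    """
    rng = rng if rng is not None else np.random.default_rng()
    slices = []
    for axis in range(3):  # sagittal, coronal, axial
        size = volume.shape[axis]
        idx = rng.normal(loc=size / 2.0, scale=std_frac * size, size=n_per_axis)
        idx = np.clip(np.round(idx), 0, size - 1).astype(int)
        slices.extend(np.take(volume, i, axis=axis) for i in idx)
    return slices

vol = np.random.rand(160, 192, 160)   # stand-in for a brain volume
print(len(sample_slices(vol)))        # 15 slices: 5 per axis, 3 axes
```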
2.2 Deep Triplet Networks
We approach the few-shot learning problem with Triplet Networks [4]. A Triplet Network is a type of metric learning algorithm designed to learn a metric embedding $\phi(x)$ and a corresponding distance function $d(x, x')$ induced by a normed metric, so that given a triplet of samples $(x, x^+, x^-)$ and a similarity measure $r(x, x')$ that satisfies $r(x, x^+) > r(x, x^-)$, the learned distance function satisfies $d(x, x^+) < d(x, x^-)$. In essence, Triplet Networks learn to project samples into an embedding space in which similar samples are closer and dissimilar samples are farther apart with respect to a normed metric.
Fig. 1: A Deep Triplet Network takes an anchor, a positive and a negative sample,
computes their embeddings with a deep CNN and then learns a distance function
that satisfies the similarities between the samples of the triplet.
In our experimental setting, which corresponds to a multi-class image classification problem, the similarity measure $r(x, x')$ is defined by the labeling of our samples, that is, $r(x, x') = 1$ if $x$ and $x'$ belong to the same class and $r(x, x') = 0$ if $x$ and $x'$ belong to different classes. We define our distance function using the $L_1$ normed metric as follows:

$$d(x, x') = \lVert \phi(x) - \phi(x') \rVert_1 \qquad (1)$$

where $\phi(x)$ is implemented with a deep CNN, hence the Deep Triplet Networks naming. Typically, the samples of the triplet $(x, x^+, x^-)$ are referred to as anchor, positive and negative; the anchor and positive samples belong to the same class, while the negative sample belongs to a different class. A diagram of a Deep Triplet Network is depicted in Figure 1. A minimal sketch of the embedding and distance computation is given below.
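As an illustration, here is a minimal sketch of the embedding function $\phi$ and the $L_1$ distance of Eq. (1), assuming a PyTorch implementation with a recent torchvision (the paper does not specify the framework). The 64-dimensional embedding size is taken from Section 3.1.

```python
import torch
import torchvision.models as models

class EmbeddingNet(torch.nn.Module):
    """phi(x): ResNet-50 backbone mapping an image to a 64-dim embedding.
    The final classification layer is replaced by the embedding head;
    weights are initialised from ImageNet."""
    def __init__(self, dim=64):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        backbone.fc = torch.nn.Linear(backbone.fc.in_features, dim)
        self.net = backbone

    def forward(self, x):
        return self.net(x)

def l1_distance(a, b):
    """d(x, x') = ||phi(x) - phi(x')||_1, as in Eq. (1)."""
    return (a - b).abs().sum(dim=-1)

phi = EmbeddingNet()
x, x_pos = torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224)
print(l1_distance(phi(x), phi(x_pos)).shape)  # torch.Size([2])
```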
2.3 Triplet loss with online hard-mining
The loss used to train Deep Triplet Networks, referred to as the triplet loss, is defined as follows:

$$L(x, x^+, x^-) = \max\big(d(x, x^+) - d(x, x^-) + m,\; 0\big) + \lambda\big(\lVert \phi(x) \rVert_2 + \lVert \phi(x^+) \rVert_2 + \lVert \phi(x^-) \rVert_2\big) \qquad (2)$$

where $m$ is a margin that controls how much farther apart we want the negative sample to be with respect to the anchor and positive samples, and $\lambda$ is a hyperparameter that controls the amount of $L_2$-norm penalization of the embedding vectors.

We implement an online hard-mining triplet loss, which has been shown to be more efficient and to help convergence [5]. Instead of computing the embeddings on the whole training set in an offline fashion and then mining the hard triplets, which satisfy $d(x, x^-) < d(x, x^+)$, we compute the embeddings on a mini-batch of $B$ images and then create a valid triplet with the hardest positive and the hardest negative for each anchor within that mini-batch [3]. We choose a batch size of $B = 64$, as it provides a good balance between memory demand and a number of samples large enough to mine valid triplets among a variety of classes. A sketch of this mining scheme is given below.
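The following is a minimal PyTorch sketch of this batch-hard mining scheme. Two simplifications are our own: the $L_2$ penalty is applied to the mean embedding norm over the batch rather than to the three triplet members of Eq. (2) individually, and anchors with no in-batch positive contribute a positive distance of 0.

```python
import torch

def batch_hard_triplet_loss(emb, labels, margin=2.0, lam=0.05):
    """Online hard-mining triplet loss over a mini-batch (cf. Eq. 2).

    For each anchor: hardest positive = farthest same-class sample,
    hardest negative = closest different-class sample, using pairwise
    L1 distances; then hinge with margin m plus an L2 embedding penalty.
    """
    dist = torch.cdist(emb, emb, p=1)                        # (B, B) L1 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)        # same-class mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)

    pos_mask = same & ~eye
    neg_mask = ~same

    # Hardest positive: largest distance among same-class pairs
    hardest_pos = (dist * pos_mask).max(dim=1).values
    # Hardest negative: smallest distance among different-class pairs
    hardest_neg = dist.masked_fill(~neg_mask, float("inf")).min(dim=1).values

    hinge = torch.clamp(hardest_pos - hardest_neg + margin, min=0)
    l2_penalty = lam * emb.norm(p=2, dim=1).mean()  # batch-level simplification
    return hinge.mean() + l2_penalty

emb = torch.randn(64, 64, requires_grad=True)  # B=64 embeddings of dim 64
labels = torch.randint(0, 9, (64,))            # 9 modality classes
loss = batch_hard_triplet_loss(emb, labels)
loss.backward()
```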
2.4 Pipeline for image classification with Deep Triplet Networks
Fig. 2: Diagram of the end-to-end pipeline for image classification with Deep
Triplet Networks
We propose a pipeline for medical image volume classification based on Deep Triplet Networks. The pipeline, shown in Figure 2, starts with a preprocessing and slice sampling step that normalizes the orientation and image intensities of the volume and samples slices along the acquisition plane, emphasizing the sampling density around the FOV center. Each slice is then passed through a CNN that consists of a ResNet-50 [1] initialized with pre-trained weights from ImageNet and trained with the triplet loss previously described. The embedding vectors extracted per slice are then projected to a lower-dimensional space using Principal Component Analysis (PCA), in order to remove the noisy components of the embedded representation [6]. The PCA-projected embeddings are then clustered with a Gaussian Mixture Model (GMM) via expectation maximisation (EM). Unlike other clustering algorithms, such as k-means, a GMM is capable of capturing non-spherical cluster structures and, due to its probabilistic nature, provides estimates of the likelihood of a sample belonging to the model. We set the number of components of the GMM equal to the number of classes, and create a cluster-to-label mapping function by assigning to each cluster its most common class. From a GMM we can extract the posterior probability of each slice, that is, the probability that a sample came from each of the components of the mixture. We leverage this property to implement a hard decision function in which each slice is assigned the class with maximum probability, and the volume is classified by majority voting over its slices. A sketch of this classification stage follows.
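Here is a minimal scikit-learn sketch of this classification stage. It assumes slice embeddings and integer labels are already available, and that every mixture component captures at least one training slice (otherwise the majority-class lookup would be undefined).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fit_embedding_classifier(train_emb, train_labels, n_classes, n_pca=9):
    """Fit the PCA + GMM stage on slice embeddings, then build the
    cluster-to-label map by taking the majority class per cluster."""
    pca = PCA(n_components=n_pca).fit(train_emb)
    z = pca.transform(train_emb)
    gmm = GaussianMixture(n_components=n_classes,
                          covariance_type="full").fit(z)
    clusters = gmm.predict(z)
    cluster_to_label = {
        c: np.bincount(train_labels[clusters == c]).argmax()
        for c in range(n_classes)
    }
    return pca, gmm, cluster_to_label

def classify_volume(slice_emb, pca, gmm, cluster_to_label):
    """Hard-assign each slice to its most probable mixture component,
    map components to classes, then majority-vote over the slices."""
    post = gmm.predict_proba(pca.transform(slice_emb))  # (n_slices, K)
    slice_classes = [cluster_to_label[k] for k in post.argmax(axis=1)]
    return np.bincount(slice_classes).argmax()
```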
3 Experiments
3.1 Hyperparameter search
We use grid search to obtain the optimal parameters of the model. The hyperparameters and options explored to optimize the network architecture are:

Optimizer: ADAM, SGD with Nesterov momentum.
Learning rate: 1e-3, 1e-4, 1e-5.
Learning rate decay: exponential decay with a decay rate of 0.9 every 1000 steps, or no decay.

The best performance was obtained when using SGD with Nesterov momentum as the optimizer, a learning rate of 1e-3 and learning rate decay. In all experiments we set the $L_2$ and $L_1$ regularization of the weights to 1e-5 and 1e-6 respectively, the margin $m$ of the triplet loss to 2, the $L_2$ penalization of the embeddings $\lambda$ to 0.05, and the dimension of the embedding space to 64. We also perform random left-right and up-down flips as data augmentation. This configuration, sketched below, is used to evaluate the performance of the proposed model in all subsequent experiments.
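For illustration, the selected optimization configuration could be set up as follows in PyTorch. This is a sketch: the momentum value of 0.9 is our assumption (the paper does not report it), and the $L_1$ weight penalty would be added to the loss manually, since the built-in `weight_decay` only covers the $L_2$ term.

```python
import torch

model = torch.nn.Linear(64, 64)  # stand-in for the embedding CNN
# SGD with Nesterov momentum, lr = 1e-3, L2 weight regularisation 1e-5
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, nesterov=True,
                            weight_decay=1e-5)
# Step-wise exponential decay: multiply lr by 0.9 every 1000 steps
scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                            step_size=1000, gamma=0.9)

for step in range(3000):
    loss = model(torch.randn(8, 64)).pow(2).mean()  # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```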
Furthermore, for each set of experiments, we evaluate the PCA projector using different numbers of projection components in order to select the best configuration. After evaluating the results, we select a PCA with 9 components. The GMM is configured so that each component has its own general covariance matrix.
3.2 Few-shot learning
We compare the performance of our proposed Triplet Network (TN) classifier against a standard CNN classifier when training with all the available data (exp1) and when restricting the number of slices of the few-shot classes (exp2). In the latter, we restrict each few-shot class to only 150 slices, which corresponds to 10 volumes from which 5 slices have been sampled along each of the 3 orthogonal axes. The CNN classifier is based on the same architecture and pre-trained weights as the TN classifier, plus a fully-connected layer to directly predict the class from the imaging data, as sketched below.
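A minimal sketch of this baseline construction, under the same PyTorch assumption as the earlier snippets:

```python
import torch
import torchvision.models as models

# Baseline CNN classifier: the same ResNet-50 backbone and ImageNet
# weights as the TN classifier, plus a fully-connected head that
# predicts the class directly from the image.
n_classes = 9  # 7 MRI sequences + CT + FDG-PET
cnn = models.resnet50(weights="IMAGENET1K_V1")
cnn.fc = torch.nn.Linear(cnn.fc.in_features, n_classes)

cnn.eval()
with torch.no_grad():
    logits = cnn(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 9])
```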
In Table 1 we present the class-wise average of the precision, recall and F1-score, and the balanced accuracy for both experiments. The standard CNN classifier performs well when trained with all the available data, but is unable to capture the relevant imaging traits of the few-shot classes when training data is scarce. The TN classifier, however, is able to produce an embedding space (Figure 3) that separates the modalities into distinct clusters, allowing better classification despite the under-representation of some classes.
Table 1: Classification metrics of the Triplet Network classifier (TN) and the standard CNN classifier (CNN). B: base classes; F: few-shot classes.

| Model | Precision(B) | Recall(B) | F1-score(B) | Precision(F) | Recall(F) | F1-score(F) | Accuracy |
|---|---|---|---|---|---|---|---|
| CNN classifier - exp1 | 0.98 | 0.99 | 0.9875 | 0.966 | 0.924 | 0.944 | 0.953 |
| TN classifier - exp1 | 1 | 0.938 | 0.965 | 0.89 | 0.996 | 0.93 | 0.971 |
| CNN classifier - exp2 | 0.782 | 0.995 | 0.887 | 1 | 0.332 | 0.396 | 0.626 |
| TN classifier - exp2 | 0.92 | 0.967 | 0.942 | 0.816 | 0.702 | 0.746 | 0.819 |
Fig. 3: Representation of the embedding space using the first three principal
components of the evaluation embedding’s projection on experiment 1 (left) and
experiment 2 (right). Orange: T2, brown: T1, blue: CT, red: FDG-PET, purple:
T2-FLAIR, yellow: T1-post, green: PD, pink: MRA, cyan: PASL.
3.3 Robustness against noise
We measure the robustness of both classifiers when the dataset is corrupted with additive Gaussian noise and salt-and-pepper noise, generated as in the sketch below. We consider the scenario where the model has been trained with data randomly corrupted by noise and tested with corrupted samples (exp3), and the scenario where the model has been trained with curated data but is also tested with corrupted samples (exp4). Furthermore, we analyze the performance when limiting the number of instances of the few-shot classes in both exp3 and exp4, as described in the previous section.
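The noise parameters are not reported in the paper; the corruptions could be generated as in the following NumPy sketch, where the noise levels are placeholder assumptions.

```python
import numpy as np

def add_gaussian_noise(img, sigma=0.1, rng=None):
    """Additive Gaussian noise; sigma is an assumed level (the paper
    does not report the noise parameters)."""
    rng = rng if rng is not None else np.random.default_rng()
    return img + rng.normal(0.0, sigma, img.shape)

def add_salt_and_pepper(img, amount=0.05, rng=None):
    """Salt-and-pepper noise: a random fraction of pixels is set to the
    image minimum (pepper) or maximum (salt). `amount` is an assumption."""
    rng = rng if rng is not None else np.random.default_rng()
    out = img.copy()
    mask = rng.random(img.shape)
    out[mask < amount / 2] = img.min()       # pepper
    out[mask > 1 - amount / 2] = img.max()   # salt
    return out
```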
In Table 2 we show the class-wise average of the precision, recall and F1-score, and the balanced accuracy for the experiments where the data is corrupted with additive Gaussian noise and salt-and-pepper noise. With additive Gaussian noise, in both experiments and both scenarios (with and without limiting the few-shot classes), the TN classifier outperforms the CNN classifier, thus providing a more robust model. As expected, when the model has observed samples corrupted with noise during training, the performance is better than when the training data is all curated. With salt-and-pepper noise, when randomly corrupted samples are used during training the CNN classifier performs better than the TN classifier, but its results decrease considerably when the few-shot classes are limited, while our proposed model is able to maintain good performance. Both networks achieve poor results when trained with curated data and tested on samples with salt-and-pepper noise. It is interesting to observe that the performance of the CNN classifier is similar for both types of noise, while the performance of the TN classifier decreases substantially with salt-and-pepper noise.
Table 2: Classification metrics of the Triplet Network classifier (TN) and the standard CNN classifier (CNN) when trained with data corrupted with noise (exp3) and when trained with curated data but tested with corrupted volumes (exp4). B: base classes; F: few-shot classes.

Additive Gaussian noise:

| Model | Precision(B) | Recall(B) | F1-score(B) | Precision(F) | Recall(F) | F1-score(F) | Accuracy |
|---|---|---|---|---|---|---|---|
| CNN classifier - exp3 | 0.99 | 0.987 | 0.987 | 0.956 | 0.93 | 0.938 | 0.955 |
| TN classifier - exp3 | 0.992 | 0.947 | 0.97 | 0.888 | 0.942 | 0.902 | 0.97 |
| CNN classifier limit - exp3 | 0.815 | 0.997 | 0.887 | 1 | 0.328 | 0.4 | 0.625 |
| TN classifier limit - exp3 | 0.942 | 0.965 | 0.952 | 0.658 | 0.62 | 0.622 | 0.773 |
| CNN classifier - exp4 | 0.85 | 0.742 | 0.735 | 0.964 | 0.478 | 0.638 | 0.596 |
| TN classifier - exp4 | 0.992 | 0.687 | 0.787 | 0.754 | 0.774 | 0.682 | 0.737 |
| CNN classifier limit - exp4 | 0.725 | 10.817 | 0.732 | 0.742 | 0.194 | 0.29 | 0.47 |
| TN classifier limit - exp4 | 0.927 | 0.67 | 0.765 | 0.634 | 0.678 | 0.588 | 0.673 |

Salt-and-pepper noise:

| Model | Precision(B) | Recall(B) | F1-score(B) | Precision(F) | Recall(F) | F1-score(F) | Accuracy |
|---|---|---|---|---|---|---|---|
| CNN classifier - exp3 | 0.982 | 0.985 | 0.982 | 0.946 | 0.916 | 0.93 | 0.947 |
| TN classifier - exp3 | 0.96 | 0.9375 | 0.945 | 0.658 | 0.738 | 0.668 | 0.827 |
| CNN classifier limit - exp3 | 0.765 | 0.99 | 0.87 | 0.914 | 0.31 | 0.384 | 0.625 |
| TN classifier limit - exp3 | 0.932 | 0.952 | 0.94 | 0.782 | 0.75 | 0.756 | 0.839 |
| CNN classifier - exp4 | 0.832 | 0.612 | 0.647 | 0.822 | 0.522 | 0.576 | 0.561 |
| TN classifier - exp4 | 0.912 | 0.49 | 0.6325 | 0.798 | 0.562 | 0.538 | 0.53 |
| CNN classifier limit - exp4 | 0.785 | 0.505 | 0.602 | 0.616 | 0.26 | 0.25 | 0.47 |
| TN classifier limit - exp4 | 0.722 | 0.49 | 0.545 | 0.64 | 0.396 | 0.448 | 0.44 |
3.4 Investigation of uncertainty measures
We investigate the use of the estimated log-likelihood of a sample under the GMM as a measure of uncertainty. To do so, we obtain a minimum log-likelihood threshold by taking the 1st percentile of the log-likelihoods over the training data, which corresponds to a value of -12.44, and compare this threshold with the estimated log-likelihood of samples: a) that come from one of the classes in our dataset; b) that come from classes not represented in our dataset (e.g. volumes with binary masks or derived images). A sketch of this thresholding scheme is given below.
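The following sketch reuses the fitted PCA and GMM from the pipeline sketch in Section 2.4. Aggregating slice log-likelihoods by their mean is our assumption; the paper does not specify how per-slice values are combined at the volume level.

```python
import numpy as np

def loglik_threshold(gmm, pca, train_emb, pct=1):
    """Minimum log-likelihood threshold: the pct-th percentile of the
    per-slice log-likelihoods over the training set (Section 3.4)."""
    ll = gmm.score_samples(pca.transform(train_emb))
    return np.percentile(ll, pct)

def is_out_of_sample(gmm, pca, slice_emb, threshold):
    """Flag a volume as out-of-sample if its mean slice log-likelihood
    falls below the threshold (mean aggregation is an assumption)."""
    ll = gmm.score_samples(pca.transform(slice_emb))
    return ll.mean() < threshold
```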
In Figure 4 we can see examples of the proposed experimental setting. We
observe that a sample from a class represented in the dataset (in our case, a
T1 volume from the test split) presents a log-likelihood value above the pro-
posed threshold. However, samples from classes not represented in the dataset
(concretely, a segmentation map, a filtered image and a probability map) have
a log-likelihood value lower than the proposed threshold.
This basic observation provides an initial validation that uncertainty estimates can be obtained by combining a Deep Triplet Network with a GMM, thus providing the capability of discerning out-of-sample modalities.
Fig. 4: Three samples of classes not represented in our dataset and a T1 slice,
with their corresponding log-likelihood.
4 Conclusions
We have provided evidence that Deep Triplet Networks are a viable solution for modality classification in a few-shot setting. The proposed model, when trained with 30 times fewer instances of the rarer classes, substantially surpasses the performance of a CNN classifier trained under the same conditions. We have also concluded that creating an embedding space with a triplet network strategy increases robustness against noise when compared to a standard CNN classifier: the results are not markedly altered when the data is corrupted, regardless of whether the model has been trained with all the available samples or with a limited number of instances. Finally, we have explored the use of the log-likelihood estimates of our model as a measure of uncertainty by evaluating this measure on samples not belonging to our dataset. We have found that this measure can effectively serve as an initial basis for uncertainty estimation, hence making our model more robust to unseen examples. This observation is preliminary, and further investigation and development are required. Future work will focus on this topic, as well as on extending the proposed model to alternative problems, such as disease staging.
References
1. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR abs/1512.03385 (2015), http://arxiv.org/abs/1512.03385
2. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026-1034 (2015)
3. Hermans, A., Beyer, L., Leibe, B.: In defense of the triplet loss for person re-identification. CoRR abs/1703.07737 (2017)
4. Hoffer, E., Ailon, N.: Deep metric learning using triplet network. In: ICLR (2015)
5. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: A unified embedding for face recognition and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815-823 (2015)
6. Wang, W., Carreira-Perpinan, M.A.: The role of dimensionality reduction in classification. In: Twenty-Eighth AAAI Conference on Artificial Intelligence (2014)