mil-benchmarks: Standardized Evaluation of Deep
Multiple-Instance Learning Techniques
Daniel Grahn
Department of Computer Science and Engineering
Wright State University
Dayton, OH
dan.grahn@wright.edu
Abstract—Multiple-instance learning is a subset of weakly
supervised learning where labels are applied to sets of instances
rather than the instances themselves. Under the standard assumption, a set is positive only if there is at least one instance in the set which is positive.
This paper introduces a series of multiple-instance learning
benchmarks generated from MNIST, Fashion-MNIST, and CIFAR-10. These benchmarks test the standard, presence, absence,
and complex assumptions and provide a framework for future
benchmarks to be distributed. I implement and evaluate several
multiple-instance learning techniques against the benchmarks.
Further, I evaluate the Noisy-And method with label noise
and find mixed results with different datasets. The models are
implemented in TensorFlow 2.4.1 and are available on GitHub.
The benchmarks are available from PyPi as mil-benchmarks and
on GitHub.
Index Terms—Multiple-instance learning, Weakly supervised
learning, Artificial neural networks
I. INTRODUCTION
Supervised learning is a large and well-studied branch of
machine learning research. In the typical supervised learning
dataset, each input instance has a corresponding label. Models
are trained directly on the instances and labels. However,
there are many applications where providing a label for each instance is not feasible due to data collection, scope, budget, or other constraints. In order to handle these cases, supervised learning
is weakened in different ways. One weakening of supervised
learning is Multiple-Instance Learning (MIL). MIL does not
provide labels for each input instance. Rather, instances are
grouped into sets, commonly referred to as bags. Each of these
bags is given a label based on its contents.
Babenko provides a useful example of MIL [1]. Consider
a group of faculty who each have a key ring with several keys
on it. You know which faculty are able to access a specific
lab and which are not. The task is to build a classifier which
can predict whether any given key, and therefore the key ring
containing it, grants access to the lab.
This example uses the standard assumption that a bag (key
ring) is positive (grants access to the lab) if at least one in-
stance (key) in the bag is positive. While less common than the standard assumption, other assumptions are also used in practice [2]. And while standard datasets exist for multiple-instance learning [3]–[5], they largely use the standard assumption. In
order to advance our understanding of MIL tasks, I propose a set of benchmarks which allow other assumptions to be tested on multiple datasets.

Fig. 1. Example MIL Architecture
Additionally, many of the methods of multiple-instance
learning are derived from traditional machine learning tech-
niques [6]. As deep learning continues to grow in popularity
and improve in performance, it will become more important to
be able to use and evaluate deep-learning techniques on MIL
tasks. To support this, I evaluate several MIL methods against
the benchmarks.
This paper implements models in TensorFlow 2.4.1 [7].
Section II describes the creation of the benchmark datasets.
Section III describes the MIL methods which have been
implemented. The evaluation experiments and their results are
presented and analyzed in section IV. And section V contains
a summary and directions for future work.
II. BENCHMARKS
Since traditional machine learning techniques already work well with tabular data, the benchmarks are instead sourced from three computer vision datasets – MNIST [8], Fashion-MNIST [9], and CIFAR-10 [10].
I generate the datasets using Algorithm 1. An effect of
this algorithm is that the datasets are not balanced. Table II
contains a detailed breakdown of the ratio of positive instances
in each benchmark dataset.
A. Standard
The standard assumption is that a bag will be labelled
as positive if it contains at least one instance of a specific
class of interest. The three selected datasets each have ten
different classes. The classes have different difficulties for
classification. To avoid choosing classes which are overly
easy or hard as the class of interest, I generate datasets for
each class. This provides 30 benchmarks with 10 from each
source. If computationally feasible, evaluators should use all
the benchmarks from a source and aggregate the results.
B. Presence & Absence
The presence assumption labels a bag as positive if any one of the members of a subset of classes is present in the bag. For instance, a bag from MNIST may be positive if either a 0 or a 1 is present. The benchmarks include 9 presence benchmarks from each dataset using classes of interest (0,1), (1,2), ..., (8,9).
The absence assumption is the negation of the presence.
A bag is positive if it does not include a subset of classes.
The benchmarks include 9 absence benchmarks for the same
classes of interest. Again, evaluators should prefer aggregate
results across these benchmarks.
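To make these rules concrete, the sketch below labels a single bag under the standard, presence, and absence assumptions. The function names and the NumPy-based implementation are illustrative assumptions of mine, not taken from the released benchmark code.

```python
import numpy as np

def label_standard(instance_labels, target):
    """Standard assumption: positive if the target class appears at least once."""
    return int(np.any(instance_labels == target))

def label_presence(instance_labels, targets):
    """Presence assumption: positive if any class in `targets` appears."""
    return int(np.any(np.isin(instance_labels, targets)))

def label_absence(instance_labels, targets):
    """Absence assumption: positive if no class in `targets` appears."""
    return 1 - label_presence(instance_labels, targets)

# Example: an MNIST bag containing the digits [3, 0, 7, 7]
bag = np.array([3, 0, 7, 7])
print(label_standard(bag, 0))        # 1 -- a 0 is present
print(label_presence(bag, (0, 1)))   # 1 -- a 0 or a 1 is present
print(label_absence(bag, (0, 1)))    # 0 -- the negation of presence
```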
C. Complex
Two datasets are included with a complex assumption and
are based on whether an entire outfit is present in a bag. Table I
contains the definitions of the outfits. The basic outfit assumes
that an outfit contains a top, bottom, and shoes but does not
require a handbag. I consider coats as tops because the two
classes are difficult to differentiate. I also included dresses
with the tops because dresses may be worn with bottoms. The
multi outfit is stricter. It allows for two types of outfits: a t-
shirt or shirt with trousers and sneakers or boots – or – a dress
with a bag and sandals or boots.
TABLE I
OUTFIT CLASSES

Outfit  Contents
Basic   (0 ∨ 2 ∨ 3 ∨ 4 ∨ 6) ∧ (1) ∧ (6 ∨ 7 ∨ 9)
Multi   [(0 ∨ 6) ∧ (1) ∧ (7 ∨ 9)] ∨ [(3) ∧ (8) ∧ (5 ∨ 9)]
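The following sketch expresses the two outfit rules from Table I as labeling functions over Fashion-MNIST class indices. It follows the table verbatim, and the function names are my own illustration rather than the released code.

```python
def label_basic_outfit(instance_labels):
    """Basic outfit per Table I: a top, trousers, and shoes must all be present."""
    present = set(instance_labels)
    has_top = bool(present & {0, 2, 3, 4, 6})
    has_bottom = 1 in present
    has_shoes = bool(present & {6, 7, 9})
    return int(has_top and has_bottom and has_shoes)

def label_multi_outfit(instance_labels):
    """Multi outfit per Table I: a shirt/trousers/shoes outfit
    or a dress/bag/sandal-or-boot outfit."""
    present = set(instance_labels)
    outfit_a = bool(present & {0, 6}) and 1 in present and bool(present & {7, 9})
    outfit_b = 3 in present and 8 in present and bool(present & {5, 9})
    return int(outfit_a or outfit_b)

# Example bags: t-shirt, trousers, sneaker, bag -- and -- dress, bag, sandal
print(label_basic_outfit([0, 1, 7, 8]))  # 1
print(label_multi_outfit([3, 8, 5]))     # 1
```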
III. MODELS
A. Baseline Model
Before tackling the multiple-instance learning techniques, I
train a baseline model which can separate MNIST, Fashion-
MNIST, and CIFAR-10. Perfect performance is not neces-
sary. Given my limited computational resources, it is more
important to have a lightweight model than state-of-the-art
performance.
To build this baseline model, I perform minimal modifica-
tion of the input data. The input images are normalized to
[0..1] rather than [0..255]. And the classes are converted from
nominal to one-hot representation. Table III shows the entire
Algorithm 1 Dataset Generation Algorithm
Input: x_ix (instance indices), y, min_bag = 3, max_bag = 7
Output: x_b, y_b
 1: x_s, y_s = shuffle(x_ix, y)
 2: x_b, y_b = [], []
 3: i = 0
 4: while |x_s| − i > max_bag do
 5:     s = random integer in [min_bag, max_bag]
 6:     l = label for y_s[i..i+s] according to the assumption
 7:     append x_s[i..i+s] to x_b
 8:     append l to y_b
 9:     i = i + s
10: end while
11: l = label for y_s[i..] according to the assumption
12: append x_s[i..] to x_b
13: append l to y_b
14: return x_b, y_b
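A minimal Python sketch of this generation procedure is given below. The label_fn argument stands in for whichever assumption is being applied, and the variable and function names follow Algorithm 1 rather than the released implementation; as in the algorithm, the final bag may contain fewer than min_bag instances.

```python
import random

def generate_bags(x_indices, y, label_fn, min_bag=3, max_bag=7, seed=42):
    """Group shuffled instance indices into bags of 3-7 and label each bag."""
    rng = random.Random(seed)
    order = list(range(len(y)))
    rng.shuffle(order)
    xs = [x_indices[i] for i in order]
    ys = [y[i] for i in order]

    x_b, y_b = [], []
    i = 0
    while len(xs) - i > max_bag:
        s = rng.randint(min_bag, max_bag)      # bag size for this iteration
        x_b.append(xs[i:i + s])
        y_b.append(label_fn(ys[i:i + s]))      # bag label under the chosen assumption
        i += s
    # Remaining instances form the final bag
    x_b.append(xs[i:])
    y_b.append(label_fn(ys[i:]))
    return x_b, y_b

# Example: standard assumption with class of interest 0
bags, labels = generate_bags(list(range(10)), [3, 0, 7, 1, 4, 4, 2, 9, 0, 5],
                             lambda ls: int(0 in ls))
```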
configuration of the baseline model. The model trained on
CIFAR-10 achieves an F1 of 0.718 and an AUC of 0.963.
On Fashion-MNIST it reports an F1 of 0.921 and an AUC of
0.9961. And on MNIST, an F1 of 0.992 and an AUC of 0.999.
When run on CIFAR-10, the input shape is changed but all
other layers are left identical.
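The architecture in Table III can be written as the following Keras sketch. The layer settings come directly from the table; the input shape shown assumes the 28×28×1 MNIST/Fashion-MNIST case, and for CIFAR-10 only the input shape changes to 32×32×3.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_baseline(input_shape=(28, 28, 1), num_classes=10):
    """Baseline classifier from Table III (roughly 316K parameters)."""
    return models.Sequential([
        layers.Conv2D(64, kernel_size=2, padding='same', activation='relu',
                      input_shape=input_shape),
        layers.MaxPooling2D(pool_size=2),
        layers.Conv2D(32, kernel_size=3, padding='valid', activation='relu'),
        layers.MaxPooling2D(pool_size=2),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation='softmax'),
    ])

model = build_baseline()
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=[tf.keras.metrics.AUC()])
# model.fit(x_train, y_train_onehot, epochs=10)
```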
B. Adapting for MIL
Before running the baseline model with MIL methods, a
few modifications must be made. The simplest modification
is the number of output classes. Instead of having 10 output
classes, the MIL model has 2. This is changed in the baseline
model architecture.
The baseline model is designed to accept one image at a time and produce a prediction for it. The MIL models must accept 3 to 7 different images and produce a single bag-level prediction.
To accomplish this, each bag is padded with zeros so that it
has the same size as a bag with seven instances. While this is
not strictly necessary, it allows training in batches rather than
on a single bag at a time.
Next, each layer from the baseline model is wrapped
in tensorflow.keras.layers.TimeDistributed.
This allows each instance within the bags to propagate through
the baseline network with identical weights. It minimizes the
number of parameters to be trained and also enforces that the
position of the instances in the bags is irrelevant.
With the above described modifications, the model makes
a prediction for each instance in the bag. The different MIL
methods used are focused on taking instance-level predictions
and merging them into a bag-level prediction.
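A sketch of this adaptation, ending with the max-pool head of Model 2, is shown below. For brevity it wraps the whole instance sub-model in a single TimeDistributed call rather than each layer individually; the shared-weight behaviour is the same. The bag size of 7 reflects the zero-padding described above, and masking of the padded instances is omitted.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

MAX_BAG = 7  # bags are zero-padded to seven instances

def build_mil_max_pool(input_shape=(28, 28, 1)):
    """Wrap the baseline layers so every instance shares weights, then pool
    instance-level predictions into one bag-level prediction (Model 2)."""
    instance_model = models.Sequential([
        layers.Conv2D(64, 2, padding='same', activation='relu'),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, padding='valid', activation='relu'),
        layers.MaxPooling2D(2),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(2, activation='softmax'),  # 2 output classes for the MIL task
    ])

    bag_input = layers.Input(shape=(MAX_BAG, *input_shape))
    # Identical weights applied to each instance in the bag
    instance_preds = layers.TimeDistributed(instance_model)(bag_input)
    # Merge instance-level predictions into a bag-level prediction
    bag_pred = layers.GlobalMaxPooling1D()(instance_preds)
    return models.Model(bag_input, bag_pred)
```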
C. MIL Models
I selected five MIL models to evaluate. They are:
1) Fully-Connected Layer
2) Max Pool [11]–[13]
3) Max Pool + Fully-Connected Layer
TABLE II
DATASET RATIOS

Assumption  Positive If             Source     Num.  Avg. Train Ratio  Avg. Test Ratio
Standard    x is present            CIFAR-10   10    40.11%            40.20%
                                    Fashion    10    40.25%            40.28%
                                    MNIST      10    40.23%            40.26%
Presence    x or x+1 is present     CIFAR-10    9    65.33%            65.27%
                                    Fashion     9    65.45%            65.55%
                                    MNIST       9    65.56%            65.50%
Absence     x and x+1 are absent    CIFAR-10    9    34.68%            34.78%
                                    Fashion     9    34.55%            34.45%
                                    MNIST       9    34.44%            34.50%
Complex     Contains "Outfit"       Fashion     2    25.51%            25.50%
TABLE III
BASELINE MODEL ARCHITECTURE
Layer Output Shape # of Params Details
Convolution (28, 28, 64) 320 filters=64, kernel size=2, padding=’same’, activation=’relu’
Max Pool (14, 14, 64) 0 pool size=2
Convolution (12, 12, 32) 18,464 filters=32, kernel size=3, padding=’valid’, activation=’relu’
Max Pool (6, 6, 32) 0 pool size=2
Dropout (6, 6, 32) 0 dropout=0.3
Flatten (1152) 0
Dense (256) 295,168 activation=’relu’
Dropout (256) 0 dropout=0.5
Dense (10) 2,570 activation=’softmax’
Total Parameters: 316,522
Optimizer: Adam (default settings)
Loss: Categorical Cross-Entropy
Epochs: 10
4) Avg Pool + Fully-Connected Layer
5) Noisy-And [14]
For Model 1 (Fully Connected), the instance predictions
are flattened and fully connected to the outputs. Model 2
(Max Pool) simply applies a 1D max pooling to the output
predictions. No additional weights are applied. Model 3 is an
extension of Model 2 where a fully-connected layer is added
after the max pool. Model 4 modifies Model 3 by replacing
the Max Pooling with an average pooling. Finally, Model 5
uses a custom layer to apply the Noisy-And function defined
in Equation 1.
$$P_i = g_i(\{p_{ij}\}) = \frac{\sigma\big(a(\bar{p}_{ij} - b_i)\big) - \sigma(-a b_i)}{\sigma\big(a(1 - b_i)\big) - \sigma(-a b_i)} \qquad (1)$$

where

$$\bar{p}_{ij} = \frac{1}{|j|} \sum_{j} p_{ij} \qquad (2)$$
The Noisy-And function imitates a probabilistic And. It triggers a bag-level positive prediction (P_i) when the mean of the instance-level probabilities (p_ij) crosses a learned threshold. The parameter a controls the slope of the activation function. The authors find the best performance at a = 10, which I adopt. The b_i are parameters which adjust the threshold for each class.¹
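A minimal custom-layer sketch of Equation 1 is shown below, assuming a = 10 and one learned threshold b_i per class. The clipping mentioned in the footnote is included; the released implementation may differ in its details.

```python
import tensorflow as tf
from tensorflow.keras import layers

class NoisyAnd(layers.Layer):
    """Noisy-And pooling (Kraus et al., 2016), Equation 1."""

    def __init__(self, a=10.0, **kwargs):
        super().__init__(**kwargs)
        self.a = a

    def build(self, input_shape):
        # One learned threshold b_i per output class
        num_classes = input_shape[-1]
        self.b = self.add_weight(name='b', shape=(num_classes,),
                                 initializer='zeros', trainable=True)

    def call(self, instance_probs):
        # instance_probs: (batch, num_instances, num_classes)
        p_mean = tf.reduce_mean(instance_probs, axis=1)   # Equation 2
        num = tf.sigmoid(self.a * (p_mean - self.b)) - tf.sigmoid(-self.a * self.b)
        den = tf.sigmoid(self.a * (1.0 - self.b)) - tf.sigmoid(-self.a * self.b)
        # Clip to [0, 1]; numerical effects can push predictions outside the range
        return tf.clip_by_value(num / den, 0.0, 1.0)
```

In the adapted model, this layer simply replaces the pooling head, e.g. bag_pred = NoisyAnd()(instance_preds).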
To keep the training as consistent as possible, random seeds
are always set to 42 [15] and no modifications are made to
loss functions, optimizers, data preprocessing, or any other
hyperparameters.
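The seeding itself amounts to a few lines; this is my own sketch, and the released notebooks may set seeds slightly differently.

```python
import random
import numpy as np
import tensorflow as tf

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)
```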
IV. EXPERIMENTS
A. MIL Task
Table IV contains metrics for each model trained on each
dataset. For many of the benchmarks, Model 1 fails to learn and ends up predicting negative for all bags. This failure to converge occurs inconsistently even within a single dataset and assumption. For instance, the model learns the standard assumption for MNIST when the class of interest is 2, 3, 4, 6, 7, 8, or 9, but not when it is 0, 1, or 5. Similarly, it learns the multi-outfit assumption but fails to learn the Basic Outfit benchmark at all. This reflects an instability in the method.
Model 2 (Max Pool) is able to learn from the benchmarks. However, it does not always generalize well; several of the benchmarks post high training scores but low test scores. Additionally, the F1 and AUC metrics fluctuate from epoch to epoch rather than increasing (more or less) monotonically.
¹ Under some training circumstances, predictions for this layer fall outside the [0..1] prediction range. The predictions are clipped to avoid this.
TABLE IV
TEST RESULT METRICS
Assumption Source Fully-Connected Max Pool Max Pool + FC Avg Pool + FC Noisy-And
F1 AUC F1 AUC F1 AUC F1 AUC F1 AUC
Standard CIFAR-10 0.533 0.679 0.546 0.701 0.492 0.663 0.476 0.662 0.766 0.851
Fashion 0.848 0.849 0.632 0.770 0.728 0.820 0.550 0.724 0.935 0.972
MNIST 0.803 0.881 0.742 0.795 0.669 0.769 0.660 0.797 0.991 0.997
Presence CIFAR-10 0.539 0.712 0.463 0.710 0.494 0.716 0.519 0.715 0.624 0.776
Fashion 0.701 0.834 0.900 0.829 0.504 0.734 0.635 0.795 0.894 0.954
MNIST 0.824 0.905 0.975 0.873 0.566 0.768 0.546 0.760 0.981 0.995
Absence CIFAR-10 0.572 0.732 0.462 0.712 0.486 0.707 0.491 0.714 0.622 0.774
Fashion 0.675 0.810 0.899 0.830 0.518 0.738 0.667 0.818 0.889 0.952
MNIST 0.867 0.930 0.971 0.855 0.658 0.818 0.586 0.774 0.980 0.996
Complex Basic Outfit 0.416 0.713 0.798 0.764 0.416 0.713 0.416 0.713 0.867 0.948
Complex Multi Outfit 0.849 0.944 0.764 0.799 0.416 0.713 0.416 0.713 0.867 0.945
Fig. 2. Noisy-And Performance on Noisy Labels
Despite posting lower scores than Model 1, Model 2 learns more consistently.
When a dense layer is added after the pooling layer (Model 3), the downsides of both Model 1 and Model 2 are diminished but not entirely mitigated. Model 3 fails to converge on more datasets than Model 2, but it has a significantly smaller gap between training and test metrics in most cases. However, the model fails to learn from the Complex benchmarks and posts the minimum scores of 0.416 F1 and 0.713 AUC.
Swapping the max pool layer for an avg pool (Model 4)
has mixed results. Sometimes it performs better, sometimes it
performs worse. In most cases it has no effect. On some MIL
tasks it may be useful to swap Max for Avg pooling layers in
an attempt to improve performance. Again, the model fails to learn from the Complex benchmarks.
Finally, Model 5 (Noisy-And) is extremely successful. It
learns well on all of the datasets and posts the highest
results on nearly all benchmarks. The layer was designed specifically for MIL tasks, and its results reflect that.
B. MIL with Noise
Model 5 performs well on all of the datasets. In order to
better understand the capabilities of the Noisy-And network,
I adapt the benchmarks to include noise in the labels. This is
accomplished by randomly flipping bag labels with a probability equal to the level of noise. I then retrain the model on noise levels from 0 to 0.45 in increments of 0.05 with a single benchmark
from the MNIST and CIFAR-10 sources (classes 0 and 5
respectively) and assess the performance at each step. AUC
is an appropriate metric for this evaluation as it is insensitive
to changes in class distribution [16]. The results are displayed
in Figure 2.
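A minimal sketch of the label-flipping procedure is shown below, assuming binary bag labels; the function name and the NumPy usage are my own illustration.

```python
import numpy as np

def flip_labels(labels, noise_level, seed=42):
    """Flip each binary bag label with probability `noise_level`."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    flip = rng.random(labels.shape) < noise_level
    return np.where(flip, 1 - labels, labels)

# Example with toy bag labels; in the experiment this is applied to the
# training labels of one MNIST benchmark and one CIFAR-10 benchmark.
y_train = np.array([0, 1, 1, 0, 1, 0, 0, 1])
for noise in np.arange(0.0, 0.50, 0.05):
    noisy = flip_labels(y_train, noise)
    # ... retrain the Noisy-And model on `noisy` and record the test AUC
```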
For the MNIST dataset, the model learns with only slight
degradation of performance until the noise level passes 0.35.
Even with 45% of the labels flipped, the model achieves an
AUC of 0.706. Training on the CIFAR-10 dataset produces
different results. The AUC degrades roughly in line with
the amount of noise which has been added to the labels. I
hypothesize that the baseline network with the Noisy-And
layer is extremely capable of classifying MNIST digits and
thus can handle large amounts of noise. On the other hand, the model scores lower on CIFAR-10 even without label noise, so the drop in AUC appears immediately. More testing is necessary to confirm this hypothesis.
V. CONCLUSION
This paper makes three contributions. First, it creates and
distributes 86 MIL benchmark datasets. These datasets span
three sources and provide a means to benchmark assumptions
which are not covered by existing datasets. In addition, I
have released a Python module which provides easy access
to these datasets and is available on PyPi as mil-benchmarks
at https://pypi.org/project/mil-benchmarks/. This module pro-
vides a platform for distribution of future MIL benchmarks.
Currently, the mil-benchmarks module stores datasets as
CSVs. Future work may reduce the storage size of the module
by recreating these CSVs using reproducible random numbers.
Additional benchmarks may be created from more complicated
source data, such as:
1) CIFAR-100 [10]
2) Street View House Numbers [17]
3) miniImageNet [18]
Second, five methods were evaluated against the MIL
benchmarks. These evaluations show the inadequacy of stan-
dard fully-connected and standard pooling layers for use in the MIL task domain. However, they show that the Noisy-And MIL
layer performs much better while using the same underlying
classification model.
Future research may explore more assumptions under which
bags may be labelled. Assumptions of interest include labeling
bags as positive if:
1) A class’s representation crosses some threshold.
2) One or another class is present, but not both.
3) More complicated boolean expressions.
Finally, Noisy-And was found to produce mixed results when trained with noisy labels. I hypothesize that a model which performs well on the noise-free dataset remains highly effective in the presence of noisy labels, while a model that starts from lower noise-free performance degrades roughly in line with the noise.
The models, layer implementations, and Jupyter Notebooks
to reproduce the results of this paper are available on GitHub
at https://github.com/dgrahn/deep-mil.
REFERENCES
[1] B. Babenko, “Multiple instance learning: Algorithms and applications,” pp. 1–19, 2008.
[2] J. R. Foulds and E. Frank, “A review of multi-instance
learning assumptions,” 2010.
[3] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, “Solving the multiple instance problem with axis-parallel rectangles,” Artificial Intelligence, vol. 89, no. 1-2, pp. 31–71, 1997.
[4] B. Behmardi, F. Briggs, X. Z. Fern, and R. Raich,
“Confidence-constrained maximum entropy framework
for learning from multi-instance data,” arXiv preprint
arXiv:1603.01901, 2016.
[5] M. Paramanandam, M. O’Byrne, B. Ghosh, J. J. Mam-
men, M. T. Manipadam, R. Thamburaj, and V. Pakrashi,
“Automated segmentation of nuclei in breast cancer
histopathology images,” PloS one, vol. 11, no. 9,
e0162053, 2016.
[6] M.-A. Carbonneau, V. Cheplygina, E. Granger, and G.
Gagnon, “Multiple instance learning: A survey of prob-
lem characteristics and applications,” Pattern Recogni-
tion, vol. 77, pp. 329–353, 2018.
[7] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z.
Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean,
M. Devin, et al., “Tensorflow: Large-scale machine
learning on heterogeneous distributed systems,” arXiv
preprint arXiv:1603.04467, 2016.
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner,
“Gradient-based learning applied to document recog-
nition,” Proceedings of the IEEE, vol. 86, no. 11,
pp. 2278–2324, 1998.
[9] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: A
novel image dataset for benchmarking machine learning
algorithms,” arXiv preprint arXiv:1708.07747, 2017.
[10] A. Krizhevsky, G. Hinton, et al., “Learning multiple
layers of features from tiny images,” 2009.
[11] T.-W. Su, J.-Y. Liu, and Y.-H. Yang, “Weakly-
supervised audio event detection using event-specific
gaussian filters and fully convolutional networks,”
in 2017 IEEE international conference on acoustics,
speech and signal processing (ICASSP), IEEE, 2017,
pp. 791–795.
[12] A. Kumar and B. Raj, “Audio event detection us-
ing weakly labeled data,” in Proceedings of the 24th
ACM international conference on Multimedia, 2016,
pp. 1038–1047.
[13] J. Salamon, B. McFee, P. Li, and J. P. Bello, “Dcase
2017 submission: Multiple instance learning for sound
event detection,” Detection and Classification of Acous-
tic Scenes and Events, vol. 2017, 2017.
[14] O. Z. Kraus, J. L. Ba, and B. J. Frey, “Classifying
and segmenting microscopy images with deep multi-
ple instance learning,” Bioinformatics, vol. 32, no. 12,
pp. i52–i59, 2016.
[15] D. Adams, The Ultimate Hitchhiker’s Guide to the
Galaxy: The Complete Trilogy in Five Parts. Pan
Macmillan, 2017, vol. 6.
[16] P. A. Flach and M. Kull, “Precision-recall-gain curves: PR analysis done right,” in NIPS, vol. 15, 2015.
[17] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu,
and A. Y. Ng, “Reading digits in natural images with
unsupervised feature learning,” 2011.
[18] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu,
and D. Wierstra, “Matching networks for one shot
learning,” arXiv preprint arXiv:1606.04080, 2016.