mil-benchmarks: Standardized Evaluation of Deep
Multiple-Instance Learning Techniques
Daniel Grahn
Department of Computer Science and Engineering
Wright State University
Dayton, OH dan.grahn@wright.edu
Abstract—Multiple-instance learning is a subset of weakly supervised learning where labels are applied to sets of instances rather than the instances themselves. Under the standard assumption, a set is positive only if at least one instance in the set is positive.
This paper introduces a series of multiple-instance learning
benchmarks generated from MNIST, Fashion-MNIST, and CIFAR10. These benchmarks test the standard, presence, absence,
and complex assumptions and provide a framework for future
benchmarks to be distributed. I implement and evaluate several
multiple-instance learning techniques against the benchmarks.
Further, I evaluate the Noisy-And method with label noise
and find mixed results with different datasets. The models are
implemented in TensorFlow 2.4.1 and are available on GitHub.
The benchmarks are available from PyPi as mil-benchmarks and
on GitHub.
Index Terms—Multiple-instance learning, Weakly supervised
learning, Artificial neural networks
I. INTRODUCTION
Supervised learning is a large and well-studied branch of
machine learning research. In the typical supervised learning
dataset, each input instance has a corresponding label. Models
are trained directly on the instances and labels. However,
there are many applications where providing a label for each
instance is not feasible due to data collection, scope, budget, or other constraints. In order to handle these cases, supervised learning
is weakened in different ways. One weakening of supervised
learning is Multiple-Instance Learning (MIL). MIL does not
provide labels for each input instance. Rather, instances are
grouped into sets, commonly referred to as bags. Each of these
bags is given a label based on its contents.
Babenko provides a useful example of MIL [1]. Consider
a group of faculty who each have a key ring with several keys
on it. You know which faculty are able to access a specific
lab and which are not. The task is to build a classifier which
can predict whether any given key, and therefore the key ring
containing it, grants access to the lab.
This example uses the standard assumption that a bag (key
ring) is positive (grants access to the lab) if at least one in-
stance (key) in the bag is positive. While the standard assumption is the most widely used, other assumptions also appear in practice [2]. And while standard datasets exist for multiple-instance
learning [3]–[5], they largely use the standard assumption. In
order to advance our understanding of MIL tasks, I propose a
Fig. 1. Example MIL Architecture
set of benchmarks which allow other assumptions to be tested
on multiple datasets.
Additionally, many of the methods of multiple-instance
learning are derived from traditional machine learning tech-
niques [6]. As deep learning continues to grow in popularity
and improve in performance, it will become more important to
be able to use and evaluate deep-learning techniques on MIL
tasks. To support this, I evaluate several MIL methods against
the benchmarks.
This paper implements models in TensorFlow 2.4.1 [7].
Section II describes the creation of the benchmark datasets.
Section III describes the MIL methods which have been
implemented. The evaluation experiments and their results are
presented and analyzed in section IV. And section V contains
a summary and directions for future work.
II. BENCHMARKS
Since traditional machine learning techniques already work well with tabular data, the benchmarks are instead sourced from three
computer vision datasets – MNIST [8], Fashion-MNIST [9],
and CIFAR-10 [10].
I generate the datasets using Algorithm 1. An effect of
this algorithm is that the datasets are not balanced. Table II
contains a detailed breakdown of the ratio of positive instances
in each benchmark dataset.
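As a rough Python sketch of the generation procedure (Algorithm 1), the following function groups shuffled instances into bags of three to seven members and labels each bag with a caller-supplied assumption function. The function and variable names are illustrative and are not taken from the released mil-benchmarks package.

import random

def make_bags(instances, labels, label_fn, min_bag=3, max_bag=7, seed=42):
    """Group instances into bags of min_bag..max_bag members.

    label_fn receives the instance labels of one bag and returns the bag
    label according to the chosen assumption (standard, presence, ...).
    """
    rng = random.Random(seed)
    order = list(range(len(instances)))
    rng.shuffle(order)  # shuffle instances before bagging
    xs = [instances[i] for i in order]
    ys = [labels[i] for i in order]

    x_bags, y_bags = [], []
    i = 0
    while len(xs) - i > max_bag:
        size = rng.randint(min_bag, max_bag)       # random bag size
        x_bags.append(xs[i:i + size])              # instances in the bag
        y_bags.append(label_fn(ys[i:i + size]))    # bag label from the assumption
        i += size
    x_bags.append(xs[i:])                          # remaining instances form the last bag
    y_bags.append(label_fn(ys[i:]))
    return x_bags, y_bags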
A. Standard
The standard assumption is that a bag will be labelled
as positive if it contains at least one instance of a specific
class of interest. The three selected datasets each have ten
different classes. The classes have different difficulties for
classification. To avoid choosing classes which are overly
easy or hard as the class of interest, I generate datasets for
each class. This provides 30 benchmarks with 10 from each
source. If computationally feasible, evaluators should use all
the benchmarks from a source and aggregate the results.
B. Presence & Absence
The presence assumption labels a bag as positive if any member of a subset of classes is present in the bag. For instance, a bag from MNIST may be positive if either a 0 or a 1 is present. Nine presence benchmarks are included from each dataset, using classes of interest (0,1), (1,2), ..., (8,9).
The absence assumption is the negation of the presence assumption: a bag is positive only if no member of the subset of classes appears in it. The benchmarks include 9 absence benchmarks for the same
classes of interest. Again, evaluators should prefer aggregate
results across these benchmarks.
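As a minimal sketch, the standard, presence, and absence assumptions can be written as bag-labeling predicates over the instance labels; the helper names are mine and not part of the released implementation.

def standard_label(bag_labels, target):
    # standard: positive if at least one instance of the class of interest is present
    return int(target in bag_labels)

def presence_label(bag_labels, targets):
    # presence: positive if any member of the subset, e.g. (0, 1), appears in the bag
    return int(any(y in targets for y in bag_labels))

def absence_label(bag_labels, targets):
    # absence: positive only if no member of the subset appears in the bag
    return int(not any(y in targets for y in bag_labels))

With the target classes bound (for example via functools.partial), any of these predicates can serve as the label_fn argument of the generation sketch above.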
C. Complex
Two datasets are included with a complex assumption and
are based on whether an entire outfit is present in a bag. Table I
contains the definitions of the outfits. The basic outfit assumes
that an outfit contains a top, bottom, and shoes but does not
require a handbag. I consider coats as tops because the two
classes are difficult to differentiate. I also included dresses
with the tops because dresses may be worn with bottoms. The
multi outfit is stricter. It allows for two types of outfits: either a t-shirt or shirt with trousers and sneakers or boots, or a dress with a bag and sandals or boots.
TABLE I
OUTFIT CLASSES

Outfit  Contents
Basic   (0 2 3 4 6) (1) (5 7 9)
Multi   [(0 6) (1) (7 9)] or [(3) (8) (5 9)]
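As a sketch of how the complex labels in Table I can be computed, the slot sets below follow my reading of the table together with the usual Fashion-MNIST class indices; the helper names are mine, not the released implementation.

# Assumed Fashion-MNIST indices: 0 t-shirt/top, 1 trouser, 2 pullover, 3 dress,
# 4 coat, 5 sandal, 6 shirt, 7 sneaker, 8 bag, 9 ankle boot.
BASIC_OUTFIT = [{0, 2, 3, 4, 6}, {1}, {5, 7, 9}]      # top, bottom, shoes
MULTI_OUTFITS = [
    [{0, 6}, {1}, {7, 9}],   # t-shirt/shirt + trousers + sneakers/boots
    [{3}, {8}, {5, 9}],      # dress + bag + sandals/boots
]

def basic_outfit_label(bag_labels):
    present = set(bag_labels)
    # positive when every outfit slot is covered by some instance in the bag
    return int(all(present & slot for slot in BASIC_OUTFIT))

def multi_outfit_label(bag_labels):
    present = set(bag_labels)
    # positive when at least one complete outfit variant is present
    return int(any(all(present & slot for slot in outfit) for outfit in MULTI_OUTFITS))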
III. MODELS
A. Baseline Model
Before tackling the multiple-instance learning techniques, I
train a baseline model which can separate MNIST, Fashion-
MNIST, and CIFAR-10. Perfect performance is not neces-
sary. Given my limited computational resources, it is more
important to have a lightweight model than state-of-the-art
performance.
To build this baseline model, I perform minimal modifica-
tion of the input data. The input images are normalized to
[0..1] rather than [0..255]. And the classes are converted from
nominal to one-hot representation. Table III shows the entire configuration of the baseline model.
Algorithm 1 Dataset Generation Algorithm
Input: x_ix (instance indices), y (instance labels), min_bag = 3, max_bag = 7
Output: x_b, y_b
 1: x_s, y_s = shuffle x_ix and y
 2: x_b, y_b = [], []
 3: i = 0
 4: while |x_s| - i > max_bag do
 5:    s = random integer in range [min_bag, max_bag]
 6:    l = label for y_s[i..i+s] according to the assumption
 7:    append x_s[i..i+s] to x_b
 8:    append l to y_b
 9:    i = i + s
10: end while
11: l = label for y_s[i..] according to the assumption
12: append x_s[i..] to x_b
13: append l to y_b
14: return x_b, y_b
The model trained on
CIFAR-10 receives an F1 of 0.718 and an AUC of 0.963.
On Fashion-MNIST it reports an F1 of 0.921 and an AUC of
0.9961. And on MNIST, an F1 of 0.992 and an AUC of 0.999.
When run on CIFAR-10, the input shape is changed but all
other layers are left identical.
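A minimal TensorFlow/Keras sketch of the Table III architecture follows; it is my own reconstruction, and only the details listed in the table are taken from the paper.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_baseline(input_shape=(28, 28, 1), num_classes=10):
    # Baseline classifier from Table III; for CIFAR-10 use input_shape=(32, 32, 3).
    model = models.Sequential([
        layers.Conv2D(64, kernel_size=2, padding="same", activation="relu",
                      input_shape=input_shape),
        layers.MaxPooling2D(pool_size=2),
        layers.Conv2D(32, kernel_size=3, padding="valid", activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])
    return model

Training for 10 epochs with default Adam settings, as listed in Table III, reproduces the baseline regime.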
B. Adapting for MIL
Before running the baseline model with MIL methods, a
few modifications must be made. The simplest modification
is the number of output classes. Instead of having 10 output
classes, the MIL model has 2. This is changed in the baseline
model architecture.
The baseline model is designed to accept one image at a time and produce a single label. The MIL models must accept 3 to 7 different images and produce a single prediction.
To accomplish this, each bag is padded with zeros so that it
has the same size as a bag with seven instances. While this is
not strictly necessary, it allows training in batches rather than
on a single bag at a time.
Next, each layer from the baseline model is wrapped
in tensorflow.keras.layers.TimeDistributed.
This allows each instance within the bags to propagate through
the baseline network with identical weights. It minimizes the
number of parameters to be trained and also enforces that the
position of the instances in the bags is irrelevant.
With the above described modifications, the model makes
a prediction for each instance in the bag. The different MIL
methods used are focused on taking instance-level predictions
and merging them into a bag-level prediction.
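The sketch below illustrates this adaptation; it is not the exact released code. Bags are zero-padded to seven instances, every baseline layer is wrapped in TimeDistributed, and a MIL head, here the simple max pooling of Model 2, merges the per-instance predictions.

import tensorflow as tf
from tensorflow.keras import layers, models

MAX_BAG = 7  # bags are zero-padded to seven instances

def build_mil_model(instance_shape=(28, 28, 1)):
    inputs = layers.Input(shape=(MAX_BAG,) + instance_shape)

    # Wrap each baseline layer in TimeDistributed so every instance in the
    # bag passes through the same weights.
    x = layers.TimeDistributed(
        layers.Conv2D(64, 2, padding="same", activation="relu"))(inputs)
    x = layers.TimeDistributed(layers.MaxPooling2D(2))(x)
    x = layers.TimeDistributed(
        layers.Conv2D(32, 3, padding="valid", activation="relu"))(x)
    x = layers.TimeDistributed(layers.MaxPooling2D(2))(x)
    x = layers.TimeDistributed(layers.Dropout(0.3))(x)
    x = layers.TimeDistributed(layers.Flatten())(x)
    x = layers.TimeDistributed(layers.Dense(256, activation="relu"))(x)
    x = layers.TimeDistributed(layers.Dropout(0.5))(x)
    # Two instance-level outputs instead of ten classes.
    instance_preds = layers.TimeDistributed(
        layers.Dense(2, activation="softmax"))(x)

    # MIL head (Model 2): max-pool the instance-level predictions over the bag.
    bag_preds = layers.GlobalMaxPooling1D()(instance_preds)

    model = models.Model(inputs, bag_preds)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])
    return model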
C. MIL Models
I selected five MIL models to evaluate. They are:
1) Fully-Connected Layer
2) Max Pool [11]–[13]
3) Max Pool + Fully-Connected Layer
TABLE II
DATASET RATIOS

Assumption  Positive If            Source    Num.  Avg. Train Ratio  Avg. Test Ratio
Standard    x is present           CIFAR-10  10    40.11%            40.20%
                                   Fashion   10    40.25%            40.28%
                                   MNIST     10    40.23%            40.26%
Presence    x or x+1 is present    CIFAR-10   9    65.33%            65.27%
                                   Fashion    9    65.45%            65.55%
                                   MNIST      9    65.56%            65.50%
Absence     x and x+1 are absent   CIFAR-10   9    34.68%            34.78%
                                   Fashion    9    34.55%            34.45%
                                   MNIST      9    34.44%            34.50%
Complex     Contains "Outfit"      Fashion    2    25.51%            25.50%
TABLE III
BASELINE MODEL ARCHITECTURE
Layer Output Shape # of Params Details
Convolution (28, 28, 64) 320 filters=64, kernel size=2, padding=’same’, activation=’relu’
Max Pool (14, 14, 64) 0 pool size=2
Convolution (12, 12, 32) 18,464 filters=32, kernel size=3, padding=’valid’, activation=’relu’
Max Pool (6, 6, 32) 0 pool size=2
Dropout (6, 6, 32) 0 dropout=0.3
Flatten (1152) 0
Dense (256) 295,168 activation=’relu’
Dropout (256) 0 dropout=0.5
Dense (10) 2,570 activation=’softmax’
Total Parameters: 316,522
Optimizer: Adam (default settings)
Loss: Categorical Cross-Entropy
Epochs: 10
4) Avg Pool + Fully-Connected Layer
5) Noisy-And [14]
For Model 1 (Fully Connected), the instance predictions
are flattened and fully connected to the outputs. Model 2
(Max Pool) simply applies a 1D max pooling to the output
predictions. No additional weights are applied. Model 3 is an
extension of Model 2 where a fully-connected layer is added
after the max pool. Model 4 modifies Model 3 by replacing
the Max Pooling with an average pooling. Finally, Model 5
uses a custom layer to apply the noisy-And function defined
in Equation 1.
$$P_i = g_i(\{p_{ij}\}) = \frac{\sigma(a(\bar{p}_{ij} - b_i)) - \sigma(-ab_i)}{\sigma(a(1 - b_i)) - \sigma(-ab_i)} \tag{1}$$

$$\text{where } \bar{p}_{ij} = \frac{1}{|j|}\sum_j p_{ij} \tag{2}$$
The Noisy-And function imitates a probabilistic And. It triggers a bag-level positive prediction ($P_i$) when the mean of instance-level probabilities ($\bar{p}_{ij}$) crosses a learned threshold. The parameter $a$ controls the slope of the activation function; the authors find the best performance at $a = 10$, which we adopt. The $b_i$ are parameters which adjust the threshold for each class.^1
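A minimal custom-layer sketch of Equations 1 and 2 follows; it is my reconstruction based on Kraus et al. [14], not the exact released layer, and it clips the output as described in the footnote.

import tensorflow as tf
from tensorflow.keras import layers

class NoisyAnd(layers.Layer):
    # Noisy-And pooling over instance-level predictions (Equations 1 and 2).

    def __init__(self, a=10.0, **kwargs):
        super().__init__(**kwargs)
        self.a = a  # slope of the activation; the authors recommend a = 10

    def build(self, input_shape):
        num_classes = input_shape[-1]
        # b_i: learned per-class threshold parameters
        self.b = self.add_weight(name="b", shape=(num_classes,),
                                 initializer="zeros", trainable=True)
        super().build(input_shape)

    def call(self, instance_preds):
        # mean instance probability per class over the bag dimension
        p_bar = tf.reduce_mean(instance_preds, axis=1)
        num = tf.sigmoid(self.a * (p_bar - self.b)) - tf.sigmoid(-self.a * self.b)
        den = tf.sigmoid(self.a * (1.0 - self.b)) - tf.sigmoid(-self.a * self.b)
        # clip so the bag-level predictions stay inside [0..1]
        return tf.clip_by_value(num / den, 0.0, 1.0)

In the MIL sketch above, bag_preds = NoisyAnd()(instance_preds) would replace the max-pooling head.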
To keep the training as consistent as possible, random seeds
are always set to 42 [15] and no modifications are made to
loss functions, optimizers, data preprocessing, or any other
hyperparameters.
IV. EXPERIMENTS
A. MIL Task
Table IV contains metrics for each model trained on each
dataset. For many of the benchmarks, Model 1 fails to learn; it ends up predicting negative for all bags. This failure to converge is inconsistent even within a single dataset and assumption. For instance, the model learns the standard assumption for MNIST when the class of interest is [2,3,4,6,7,8,9] but not for [0,1,5]. Likewise, it learned the multi-outfit assumption but failed to learn the Basic Outfit benchmark at all. This reflects an instability in the method.
Model 2 (Max Pool) was able to learn from the benchmarks.
However, it was not always able to generalize well. Several of
the benchmarks posted high training scores but low test scores.
Additionally, there was epoch-to-epoch stochasticity in the F1 and AUC metrics rather than a (more or less) monotonically increasing value.
^1 Under some training circumstances, predictions for this layer fall outside the [0..1] prediction range. The predictions are clipped to avoid this.
TABLE IV
TEST RESULT METRICS

                          Fully-Connected  Max Pool       Max Pool + FC  Avg Pool + FC  Noisy-And
Assumption  Source        F1     AUC       F1     AUC     F1     AUC     F1     AUC     F1     AUC
Standard    CIFAR-10      0.533  0.679     0.546  0.701   0.492  0.663   0.476  0.662   0.766  0.851
            Fashion       0.848  0.849     0.632  0.770   0.728  0.820   0.550  0.724   0.935  0.972
            MNIST         0.803  0.881     0.742  0.795   0.669  0.769   0.660  0.797   0.991  0.997
Presence    CIFAR-10      0.539  0.712     0.463  0.710   0.494  0.716   0.519  0.715   0.624  0.776
            Fashion       0.701  0.834     0.900  0.829   0.504  0.734   0.635  0.795   0.894  0.954
            MNIST         0.824  0.905     0.975  0.873   0.566  0.768   0.546  0.760   0.981  0.995
Absence     CIFAR-10      0.572  0.732     0.462  0.712   0.486  0.707   0.491  0.714   0.622  0.774
            Fashion       0.675  0.810     0.899  0.830   0.518  0.738   0.667  0.818   0.889  0.952
            MNIST         0.867  0.930     0.971  0.855   0.658  0.818   0.586  0.774   0.980  0.996
Complex     Basic Outfit  0.416  0.713     0.798  0.764   0.416  0.713   0.416  0.713   0.867  0.948
Complex     Multi Outfit  0.849  0.944     0.764  0.799   0.416  0.713   0.416  0.713   0.867  0.945
Fig. 2. Noisy-And Performance on Noisy Labels
Despite posting lower scores than Model 1,
Model 2 learns more consistently.
When a dense layer is added after the pooling layer (Model
3), the downsides of both Model 1 and Model 2 are dimin-
ished, but are not entirely mitigated. Model 3 fails to converge
on more datasets than Model 2, but does have significantly
less of a gap between training and test metrics in most
cases. However, the model fails to learn from the Complex
benchmark and posts the minimum scores of 0.416 F1 and
0.713 AUC.
Swapping the max pool layer for an avg pool (Model 4)
has mixed results. Sometimes it performs better, sometimes it
performs worse. In most cases it has no effect. On some MIL
tasks it may be useful to swap Max for Avg pooling layers in
an attempt to improve performance. Again, the model failed
to learn from the Complex benchmark.
Finally, Model 5 (Noisy-And) is extremely successful. It
learns well on all of the datasets and posts the highest
results on nearly all benchmarks. The layer has been designed
specifically to solve MIL tasks and its results show that well.
B. MIL with Noise
Model 5 performs well on all of the datasets. In order to
better understand the capabilities of the Noisy-And network,
I adapt the benchmarks to include noise in the labels. This is
accomplished by randomly flipping bag labels with a probability
equal to the level of noise. I then retrain the model on noise
levels [0,0.45] incremented by 0.05 with a single benchmark
from the MNIST and CIFAR-10 sources (classes 0 and 5
respectively) and assess the performance at each step. AUC
is an appropriate metric for this evaluation as it is insensitive
to changes in class distribution [16]. The results are displayed
in Figure 2.
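A short sketch of the label-flipping procedure (the function name is mine): each binary bag label is flipped independently with probability equal to the noise level.

import numpy as np

def flip_labels(y, noise_level, seed=42):
    # Flip each binary bag label independently with probability noise_level.
    rng = np.random.default_rng(seed)
    y = np.asarray(y).copy()
    flip = rng.random(len(y)) < noise_level   # mask of labels to corrupt
    y[flip] = 1 - y[flip]                     # 0 -> 1, 1 -> 0
    return y

# Example: corrupt roughly 20% of the bag labels in a toy label vector.
print(flip_labels([0, 1, 1, 0, 1, 0, 0, 1], noise_level=0.20))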
For the MNIST dataset, the model learns with only slight
degradation of performance until the noise level passes 0.35.
Even with 45% of the labels flipped, the model receives an
AUC of 0.706. Training on the CIFAR-10 dataset produces
different results. The AUC degrades roughly in line with
the amount of noise which has been added to the labels. I
hypothesize that the baseline network with the Noisy-And
layer is extremely capable of classifying MNIST digits and
thus can handle large amounts of noise. On the other hand,
the model scores lower on CIFAR-10 without label noise and
because of this we see the drop in AUC right away. More
testing is necessary to confirm this hypothesis.
V. CONCLUSION
This paper makes three contributions. First, it creates and
distributes 86 MIL benchmark datasets. These datasets span
three sources and provide a means to benchmark assumptions
which are not covered by existing datasets. In addition, I
have released a Python module which provides easy access
to these datasets and is available on PyPi as mil-benchmarks
at https://pypi.org/project/mil-benchmarks/. This module pro-
vides a platform for distribution of future MIL benchmarks.
Currently, the mil-benchmarks module stores datasets as
CSVs. Future work may reduce the storage size of the module
by recreating these CSVs using reproducible random numbers.
Additional benchmarks may be created from more complicated
source data, such as:
1) CIFAR-100 [10]
2) Street View House Numbers [17]
3) miniImageNet [18]
Second, five methods were evaluated against the MIL
benchmarks. These evaluations show the inadequacy of stan-
dard fully-connected and standard pooling layers for use in
the MIL task domain. However, they show that the Noisy-And MIL
layer performs much better while using the same underlying
classification model.
Future research may explore more assumptions under which
bags may be labelled. Assumptions of interest include labeling
bags as positive if:
1) A class’s representation crosses some threshold.
2) One or another class is present, but not both.
3) A more complicated boolean expression holds.
Finally, Noisy-And was found to produce mixed results under different circumstances. I hypothesize that this is related to the performance of the model on a noise-free dataset: a model which performs well without noise remains highly effective in the presence of noisy labels.
The models, layer implementations, and Jupyter Notebooks
to reproduce the results of this paper are available on GitHub
at https://github.com/dgrahn/deep-mil.
REFERENCES
[1] B. Babenko, "Multiple instance learning: Algorithms and applications," pp. 1–19, 2008.
[2] J. R. Foulds and E. Frank, “A review of multi-instance
learning assumptions,” 2010.
[3] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, "Solving the multiple instance problem with axis-parallel rectangles," Artificial Intelligence, vol. 89, no. 1–2, pp. 31–71, 1997.
[4] B. Behmardi, F. Briggs, X. Z. Fern, and R. Raich,
“Confidence-constrained maximum entropy framework
for learning from multi-instance data,” arXiv preprint
arXiv:1603.01901, 2016.
[5] M. Paramanandam, M. O'Byrne, B. Ghosh, J. J. Mammen, M. T. Manipadam, R. Thamburaj, and V. Pakrashi, "Automated segmentation of nuclei in breast cancer histopathology images," PLoS ONE, vol. 11, no. 9, e0162053, 2016.
[6] M.-A. Carbonneau, V. Cheplygina, E. Granger, and G.
Gagnon, “Multiple instance learning: A survey of prob-
lem characteristics and applications,” Pattern Recogni-
tion, vol. 77, pp. 329–353, 2018.
[7] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner,
“Gradient-based learning applied to document recog-
nition,” Proceedings of the IEEE, vol. 86, no. 11,
pp. 2278–2324, 1998.
[9] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: A
novel image dataset for benchmarking machine learning
algorithms,” arXiv preprint arXiv:1708.07747, 2017.
[10] A. Krizhevsky, G. Hinton, et al., “Learning multiple
layers of features from tiny images,” 2009.
[11] T.-W. Su, J.-Y. Liu, and Y.-H. Yang, “Weakly-
supervised audio event detection using event-specific
gaussian filters and fully convolutional networks,”
in 2017 IEEE international conference on acoustics,
speech and signal processing (ICASSP), IEEE, 2017,
pp. 791–795.
[12] A. Kumar and B. Raj, “Audio event detection us-
ing weakly labeled data,” in Proceedings of the 24th
ACM international conference on Multimedia, 2016,
pp. 1038–1047.
[13] J. Salamon, B. McFee, P. Li, and J. P. Bello, "DCASE 2017 submission: Multiple instance learning for sound event detection," Detection and Classification of Acoustic Scenes and Events, vol. 2017, 2017.
[14] O. Z. Kraus, J. L. Ba, and B. J. Frey, “Classifying
and segmenting microscopy images with deep multi-
ple instance learning,” Bioinformatics, vol. 32, no. 12,
pp. i52–i59, 2016.
[15] D. Adams, The Ultimate Hitchhiker’s Guide to the
Galaxy: The Complete Trilogy in Five Parts. Pan
Macmillan, 2017, vol. 6.
[16] P. A. Flach and M. Kull, “Precision-recall-gain curves:
Pr analysis done right.,” in NIPS, vol. 15, 2015.
[17] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu,
and A. Y. Ng, “Reading digits in natural images with
unsupervised feature learning,” 2011.
[18] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu,
and D. Wierstra, “Matching networks for one shot
learning,” arXiv preprint arXiv:1606.04080, 2016.
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
Multiple instance learning (MIL) is a form of weakly supervised learning where training instances are arranged in sets, called bags, and a label is provided for the entire bag. This formulation is gaining interest because it naturally fits various problems and allows to leverage weakly labeled data. Consequently, it has been used in diverse application fields such as computer vision and document classification. However, learning from bags raises important challenges that are unique to MIL. This paper provides a comprehensive survey of the characteristics which define and differentiate the types of MIL problems. Until now, these problem characteristics have not been formally identified and described. As a result, the variations in performance of MIL algorithms from one data set to another are difficult to explain. In this paper, MIL problem characteristics are grouped into four broad categories: the composition of the bags, the types of data distribution, the ambiguity of instance labels, and the task to be performed. Methods specialized to address each category are reviewed. Then, the extent to which these characteristics manifest themselves in key MIL application areas are described. Finally, experiments are conducted to compare the performance of 16 state-of-the-art MIL methods on selected problem characteristics. This paper provides insight on how the problem characteristics affect MIL algorithms, recommendations for future benchmarking and promising avenues for research.
Conference Paper
Full-text available
Acoustic event detection is essential for content analysis and description of multimedia recordings. The majority of current literature on the topic learns the detectors through fully-supervised techniques employing strongly labeled data. However, the labels available for online multimedia data are generally weak and do not provide sufficient detail for such methods to be employed. In this paper we propose a framework for learning acoustic event detectors using only weakly labeled data based on a Multiple Instance Learning (MIL) framework. We first show that audio event detection using weak data can be formulated as an MIL problem. We then suggest two frameworks for solving multiple-instance learning, one based on neural networks, and the second on support vector machines. The proposed methods can help in removing the time consuming and expensive process of manually annotating data to facilitate fully supervised learning. Our proposed framework can not only successfully detect events in a recording but can also provide temporal locations of events in the recording. This is interesting as these information were never known in the first place for weakly labeled data.
Article
Full-text available
The process of Nuclei detection in high-grade breast cancer images is quite challenging in the case of image processing techniques due to certain heterogeneous characteristics of cancer nuclei such as enlarged and irregularly shaped nuclei, highly coarse chromatin marginalized to the nuclei periphery and visible nucleoli. Recent reviews state that existing techniques show appreciable segmentation accuracy on breast histopathology images whose nuclei are dispersed and regular in texture and shape; however, typical cancer nuclei are often clustered and have irregular texture and shape properties. This paper proposes a novel segmentation algorithm for detecting individual nuclei from Hematoxylin and Eosin (H&E) stained breast histopathology images. This detection framework estimates a nuclei saliency map using tensor voting followed by boundary extraction of the nuclei on the saliency map using a Loopy Back Propagation (LBP) algorithm on a Markov Random Field (MRF). The method was tested on both whole-slide images and frames of breast cancer histopathology images. Experimental results demonstrate high segmentation performance with efficient precision, recall and dice-coefficient rates, upon testing high-grade breast cancer images containing several thousand nuclei. In addition to the optimal performance on the highly complex images presented in this paper, this method also gave appreciable results in comparison with two recently published methods-Wienert et al. (2012) and Veta et al. (2013), which were tested using their own datasets.
Article
Full-text available
Motivation: High-content screening (HCS) technologies have enabled large scale imaging experiments for studying cell biology and for drug screening. These systems produce hundreds of thousands of microscopy images per day and their utility depends on automated image analysis. Recently, deep learning approaches that learn feature representations directly from pixel intensity values have dominated object recognition challenges. These tasks typically have a single centered object per image and existing models are not directly applicable to microscopy datasets. Here we develop an approach that combines deep convolutional neural networks (CNNs) with multiple instance learning (MIL) in order to classify and segment microscopy images using only whole image level annotations. Results: We introduce a new neural network architecture that uses MIL to simultaneously classify and segment microscopy images with populations of cells. We base our approach on the similarity between the aggregation function used in MIL and pooling layers used in CNNs. To facilitate aggregating across large numbers of instances in CNN feature maps we present the Noisy-AND pooling function, a new MIL operator that is robust to outliers. Combining CNNs with MIL enables training CNNs using whole microscopy images with image level labels. We show that training end-to-end MIL CNNs outperforms several previous methods on both mammalian and yeast datasets without requiring any segmentation steps. Availability and implementation: Torch7 implementation available upon request. Contact: oren.kraus@mail.utoronto.ca
Article
Detecting and reading text from natural images is a hard computer vision task that is central to a variety of emerging applications. Related problems like document character recognition have been widely studied by computer vision and machine learning researchers and are virtually solved for practical applications like reading handwritten digits. Reliably recognizing characters in more complex scenes like photographs, however, is far more difficult: the best existing methods lag well behind human performance on the same tasks. In this paper we attack the prob-lem of recognizing digits in a real application using unsupervised feature learning methods: reading house numbers from street level photos. To this end, we intro-duce a new benchmark dataset for research use containing over 600,000 labeled digits cropped from Street View images. We then demonstrate the difficulty of recognizing these digits when the problem is approached with hand-designed fea-tures. Finally, we employ variants of two recently proposed unsupervised feature learning methods and find that they are convincingly superior on our benchmarks.
Article
Traditional supervised learning requires a training data set that consists of inputs and corre-sponding labels. In many applications, however, it is difficult or even impossible to accurately and consistently assign labels to inputs. A relatively new learning paradigm called Multi-ple Instance Learning allows the training of a classifier from ambiguously labeled data. This paradigm has been receiving much attention in the last several years, and has many useful applications in a number of domains (e.g. computer vision, computer audition, bioinformat-ics, text processing). In this report we review several representative algorithms that have been proposed to solve this problem. Furthermore, we discuss a number of existing and potential applications, and how well the currently available algorithms address the problems presented by these applications.
Article
The multiple instance problem arises in tasks where the training examples are ambiguous: a single example object may have many alternative feature vectors (instances) that describe it, and yet only one of those feature vectors may be responsible for the observed classification of the object. This paper describes and compares three kinds of algorithms that learn axis-parallel rectangles to solve the multiple instance problem. Algorithms that ignore the multiple instance problem perform very poorly. An algorithm that directly confronts the multiple instance problem (by attempting to identify which feature vectors are responsible for the observed classifications) performs best, giving 89% correct predictions on a musk odor prediction task. The paper also illustrates the use of artificial data to debug and compare these algorithms.