OverFeat:
Integrated Recognition, Localization and Detection
using Convolutional Networks
Pierre Sermanet David Eigen
Xiang Zhang Michael Mathieu Rob Fergus Yann LeCun
Courant Institute of Mathematical Sciences, New York University
719 Broadway, 12th Floor, New York, NY 10003
{sermanet,deigen,xiang,mathieu,fergus,yann}@cs.nyu.edu
Abstract
We present an integrated framework for using Convolutional Networks for classi-
fication, localization and detection. We show how a multiscale and sliding window
approach can be efficiently implemented within a ConvNet. We also introduce a
novel deep learning approach to localization by learning to predict object bound-
aries. Bounding boxes are then accumulated rather than suppressed in order to
increase detection confidence. We show that different tasks can be learned simul-
taneously using a single shared network. This integrated framework is the winner
of the localization task of the ImageNet Large Scale Visual Recognition Challenge
2013 (ILSVRC2013) and obtained very competitive results for the detection and
classification tasks. In post-competition work, we establish a new state of the art
for the detection task. Finally, we release a feature extractor from our best model
called OverFeat.
1 Introduction
Recognizing the category of the dominant object in an image is a task to which Convolutional
Networks (ConvNets) [17] have been applied for many years, whether the objects were handwrit-
ten characters [16], house numbers [24], textureless toys [18], traffic signs [3, 26], objects from the
Caltech-101 dataset [14], or objects from the 1000-category ImageNet dataset [15]. The accuracy
of ConvNets on small datasets such as Caltech-101, while decent, has not been record-breaking.
However, the advent of larger datasets has enabled ConvNets to significantly advance the state of
the art on datasets such as the 1000-category ImageNet [5].
The main advantage of ConvNets for many such tasks is that the entire system is trained end to
end, from raw pixels to ultimate categories, thereby alleviating the requirement to manually design
a suitable feature extractor. The main disadvantage is their ravenous appetite for labeled training
samples.
The main point of this paper is to show that training a convolutional network to simultaneously
classify, locate and detect objects in images can boost the classification accuracy and the detection
and localization accuracy of all tasks. The paper proposes a new integrated approach to object
detection, recognition, and localization with a single ConvNet. We also introduce a novel method for
localization and detection by accumulating predicted bounding boxes. We suggest that by combining
many localization predictions, detection can be performed without training on background samples
and that it is possible to avoid the time-consuming and complicated bootstrapping training passes.
Not training on background also lets the network focus solely on positive classes for higher accuracy.
Experiments are conducted on the ImageNet LSVRC 2012 and 2013 datasets and establish state of
the art results on the ILSVRC 2013 localization and detection tasks.
While images from the ImageNet classification dataset are largely chosen to contain a roughly-
centered object that fills much of the image, objects of interest sometimes vary significantly in size
and position within the image. The first idea is to apply a ConvNet at multiple locations in the image,
in a sliding window fashion, and over multiple scales. One problem with this approach is that some
viewing windows may contain a perfectly identifiable portion of the object (say, the head of a dog)
while not containing the entire object and while not being centered on it. The second idea is to train
the system to not only produce a distribution over categories for each window, but also to produce a
prediction of the location and size of the bounding box containing the object relative to that of the
viewing window. The third idea is to accumulate the evidence for each category at each location
and size.
Many authors have proposed to use ConvNets for detection and localization with a sliding window
over multiple scales, going back to the early 1990’s for multi-character strings [20], faces [30], and
hands [22]. More recently, ConvNets have been shown to yield state of the art performance on text
detection in natural images [4], face detection [8, 23] and pedestrian detection [25].
Several authors have also proposed to train ConvNets to directly predict the instantiation parameters
of the objects to be located, such as the position relative to the viewing window, or the pose of
the object. For example Osadchy et al. [23] describe a ConvNet for simultaneous face detection
and pose estimation. Faces are represented by a 3D manifold in the nine-dimensional output space.
Positions on the manifold indicate the pose (pitch, yaw, and roll). When the training image is a
face, the network is trained to produce a point on the manifold at the location of the known pose.
If the image is not a face, the output is pushed away from the manifold. At test time, the distance
to the manifold indicates whether the image contains a face, and the position of the closest point on
the manifold indicates pose. Taylor et al. [27, 28] use a ConvNet to estimate the location of body
parts (hands, head, etc) so as to derive the human body pose. They use a metric learning criterion
to train the network to produce points on a body pose manifold. Hinton et al. have also proposed
to train networks to compute explicit instantiation parameters of features as part of a recognition
process [12].
Other authors have proposed to perform object localization via ConvNet-based segmentation. The
simplest approach consists in training the ConvNet to classify the central pixel (or voxel for vol-
umetric images) of its viewing window as a boundary between regions or not [13]. But when the
regions must be categorized, it is preferable to perform semantic segmentation. The main idea is to
train the ConvNet to classify the central pixel of the viewing window with the category of the ob-
ject it belongs to, using the window as context for the decision. Applications range from biological
image analysis [21], to obstacle tagging for mobile robots [10] to tagging of photos [7]. The ad-
vantage of this approach is that the bounding contours need not be rectangles, and the regions need
not be well-circumscribed objects. The disadvantage is that it requires dense pixel-level labels for
training. This segmentation pre-processing or object proposal step has recently gained popularity in
traditional computer vision to reduce the search space of position, scale and aspect ratio for detec-
tion [19, 2, 6, 29]. Hence an expensive classification method can be applied at the optimal location
in the search space, thus increasing recognition accuracy. Additionally, [29, 1] suggest that these
methods improve accuracy by drastically reducing unlikely object regions, hence reducing potential
false positives. Our dense sliding window method however outperforms object proposal methods on
the ILSVRC13 detection dataset.
Krizhevsky et al. [15] recently demonstrated impressive classification performance using a large
ConvNet. The authors also entered the ImageNet 2012 competition, winning both the classification
and localization challenges. Although they demonstrated an impressive localization performance,
there has been no published work describing how their approach works. Our paper is thus the first to
provide a clear explanation of how ConvNets can be used for localization and detection for ImageNet
data.
In this paper we use the terms localization and detection in a way that is consistent with their use in
the ImageNet 2013 competition, namely that the only difference is the evaluation criterion used and
both involve predicting the bounding box for each object in the image.
Figure 1: Localization (top) and detection tasks (bottom). The left images contain our predic-
tions (ordered by decreasing confidence) while the right images show the groundtruth labels. The
detection image (bottom) illustrates the higher difficulty of the detection dataset, which can contain
many small objects while the classification and localization images typically contain a single large
object.
2 Vision Tasks
In this paper, we explore the following computer vision tasks by increasing order of difficulty: classi-
fication, localization and detection. Each task is a sub-task of the next. While all tasks were addressed
using a single framework and a shared feature learning base, we will describe them separately in the
following sections.
Throughout the paper, we report results on the 2013 ImageNet Large Scale Visual Recognition
Challenge (ILSVRC2013). Here is an overview of the challenge data and measures. For the classi-
fication task, each image is assigned a single label corresponding to the main object in the image.
Five guesses are allowed to find the correct answer because images can contain multiple unlabeled
objects. Localization is similar to classification in that 5 guesses are allowed per image. But addi-
tionally, a bounding box of the main object must be returned and must match the groundtruth by
at least 50% (using the PASCAL intersection-over-union criterion). Each returned bounding box must be
labeled with the correct class, i.e. bounding boxes and labels are not dissociated. The detection task
differs from localization in that there can be any number of objects in each image (including zero), and
that false positives are penalized by the mean average precision (mAP) measure. The localization
task is a convenient intermediate step between classification and detection in order to evaluate a lo-
calization method independently of challenges specific to detection (such as learning a background
class for instance). In Fig. 1, we show examples of localization and detection images with groundtruths and
our predictions. Note that classification and localization share the same dataset, while detection con-
tains additional data where objects can be smaller. The detection data also contains a set of images
that do not contain certain objects. This can be used for bootstrapping, but we have not made use of
it in this work.
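To make the matching criterion concrete, here is a minimal sketch of the 50% overlap test (our own illustration, not code from the paper; boxes are assumed to be given in (x1, y1, x2, y2) corner format):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def localization_hit(pred_box, pred_label, gt_box, gt_label):
    """A localization guess counts only if the label matches and IoU >= 0.5."""
    return pred_label == gt_label and iou(pred_box, gt_box) >= 0.5
```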
3 Classification
Our classification architecture is similar to the best ILSVRC12 architecture by Krizhevsky et al. [15].
However, we improve on the network design and the inference step. Because of time constraints,
some of the training features in Krizhevsky’s model were not explored, so it is expected that
results can be improved even further. These are discussed in the future work section (Section 6).
Figure 2: Layer 1 (top) and layer 2 filters (bottom).
3.1 Model Design and Training
We train the network on the ImageNet 2012 training set (1.2 million images and C= 1000 classes)
[5]. Our model uses the same fixed input size approach proposed by Krizhevsky et al. [15] during
training but turns to multi-scale for classification as described in the next section. Each image is
downsampled so that the smallest dimension is 256 pixels. We then extract 5 random crops (and
their horizontal flips) of size 221x221 pixels and present these to the network in mini-batches of
size 128. The weights in the network are initialized randomly with (µ, σ) = (0, 1×10⁻²). They
are then updated by stochastic gradient descent, accompanied by a momentum term of 0.6 and an ℓ2
weight decay of 1×10⁻⁵. The learning rate is initially 5×10⁻² and is successively decreased by
a factor of 0.5 after (30, 50, 60, 70, 80) epochs. DropOut [11] with a rate of 0.5 is employed on the
fully connected layers (6th and 7th) in the classifier.
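As a concrete reading of these hyper-parameters, the following sketch (assuming a PyTorch-style setup; `model` is only a placeholder for the network of Table 1) shows the weight initialization, optimizer and learning-rate schedule just described:

```python
import torch
import torch.nn as nn

def init_weights(module):
    """Random initialization with (mu, sigma) = (0, 1e-2), as described above."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(module.weight, mean=0.0, std=1e-2)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

def lr_at_epoch(epoch, base_lr=5e-2, milestones=(30, 50, 60, 70, 80), gamma=0.5):
    """Initial learning rate of 5e-2, halved after epochs 30, 50, 60, 70 and 80."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

def make_optimizer(model):
    # SGD with a momentum term of 0.6 and an L2 weight decay of 1e-5.
    return torch.optim.SGD(model.parameters(), lr=5e-2, momentum=0.6, weight_decay=1e-5)
```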
We detail the architecture sizes in tables 1 and 2. Note that during training, we treat this architecture
as non-spatial (output maps of size 1x1) as opposed to the inference step which produces spatial
outputs. Layers 1-5 are similar to Krizhevsky et al. [15], using rectification (“relu”) non-linearities
and max pooling, but with the following differences: (i) no contrast normalization is used; (ii)
pooling regions are non-overlapping and (iii) our model has larger 1st and 2nd layer feature maps,
thanks to a smaller stride (2 instead of 4). A larger stride is beneficial for speed but will hurt accuracy.
Layer               1           2           3        4        5           6     7     8 (output)
Stage               conv + max  conv + max  conv     conv     conv + max  full  full  full
# channels          96          256         512      1024     1024        3072  4096  1000
Filter size         11x11       5x5         3x3      3x3      3x3         -     -     -
Conv. stride        4x4         1x1         1x1      1x1      1x1         -     -     -
Pooling size        2x2         2x2         -        -        2x2         -     -     -
Pooling stride      2x2         2x2         -        -        2x2         -     -     -
Zero-Padding size   -           -           1x1x1x1  1x1x1x1  1x1x1x1     -     -     -
Spatial input size  231x231     24x24       12x12    12x12    12x12       6x6   1x1   1x1
Table 1: Architecture specifics for model A (or “fast” model). The spatial size of the feature maps
depends on the input image size, which varies during our inference step – see Table 4. Here we show
training spatial sizes. Note that layer 5 is the top convolutional layer, with subsequent layers being
fully connected and used as a classifier which is applied in sliding window fashion to the layer 5
maps. These fully-connected layers can be seen as 1x1 convolutions in a spatial setting.
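For readers who prefer code, here is a hedged PyTorch-style sketch of model A. Channel counts, filter sizes and strides follow Table 1; padding choices beyond the table (and therefore exact spatial sizes) are assumptions, and the classifier is written convolutionally so it can slide over larger layer-5 maps at inference:

```python
import torch.nn as nn

class OverFeatFastSketch(nn.Module):
    """Rough rendering of the 'fast' model of Table 1 (not the released implementation)."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            # layer 1: 96 maps, 11x11 filters, stride 4, 2x2 non-overlapping pooling
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # layer 2: 256 maps, 5x5 filters
            nn.Conv2d(96, 256, kernel_size=5), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # layers 3-5: 3x3 filters with 1-pixel zero padding
            nn.Conv2d(256, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 1024, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(1024, 1024, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        # layers 6-8: fully connected at training time (6x6 input window per Table 1),
        # expressed here as convolutions so they act as a sliding-window classifier.
        self.classifier = nn.Sequential(
            nn.Conv2d(1024, 3072, kernel_size=6), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Conv2d(3072, 4096, kernel_size=1), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Conv2d(4096, num_classes, kernel_size=1),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```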
Layer               1           2           3        4        5        6           7     8     9 (output)
Stage               conv + max  conv + max  conv     conv     conv     conv + max  full  full  full
# channels          96          256         512      512      1024     1024        4096  4096  1000
Filter size         7x7         7x7         3x3      3x3      3x3      3x3         -     -     -
Conv. stride        2x2         1x1         1x1      1x1      1x1      1x1         -     -     -
Pooling size        3x3         2x2         -        -        -        3x3         -     -     -
Pooling stride      3x3         2x2         -        -        -        3x3         -     -     -
Zero-Padding size   -           -           1x1x1x1  1x1x1x1  1x1x1x1  1x1x1x1     -     -     -
Spatial input size  221x221     36x36       15x15    15x15    15x15    15x15       5x5   1x1   1x1
Table 2: Architecture specifics for model B (or “slow” model). It differs from model A mainly
in the stride of the first convolution, the number of stages and the number of feature maps.
model # parameters (in millions) # connections (in millions)
Krizhevsky 60 -
A 145 2810
B 144 5369
Table 3: Number of parameters and connections for different models.
In Fig. 2, we show the filter coefficients from the first two convolutional layers. The first layer filters
capture oriented edges, patterns and blobs. In the second layer, the filters have a variety of forms,
some diffuse, others with strong line structures or oriented edges.
3.2 Feature Extractor
Along with this paper, we release a feature extractor dubbed “OverFeat”¹ in order to provide pow-
erful features for computer vision research. Two models are provided, a fast and slow one. Each
architecture is described in tables 1 and 2. We also compare their sizes in Table 3 in terms of pa-
rameters and connections. The slow model is more accurate than the fast one (14.18% classification
error as opposed to 16.39% in Table 5); however, it requires nearly twice as many connections. Using
a committee of 7 slow models reaches 13.6% classification error as shown in Fig. 4.
3.3 Multi-Scale Classification
In [15], multi-view voting is used to boost performance: a fixed set of 10 views (4 corners and center,
with horizontal flip) is averaged. Not only may this approach ignore some regions of the image, it
may also be computationally redundant if views overlap. Additionally, it is only applied at a single
scale, which may not be the scale at which the ConvNet will respond with optimal confidence.
Instead, we explore the entire image by densely running the network at each location and at multiple
¹ http://cilvr.nyu.edu/doku.php?id=software:overfeat:start
scales. While the sliding window approach may be computationally prohibitive for certain types
of model, it is inherently efficient in the case of ConvNets (see section 3.5). This approach yields
significantly more views for voting, which increases robustness while remaining computationally
efficient. The result of convolving a ConvNet on an image of arbitrary size is a spatial map of
C-dimensional vectors at each scale.
The total subsampling ratio in the network described above is 2x3x2x3, or 36. Hence when ap-
plied densely, this architecture can only produce a classification vector every 36 pixels in the input
dimension along each axis. This coarse distribution of outputs decreases performance compared
to the 10-view scheme because the network windows are not well aligned with the objects in the
images. The better aligned the network window and the object, the stronger the confidence of the
network response. To circumvent this problem, we take the approach introduced by Giusti et al. [9]
by avoiding the last subsampling operation (x3), yielding a subsampling ratio of x12 instead of x36.
We now explain in detail how the resolution augmentation is performed. We use 6 scales of input
which result in unpooled layer 5 maps of varying resolution (see Table 4 for details). These are
then pooled and presented to the classifier using the following procedure, which is accompanied by
Fig. 3:
(a) For a single image, at a given scale, we start with the unpooled layer 5 feature maps.
(b) Each of the unpooled maps undergoes a 3x3 max pooling operation (non-overlapping regions),
repeated 3x3 times for (∆x, ∆y) pixel offsets of {0, 1, 2}.
(c) This produces a set of pooled feature maps, replicated (3x3) times for different (∆x, ∆y) com-
binations.
(d) The classifier (layers 6,7,8) has a fixed input size of 5x5 and produces a C-dimensional output
vector for each location within the pooled maps. The classifier is applied in sliding-window
fashion to the pooled maps, yielding C-dimensional output maps (for a given (∆x, ∆y) combi-
nation).
(e) The output maps for different (∆x, ∆y) combinations are reshaped into a single 3D output map
(two spatial dimensions x C classes).
Figure 3: 1D illustration (to scale) of output map computation for classification, using y-dimension
from scale 2 as an example (see Table 4). (a): 20 pixel unpooled layer 5 feature map. (b): max
pooling over non-overlapping 3 pixel groups, using offsets of ∆ = {0, 1, 2} pixels (red, green, blue
respectively). (c): The resulting 6 pixel pooled maps, for different ∆. (d): 5 pixel classifier (layers
6,7) is applied in sliding window fashion to pooled maps, yielding 2 pixel by C maps for each ∆.
(e): reshaped into 6 pixel by C output maps.
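The offset pooling of steps (a)-(c) can be sketched as follows (our own illustration, assuming PyTorch tensors of shape channels x height x width):

```python
import torch
import torch.nn.functional as F

def dense_pool_with_offsets(layer5, pool=3):
    """Max-pool an unpooled layer-5 map (C x H x W) once per (dx, dy) offset in {0, 1, 2},
    as in steps (a)-(c) above. Returns a dict mapping (dx, dy) -> pooled map."""
    out = {}
    for dy in range(pool):
        for dx in range(pool):
            shifted = layer5[:, dy:, dx:]              # drop the first dy rows / dx columns
            h = (shifted.shape[1] // pool) * pool      # crop to a multiple of the pooling size
            w = (shifted.shape[2] // pool) * pool
            pooled = F.max_pool2d(shifted[None, :, :h, :w], kernel_size=pool, stride=pool)
            out[(dx, dy)] = pooled[0]
    return out

# Scale-2 shapes from Table 4: a 256-channel 20x23 unpooled map gives nine 6x7 pooled maps.
pooled = dense_pool_with_offsets(torch.randn(256, 20, 23))
print(pooled[(0, 0)].shape)   # torch.Size([256, 6, 7])
```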
These operations can be viewed as shifting the classifier’s viewing window by 1 pixel through pool-
ing layers without subsampling and using skip-kernels in the following layer (where values in the
neighborhood are non-adjacent).
The procedure above is repeated for the horizontally flipped version of each image. We then produce
the final classification by (i) taking the spatial max for each class, at each scale and flip; (ii) averaging
the resulting C-dimensional vectors from different scales and flips and (iii) taking the top-1 or top-5
elements (depending on the evaluation criterion) from the mean class vector.
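A compact sketch of this voting procedure (our own illustration; each entry of `output_maps` is assumed to be a C x H x W map of class scores for one scale/flip combination):

```python
import torch

def vote_over_scales(output_maps, k=5):
    """Combine spatial class-score maps from several scales and flips into top-k classes,
    following steps (i)-(iii) above."""
    per_view = [m.flatten(1).max(dim=1).values for m in output_maps]   # (i) spatial max per class
    mean_scores = torch.stack(per_view).mean(dim=0)                    # (ii) average across views
    return mean_scores.topk(k).indices                                 # (iii) top-k classes
```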
The scheme described above has several notable properties. First, the two halves of the network,
i.e. the feature extraction layers (1-5) and classifier layers (6-output), are used in opposite ways.
In the feature extraction portion, the filters are convolved across the entire image in one pass. From a
Scale  Input size  Layer 5 pre-pool  Layer 5 post-pool  Classifier map (pre-reshape)  Classifier map size
1      245x245     17x17             (5x5)x(3x3)        (1x1)x(3x3)xC                 3x3xC
2      281x317     20x23             (6x7)x(3x3)        (2x3)x(3x3)xC                 6x9xC
3      317x389     23x29             (7x9)x(3x3)        (3x5)x(3x3)xC                 9x15xC
4      389x461     29x35             (9x11)x(3x3)       (5x7)x(3x3)xC                 15x21xC
5      425x497     32x35             (10x11)x(3x3)      (6x7)x(3x3)xC                 18x24xC
6      461x569     35x44             (11x14)x(3x3)      (7x10)x(3x3)xC                21x30xC
Table 4: Spatial dimensions of our multi-scale approach. 6 different sizes of input images are
used, resulting in layer 5 unpooled feature maps of differing spatial resolution (although not indi-
cated in the table, all have 256 feature channels). The (3x3) results from our dense pooling operation
with (∆x, ∆y) = {0, 1, 2}. See text and Fig. 3 for details of how these are converted into output
maps.
computational perspective, this is far more efficient than sliding a fixed-size feature extractor over
the image and then aggregating the results from different locations². However, these principles are
reversed for the classifier portion of the network. Here, we want to hunt for a fixed-size represen-
tation in the layer 5 feature maps across different positions and scales. Thus the classifier has a
fixed-size 5x5 input and is exhaustively applied to the layer 5 maps. Second, the overlapping pool-
ing scheme (with single pixel shifts (∆x, ∆y)) ensures that we can obtain fine alignment between
the classifier and the representation of the object in the feature map input. Third, our pooling scheme
is similar to Giusti et al. [9] who shift the classifier’s viewing window by 1 pixel through pooling
layers without subsampling and use skip-kernels in the following layer (where values in the neigh-
borhood are non-adjacent). Finally, the dense manner in which the classifier is applied also helps
to improve performance. We explore this in Section 3.4, where we enable/disable the pixel shifts to
reveal their performance contribution.
3.4 Results
In Table 5, we experiment with different approaches and for reference compare them to the single
network model of Krizhevsky et al. [15]. The approach described above, with 6 scales, achieves a
top-5 error rate of 13.6%. As might be expected, using fewer scales hurts performance: the single-
scale model is worse, with 16.97% top-5 error. The fine stride technique illustrated in Fig. 3 brings a
relatively small improvement in the single scale regime, but is also of importance for the multi-scale
gains shown here.
Approach                                                   Top-1 error %  Top-5 error %
Krizhevsky et al. [15] 40.7 18.2
OverFeat - 1 fast model, scale 1, coarse stride 39.28 17.12
OverFeat - 1 fast model, scale 1, fine stride 39.01 16.97
OverFeat - 1 fast model, 4 scales (1,2,4,6), fine stride 38.57 16.39
OverFeat - 1 fast model, 6 scales (1-6), fine stride 38.12 16.27
OverFeat - 1 big model, 4 corners + center + flip 35.60 14.71
OverFeat - 1 big model, 4 scales, fine stride 35.74 14.18
OverFeat - 7 fast models, 4 scales, fine stride 35.10 13.86
OverFeat - 7 big models, 4 scales, fine stride 33.96 13.24
Table 5: Classification experiments on validation set. Fine/coarse stride refers to the number of
∆ values used when applying the classifier. Fine: ∆ = 0, 1, 2; coarse: ∆ = 0.
We report the test set results of the 2013 competition in Fig. 4, where our model (OverFeat) obtained
a 14.2% top-5 error rate by voting of 7 ConvNets (each trained with different initializations) and ranked 5th
out of 18 teams. The best error rate using only ILSVRC13 data was 11.7%. Pre-training with extra
² Our network with 6 scales takes around 2 secs on a K20x GPU to process one image.
Figure 4: Test set classification results. During the competition, OverFeat yielded a 14.2% top-5
error rate using an average of 7 fast models. In post-competition work, OverFeat ranks fifth with
13.6% error using bigger models (more features and more layers).
data from the ImageNet Fall11 dataset improved this number to 11.2%. In post-competition work,
we improve the OverFeat results down to 13.6% error by using bigger models (more features and
more layers). Due to time constraints, these bigger models are not fully trained; more improvements
are expected to appear in time.
3.5 ConvNets and Sliding Window Efficiency
ConvNets are efficient in terms of learning because sharing the weights at multiple locations regu-
larizes the filters to be more general and speeds up learning by accumulating more gradients. But by
nature, ConvNets are also computationally efficient when applied densely, i.e. no redundant com-
putations are performed, as opposed to other architectures that have to recompute the entire pipeline
for each output unit. For ConvNets, neighboring output units share common inputs in lower layers.
For example, applying a ConvNet to its minimum window size will produce a spatial output size of
1x1, as in Fig. 5. Extending to outputs of size 2x2 requires only to recompute a minimal part of the
features (yellow region in Fig. 5).
Note that while the last layers of our architecture are fully connected linear layers, during detection
these layers are effectively replaced by convolution operations with kernels of 1x1. Then the entire
ConvNet is simply a sequence of convolutions, max-pooling and thresholding operations only.
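The sketch below illustrates this equivalence (our own example with hypothetical sizes, assuming PyTorch): the weights of a fully connected layer trained on a fixed 5x5 window are copied into a convolution, which then produces a spatial grid of outputs on a larger map:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: a fully connected layer over a 5x5 window of 1024-channel features.
fc = nn.Linear(1024 * 5 * 5, 4096)
conv = nn.Conv2d(1024, 4096, kernel_size=5)          # same weights, written as a convolution
conv.weight.data.copy_(fc.weight.data.view(4096, 1024, 5, 5))
conv.bias.data.copy_(fc.bias.data)

x = torch.randn(1, 1024, 5, 5)
# On the training-sized window both give the same answer (up to float rounding).
assert torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-4)

# On a larger layer-5 map the same weights slide and yield a spatial grid of outputs.
y = conv(torch.randn(1, 1024, 7, 9))
print(y.shape)                                        # torch.Size([1, 4096, 3, 5])
```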
4 Localization
Starting from our classification-trained network, we replace the classifier layers by a regression
network and train it to predict object bounding boxes at each spatial location and scale. We then
combine the regression predictions together into objects and in turn combine these with the classifi-
cation results of each location, as we now describe.
4.1 Generating Predictions
To generate object bounding box predictions, we simultaneously run the classifier and regressor
networks across all locations and scales. Since these share the same feature extraction layers, only
the final regression layers need to be recomputed after computing the classification network. The
Figure 5: The efficiency of ConvNets for detection. During training, a ConvNet produces only
1 spatial output (top). But when applied densely over a bigger input image, it produces a spatial
output map, e.g. 2x2 (middle). Since all layers of a ConvNet are applied convolutionally, only the
yellow region needs to be recomputed when comparing to the top diagram. The feature dimension
was removed for simplicity in the top and middle diagrams and added to the bottom diagram.
output of the final softmax layer for a class c at each location provides a score of confidence that
an object of class c is present (though not necessarily fully contained) in the corresponding field of
view. Thus we can assign to each bounding box a confidence.
Localization within a view is performed by training a regressor on top of the classification network
features, described in Section 3, to predict the bounding box of the object.
4.2 Regressor Training
The regression network takes as input the pooled feature maps from layer 5. It has 2 fully-connected
hidden layers of size 4096 and 1024 channels, respectively. The output layer is different for each
class, and has 4 units which specify the coordinates for the bounding box edges. As with classifica-
tion, there are (3x3) copies throughout, resulting from the (∆x, ∆y) shifts. The architecture is shown
in Fig. 8.
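A hedged sketch of this regression head (our own PyTorch rendering; the channel count and window size follow Fig. 8, the layers are written as convolutions so the head slides over the pooled maps like the classifier, and a per-class variant would output 4 x 1000 values instead of 4):

```python
import torch.nn as nn

def make_regressor(in_channels=256, window=5):
    """Class-agnostic bounding-box regressor: layer-5 window -> 4096 -> 1024 -> 4 coordinates."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 4096, kernel_size=window), nn.ReLU(inplace=True),
        nn.Conv2d(4096, 1024, kernel_size=1), nn.ReLU(inplace=True),
        nn.Conv2d(1024, 4, kernel_size=1),   # bounding-box edges relative to the viewing window
    )
```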
We fix the feature extraction layers (1-5) from the classification network and train the regression
network using an ℓ2 loss between the predicted and true bounding box for each example. The final
regressor layer is class-specific, having 1000 different versions, one for each class. We train this
network using the same set of scales as described in Section 3. We compare the prediction of the
regressor net at each spatial location with the ground-truth bounding box, shifted into the frame of
Figure 6: Localization/Detection pipeline. The raw classifier/detector outputs a class and a con-
fidence for each location (1st diagram). The resolution of these predictions can be increased using
the method described in section 3.3 (2nd diagram). The regression then predicts the location and scale
of the object with respect to each window (3rd diagram). These bounding boxes are then merged and
accumulated into a small number of objects (4th diagram).
Figure 7: Examples of bounding boxes produced by the regression network, before being com-
bined into final predictions. The examples shown here are at a single scale. Predictions may be
better at other scales depending on the objects. Here, most of the bounding boxes, which are
initially organized as a grid, converge to a single location and scale. This indicates that the network
is very confident in the location of the object, as opposed to being spread out randomly. The top left
image shows that it can also correctly identify multiple locations if several objects are present. The
various aspect ratios of the predicted bounding boxes show that the network is able to cope with
various object poses.
reference of the regressor’s translation offset within the convolution (see Fig. 8). However, we do
not train the regressor on bounding boxes with less than 50% overlap with the input field of view:
since the object is mostly outside of these locations, it will be better handled by regression windows
that do contain the object.
Training the regressors in a multi-scale manner is important for the across-scale prediction combi-
nation. Training on a single scale will perform well on that scale and still perform reasonably on
other scales. However, training multi-scale will make predictions match correctly across scales and
exponentially increase the confidence of the merged predictions. In turn, this allows us to perform well
with only a few scales, rather than the many scales typically used in detection. The typical ratio
from one scale to another in pedestrian detection [25] is about 1.05 to 1.1; here, however, we use a
large ratio of approximately 1.4 (this number differs for each scale since dimensions are adjusted to
fit exactly the stride of our network), which allows us to run our system faster.
4.3 Combining Predictions
We combine the individual predictions (see Fig. 7) via a greedy merge strategy applied to the regres-
sor bounding boxes, using the following algorithm.
(a) Assign to C_s the set of classes in the top k for each scale s ∈ 1...6, found by taking the
maximum detection class outputs across spatial locations for that scale.
(b) Assign to B_s the set of bounding boxes predicted by the regressor network for each class in C_s,
across all spatial locations at scale s.
(c) Assign B ← ∪_s B_s
(d) Repeat merging until done:
(e) (b*_1, b*_2) = argmin_{b_1 ≠ b_2 ∈ B} match_score(b_1, b_2)
(f) If match_score(b*_1, b*_2) > t, stop.
Figure 8: Application of the regression network to layer 5 features, at scale 2, for example. (a)
The input to the regressor at this scale is 6x7 pixels spatially by 256 channels for each of the
(3x3) (∆x, ∆y) shifts. (b) Each unit in the 1st layer of the regression net is connected to a 5x5 spatial
neighborhood in the layer 5 maps, as well as all 256 channels. Shifting the 5x5 neighborhood around
results in a map of 2x3 spatial extent, for each of the 4096 channels in the layer, and for each of
the (3x3) (∆x, ∆y) shifts. (c) The 2nd regression layer has 1024 units and is fully connected (i.e. the
purple element only connects to the purple element in (b), across all 4096 channels). (d) The output
of the regression network is a 4-vector (specifying the edges of the bounding box) for each location
in the 2x3 map, and for each of the (3x3) (∆x, ∆y) shifts.
(g) Otherwise, set B ← B \ {b*_1, b*_2} ∪ box_merge(b*_1, b*_2)
In the above, we compute match_score using the sum of the distance between centers of the two
bounding boxes and the intersection area of the boxes. box_merge computes the average of the
bounding boxes’ coordinates.
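The sketch below renders this merge procedure in plain Python (our own illustration: boxes are (x1, y1, x2, y2) tuples, and the exact way the center distance and intersection area are combined into match_score is an assumption, since the text only names the two ingredients):

```python
import numpy as np

def center_distance(a, b):
    ca = np.array([(a[0] + a[2]) / 2.0, (a[1] + a[3]) / 2.0])
    cb = np.array([(b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0])
    return float(np.linalg.norm(ca - cb))

def intersection_area(a, b):
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return w * h

def match_score(a, b):
    # Lower is better: nearby, heavily overlapping boxes match well. How the two
    # ingredients are combined is our assumption, not the paper's exact formula.
    return center_distance(a, b) - intersection_area(a, b)

def box_merge(a, b):
    return tuple((a[i] + b[i]) / 2.0 for i in range(4))   # average the coordinates

def greedy_merge(boxes, t):
    boxes = [tuple(map(float, b)) for b in boxes]
    while len(boxes) > 1:
        pairs = [(i, j) for i in range(len(boxes)) for j in range(i + 1, len(boxes))]
        i, j = min(pairs, key=lambda p: match_score(boxes[p[0]], boxes[p[1]]))
        if match_score(boxes[i], boxes[j]) > t:            # no sufficiently similar pair left
            break
        merged = box_merge(boxes[i], boxes[j])
        boxes = [b for k, b in enumerate(boxes) if k not in (i, j)] + [merged]
    return boxes
```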
The final prediction is given by taking the merged bounding boxes with maximum class scores. This
is computed by cumulatively adding the detection class outputs associated with the input windows
from which each bounding box was predicted. See Fig. 6 for an example of bounding boxes merged
into a single high-confidence bounding box. In that example, some turtle and whale bounding boxes
appear in the intermediate multi-scale steps, but disappear in the final detection image. Not only
do these bounding boxes have low classification confidence (at most 0.11 and 0.12 respectively), but
their collection is also not as coherent as the bear bounding boxes, so they do not receive a significant
confidence boost. The bear boxes, however, have a strong confidence (approximately 0.5 average
confidence per scale) and high matching scores. Hence after merging, many bear bounding boxes are
fused into a single very high confidence box, while false positives disappear below the detection
threshold due to their lack of bounding box coherence and confidence. This analysis suggests that our
approach is naturally more robust to false positives coming from the pure-classification model than
traditional non-maximum suppression, by rewarding bounding box coherence.
4.4 Experiments
We apply our network to the Imagenet 2012 validation set, using the localization criterion specified
for the competition. The results are shown in Fig. 9. Training and testing data are the same for the 2012
and 2013 competitions; results are reported for both in Fig. 10. Our method is the winner of the
2013 competition with 29.9% error.
Figure 9: Localization experiments on ILSVRC12 validation set. We experiment with different
numbers of scales and with the use of single-class regression (SCR) or per-class regression (PCR).
Our multiscale and multi-view approach was critical to obtaining good performance, as can be seen
in Fig. 9: Using only a single centered crop, our regressor network achieves an error rate of 40%. By
combining regressor predictions from all spatial locations at two scales, we achieve a vastly better
error rate of 31.5%. Adding a third and fourth scale further improves performance to 30.0% error.
Using a different top layer for each class in the regressor network (Per-Class Regres-
sor (PCR) in Fig. 9) surprisingly did not outperform using only a single network shared among all
classes (44.1% vs. 31.3%). This may be because there are relatively few examples per class an-
notated with bounding boxes in the training set, while the network has 1000 times more top-layer
parameters, resulting in insufficient training. It is possible this approach may be improved by shar-
ing parameters only among similar classes (e.g. training one network for all classes of dogs, another
for vehicles, etc.).
5 Detection
Detection training is similar to classification training but in a spatial manner. Multiple locations of
an image may be trained simultaneously. Since the model is convolutional, all weights are shared
among all locations. The main difference from the localization task is the need to predict a
background class when no object is present. Traditionally, negative examples are initially taken at
random for training. Then the most offending negative errors are added to the training set in boot-
strapping passes. Independent bootstrapping passes render training complicated and risk potential
mismatches between the negative example collection and training times. Additionally, the size of
bootstrapping passes needs to be tuned to make sure training does not overfit on a small set. To cir-
cumvent all these problems, we perform negative training on the fly, by selecting a few interesting
negative examples per image such as random ones or most offending ones. This approach is more
computationally expensive, but renders the procedure much simpler. And since the feature extraction
is initially trained with the classification task, the detection fine-tuning is not as long anyway.
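One way to read this on-the-fly selection (our own sketch, not the released training code; `scores` is assumed to hold the per-location object-class scores of one image and `positive_mask` to mark locations that overlap an annotated object):

```python
import torch

def pick_negatives(scores, positive_mask, n_random=2, n_hard=2):
    """Select a few negative locations per image: some at random plus the 'most offending'
    ones, i.e. non-object locations where some object class currently scores highest."""
    neg_idx = (~positive_mask).nonzero(as_tuple=True)[0]
    worst = scores[neg_idx].max(dim=1).values              # strongest (wrong) object response
    hard = neg_idx[worst.topk(min(n_hard, len(neg_idx))).indices]
    rand = neg_idx[torch.randperm(len(neg_idx))[:n_random]]
    return torch.unique(torch.cat([hard, rand]))
```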
In Fig. 11, we report the results of the ILSVRC 2013 competition where our detection system ranked
3rd with 19.4% mean average precision (mAP). We later established a new detection state of the art
with 24.3% mAP. Note that there is a large gap between the top 3 methods and other teams (the 4th
method yields 11.5% mAP). Additionally, our approach is considerably different from the top 2 other
systems which use an initial segmentation step to reduce candidate windows from approximately
200,000 to 2,000. This technique speeds up inference and substantially reduces the number of
potential false positives. [29, 1] suggest that detection accuracy drops when using dense sliding
window as opposed to selective search which discards unlikely object locations hence reducing
false positives. Combined with our method, we may observe similar improvements as seen here
Figure 10: ILSVRC12 and ILSVRC13 competitions results (test set). Our entry is the winner of
the ILSVRC13 localization competition with 29.9% error (top 5). Note that training and testing data
is the same for both years. The OverFeat entry uses 4 scales and a single-class regression approach.
Figure 11: ILSVRC13 test set Detection results. During the competition, UvA ranked first with
22.6% mAP. In post competition work, we establish a new state of the art with 24.3% mAP. Systems
marked with * were pre-trained with the ILSVRC12 classification data.
between traditional dense methods and segmentation based methods. It should also be noted that
we did not fine tune on the detection validation set as NEC and UvA did. The validation and test
set distributions differ significantly enough from the training set that this alone improves results by
approximately 1 point. The improvement between the two OverFeat results in Fig. 11 is due to
longer training times and the use of context, i.e. each scale also uses lower resolution scales as
input.
6 Discussion
We have shown a multi-scale, sliding window approach that can be used for classification, local-
ization and detection. We applied it to the ILSVRC 2013 datasets and it currently ranks 4th on
classification, 1st on localization and 1st on detection. A second important contribution of our paper
is explaining how ConvNets can be effectively used for detection and localization tasks. These were
never addressed in [15] and thus we are the first to explain how this can be done in the context of Im-
ageNet 2012. The scheme we propose involves substantial modifications to networks designed for
classification, but clearly demonstrates that ConvNets are capable of these more challenging tasks.
Our localization approach won the 2013 ILSVRC competition and significantly outperformed 2012
and 2013 approaches. The detection model was among the top performers during the competition
and ranks first in post-competition results. We have proposed an integrated pipeline that can perform
different tasks while sharing a common feature extraction baseline entirely learned directly from the
pixels.
Our approach could be improved in numerous ways: (i) for localization, we are not currently back-
propping through the whole network; doing so is likely to improve performance. (ii) we are using
an ℓ2 loss, rather than directly optimizing the intersection-over-union (IOU) criterion on which perfor-
mance is measured. Swapping the loss to this should be possible since IOU is still differentiable,
provided there is some overlap. (iii) alternate parameterizations of the bounding box may help to
decorrelate the outputs, which will aid network training.
References
[1] J. Carreira, F. Li, and C. Sminchisescu. Object recognition by sequential figure-ground ranking. Interna-
tional Journal of Computer Vision, 98(3):243–262, 2012.
[2] J. Carreira and C. Sminchisescu. Constrained parametric min-cuts for automatic object segmentation,
release 1. http://sminchisescu.ins.uni-bonn.de/code/cpmc/.
[3] D. C. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification.
In CVPR, 2012.
[4] M. Delakis and C. Garcia. Text detection with convolutional neural networks. In International Conference
on Computer Vision Theory and Applications (VISAPP 2008), 2008.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical
Image Database. In CVPR09, 2009.
[6] I. Endres and D. Hoiem. Category independent object proposals. In Computer Vision–ECCV 2010, pages
575–588. Springer, 2010.
[7] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 2013. in press.
[8] C. Garcia and M. Delakis. Convolutional face finder: A neural architecture for fast and robust face
detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004.
[9] A. Giusti, D. C. Ciresan, J. Masci, L. M. Gambardella, and J. Schmidhuber. Fast image scanning with deep
max-pooling convolutional neural networks. In International Conference on Image Processing (ICIP),
2013.
[10] R. Hadsell, P. Sermanet, M. Scoffier, A. Erkan, K. Kavackuoglu, U. Muller, and Y. LeCun. Learning
long-range vision for autonomous off-road driving. Journal of Field Robotics, 26(2):120–144, February
2009.
[11] G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural net-
works by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
[12] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In Artificial Neural Networks
and Machine Learning–ICANN 2011, pages 44–51. Springer Berlin Heidelberg, 2011.
[13] V. Jain, J. F. Murray, F. Roth, S. Turaga, V. Zhigulin, K. Briggman, M. Helmstaedter, W. Denk, and H. S.
Seung. Supervised learning of image restoration with convolutional networks. In ICCV’07.
[14] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for
object recognition? In Proc. International Conference on Computer Vision (ICCV’09). IEEE, 2009.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural
networks. In NIPS 2012: Neural Information Processing Systems.
[16] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Hand-
written digit recognition with a back-propagation network. In D. Touretzky, editor, Advances in Neural
Information Processing Systems (NIPS 1989), volume 2, Denver, CO, 1990. Morgan Kaufmann.
[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, November 1998.
[18] Y. LeCun, F.-J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance
to pose and lighting. In Proceedings of CVPR’04. IEEE Press, 2004.
[19] S. Manen, M. Guillaumin, and L. Van Gool. Prime object proposals with randomized Prim’s algorithm. In
International Conference on Computer Vision (ICCV), 2013.
[20] O. Matan, J. Bromley, C. Burges, J. Denker, L. Jackel, Y. LeCun, E. Pednault, W. Satterfield, C. Stenard,
and T. Thompson. Reading handwritten digits: A zip code recognition system. IEEE Computer, 25(7):59–
63, July 1992.
[21] F. Ning, D. Delhomme, Y. LeCun, F. Piano, L. Bottou, and P. Barbano. Toward automatic phenotyping of
developing embryos from videos. IEEE Transactions on Image Processing, 14(9):1360–1371, September
2005. Special issue on Molecular and Cellular Bioimaging.
[22] S. Nowlan and J. Platt. A convolutional neural network hand tracker. pages 901–908, San Mateo, CA,
1995. Morgan Kaufmann.
[23] M. Osadchy, Y. LeCun, and M. Miller. Synergistic face detection and pose estimation with energy-based
models. Journal of Machine Learning Research, 8:1197–1215, May 2007.
[24] P. Sermanet, S. Chintala, and Y. LeCun. Convolutional neural networks applied to house numbers digit
classification. In ICPR, 2012.
[25] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun. Pedestrian detection with unsupervised multi-
stage feature learning. In Proc. International Conference on Computer Vision and Pattern Recognition
(CVPR’13). IEEE, June 2013.
[26] P. Sermanet and Y. LeCun. Traffic sign recognition with multi-scale convolutional networks. In IJCNN,
2012.
[27] G. Taylor, R. Fergus, G. Williams, I. Spiro, and C. Bregler. Pose-sensitive embedding by nonlinear NCA
regression. In NIPS, 2011.
[28] G. Taylor, I. Spiro, C. Bregler, and R. Fergus. Learning invariance through imitation. In CVPR, 2011.
[29] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object
recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
[30] R. Vaillant, C. Monrocq, and Y. LeCun. Original approach for the localisation of objects in images. IEE
Proc on Vision, Image, and Signal Processing, 141(4):245–250, August 1994.