The Joint Submission of the TU Berlin and Fraunhofer FIRST (TUBFI) to the ImageCLEF2011 Photo Annotation Task
Alexander Binder1, Wojciech Samek1,2, Marius Kloft1, Christina Müller1,
Klaus-Robert Müller1, and Motoaki Kawanabe2,1
1Machine Learning Group, Berlin Institute of Technology (TU Berlin), Franklinstr. 28/29,
10587, Berlin, Germany, www.ml.tu-berlin.de
alexander.binder@tu-berlin.de, wojciech.samek@tu-berlin.de
2Fraunhofer Institute FIRST, Kekuléstr. 7, 12489 Berlin, Germany
motoaki.kawanabe@first.fraunhofer.de
Abstract. In this paper we present details on the joint submission of TU Berlin
and Fraunhofer FIRST to the ImageCLEF2011 Photo Annotation Task. We sought
to experiment with extensions of Bag-of-Words (BoW) models at several levels
and to apply several kernel-based learning methods recently developed in our
group. For classifier training we used non-sparse multiple kernel learning (MKL)
and an efficient multi-task learning (MTL) heuristic based on MKL over kernels
from classifier outputs. For the multi-modal fusion we used a smoothing method
on tag-based features inspired by Bag-of-Words soft mappings and Markov random
walks. We submitted one multi-modal run extended by the user tags and
four purely visual runs based on Bag-of-Words models. Our best visual result,
which used the MTL method, was ranked first according to mean average precision
(MAP) within the purely visual submissions. Our multi-modal submission
achieved the first rank by MAP among the multi-modal submissions and the best
MAP among all submissions. Submissions by other groups such as BPACAD,
CAEN, UvA-ISIS, LIRIS were ranked closely.
Keywords: ImageCLEF, Photo Annotation, Image Classification, Bag-of-Words,
Multi-Task Learning, Multiple Kernel Learning, THESEUS
1 Introduction
Our goals were to experiment with extensions of Bag-of-Words (BoW) models at several
levels and to combine them with several kernel-based learning methods recently
developed in our group while working within the THESEUS project. For this purpose
we generated a submission to the annotation task of the ImageCLEF2011 Photo
Annotation Challenge [14]. This task required the annotation of 10000 images in the
provided test corpus according to 99 pre-defined categories. Note that this year's
ImageCLEF Photo task additionally provides another challenging competition [14],
a concept-based retrieval task. In the following we focus on the first-mentioned
annotation task over the 10000 images. The ImageCLEF photo corpus is challenging
due to its heterogeneity of classes. It contains classes based on concrete tangible
objects such as female, cat and vehicle as well as more abstractly defined classes such
as technical, boring or Esthetic Impression. Our visual submission and our
multi-modal submission both achieved first rank by the MAP measure among the purely
visual and multi-modal submissions, respectively. We describe our methods in a
concise manner here.

Table 1: BoW Feature Sets. See text for explanation.

Sampling Type  Local Feature    Color Channels                  BoW Mapping  No. of Features
grid           Color Quantiles  RGB, Opp, Gr                    Rank          9
grid           SIFT             RGB, Opp, Gr, N-Opp             0-1          12
grid           SIFT             RGB, Opp, Gr, N-Opp             Rank         12
bias1          SIFT             RGB, Opp, Gr                    Rank          9
bias2          SIFT             RGB, Opp, Gr, N-Opp, N-RGB      Rank         15
bias3          SIFT             RGB, Opp                        Rank          6
bias4          SIFT             RGB, Opp, Gr                    Rank          9
2 Bag-of-Words Features
All our submissions were based on discriminatively trained classifiers over kernels
using BoW features. The BoW feature pipeline can be decomposed into the following
steps: generating sampling regions, computing local features, and mapping local
features onto visual words. The coarse layout of our approach is influenced by the
works of the Xerox group on Bag-of-Words in Computer Vision [3], the challenge
submissions by INRIA groups [11] and the works on color descriptors by the University
of Amsterdam [17]. For that reason we computed for each set of parameters three BoW
features based on the regular spatial tilings 1 × 1, 2 × 2 and 3 × 1 (vertical ×
horizontal). Preliminary experiments with an additional spatial tiling 3 × 3 showed
only minor performance gains. Furthermore, we used vectors of quantile estimators
alongside the established SIFT feature [10] as local features. Table 1 shows the
computed BoW features. Information about the sampling method is given in Section 2.1.
We used the color channel combinations red-green-blue (RGB), grey (Gr),
grey-opponentcolor1-opponentcolor2 (Opp in Table 1) and a grey-value normalized
version of the last combination (N-Opp in Table 1). The total number of kernels is
large; however, their computation is a fairly automated task which requires little
human intervention.
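
The spatial tilings can be illustrated with a small sketch. It is not our actual implementation: it assumes hard (0-1) assignment of each keypoint to one visual word, reads "3 × 1" as three rows and one column, and the function name `tiled_bow` and its arguments are hypothetical. For each tiling it concatenates one histogram per tile, yielding one BoW feature per tiling.

```python
def tiled_bow(points, words, height, width, vocab_size,
              tilings=((1, 1), (2, 2), (3, 1))):
    """Build one concatenated per-tile BoW histogram for each spatial tiling.

    points: list of (row, col) keypoint positions
    words:  list of visual word indices, one per keypoint (hard assignment,
            an assumption made for this illustration)
    tilings: (rows, cols) pairs; "3 x 1" is read as 3 rows, 1 column
    """
    features = {}
    for rows, cols in tilings:
        hist = [0.0] * (rows * cols * vocab_size)
        for (y, x), word in zip(points, words):
            # Tile index of the keypoint, clamped to the last tile.
            r = min(int(y * rows / height), rows - 1)
            c = min(int(x * cols / width), cols - 1)
            hist[(r * cols + c) * vocab_size + word] += 1.0
        features[(rows, cols)] = hist
    return features


if __name__ == "__main__":
    feats = tiled_bow([(0, 0), (9, 9)], [0, 1], 10, 10, vocab_size=2)
    for tiling, hist in feats.items():
        print(tiling, hist)
```

Each of the three resulting vectors would then enter its own kernel, which is why the number of kernels is three times the number of parameter sets.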
In this year's submission, we incorporated the following new extensions, described
in Sections 2.1 and 2.2, into our BoW modeling.
2.1 Extensions on Sampling Level
In addition to BoW features created from the well-known grid sampling we tried biased
random sampling [21]. In contrast to [21] we resorted to probability maps computed
from edge detectors. Such sampling approaches offer two potential advantages over
Harris Laplace detectors. Firstly, we get keypoints located on edges rather than
corners. A motivating example can be seen in Figure 1: the bridge contains corner
points but the essential structures are lines. Similar examples are smooth borders of
buildings, borders between mountains and sky, or simply a circular structure.
Fig. 1: Upper Left: The essential structures of the bridge are lines rather than corners (author's
own work). Upper Right: Harris Laplace keypoints. Lower Left: bias1 keypoints. Lower Right:
bias4 keypoints with the same number of keypoints as detected by Harris Laplace.
Secondly, we adjusted the number of local features to be extracted per image as
a function of the image size instead of using the typical corner detection thresholds.
The reference is the number of local features extracted by grid sampling, in our case
with a grid spacing of 6 pixels. This comes from the idea that some images are
smoother overall and would yield few keypoints under a fixed detection threshold.
Furthermore, [13] showed that too sparse sampling of local features leads to reduced
classification performance. The opposite extreme is documented in [18], where
quite large improvements from sampling at each pixel are reported. As a consequence
we can tune the trade-off between computational cost and performance compared to
the dense sampling baseline. In practice we chose to extract approximately half as
many local features using biased random sampling. We tried four detectors:
– bias3 was a simplified version of an attention-based detector [7]. However, this
detector requires scale parameters to be set. The highly varying scales of motifs in
the images make it difficult to find a globally optimal set of scales without expensive
optimizations. This inspired us to try detectors which depend less on scale
parameters:
– bias1 computes an average of gradient responses over pixel-wise images of the
following color channels: grey, red minus green, green minus blue and blue minus
red.
– bias2 is like bias1 except for dropping the grey channel. Thus it will fail on grey
images but detects strong local color variations. On the other hand, such differences
between RGB color channels are more prominent in bright regions. This allows
features over normalized color channels to be used more safely on color images.
– bias4 takes the same set of color channels as the underlying SIFT descriptor and
computes the entropy of the gradient orientation histogram on the same scale as
the SIFT descriptor. Regions with low entropy are preferred in the probability map
used for biased random sampling. This detector is adapted closely to the SIFT
feature. The question behind this detector is whether the focus on peaky low-entropy
histograms constitutes an advantage.
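
As an illustration of biased random sampling, the following is a minimal sketch in the spirit of bias1, not the code used for the submission. Several details are assumptions made for the example: central differences as the gradient operator, grey taken as the mean of R, G and B, sampling with replacement, and the hypothetical names `channel_images`, `gradient_magnitude` and `bias1_keypoints`. The number of keypoints is set to about half of what a grid with 6-pixel spacing would yield.

```python
import random


def channel_images(img):
    """Split an RGB image (rows of (r, g, b) tuples) into four channels:
    grey, red minus green, green minus blue, blue minus red."""
    grey = [[(r + g + b) / 3.0 for r, g, b in row] for row in img]
    rg = [[r - g for r, g, b in row] for row in img]
    gb = [[g - b for r, g, b in row] for row in img]
    br = [[b - r for r, g, b in row] for row in img]
    return [grey, rg, gb, br]


def gradient_magnitude(ch):
    """Central-difference gradient magnitude (borders use clamped neighbours)."""
    h, w = len(ch), len(ch[0])
    mag = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx = ch[y][min(x + 1, w - 1)] - ch[y][max(x - 1, 0)]
            dy = ch[min(y + 1, h - 1)][x] - ch[max(y - 1, 0)][x]
            mag[y][x] = (dx * dx + dy * dy) ** 0.5
    return mag


def bias1_keypoints(img, grid_step=6, fraction=0.5, seed=0):
    """Draw (row, col) keypoints with probability proportional to the
    channel-averaged gradient magnitude."""
    h, w = len(img), len(img[0])
    mags = [gradient_magnitude(c) for c in channel_images(img)]
    weights = [sum(m[y][x] for m in mags) / len(mags)
               for y in range(h) for x in range(w)]
    if sum(weights) == 0:
        weights = [1.0] * (h * w)  # flat image: fall back to uniform sampling
    # Reference count: keypoints a regular grid with this spacing would give.
    n_grid = ((h + grid_step - 1) // grid_step) * ((w + grid_step - 1) // grid_step)
    n_points = max(1, int(fraction * n_grid))
    rng = random.Random(seed)
    picks = rng.choices(range(h * w), weights=weights, k=n_points)
    return [(i // w, i % w) for i in picks]


if __name__ == "__main__":
    # 12 x 12 image with a vertical black/white edge in the middle.
    img = [[(0, 0, 0)] * 6 + [(255, 255, 255)] * 6 for _ in range(12)]
    print(bias1_keypoints(img))
```

On the toy image all sampled keypoints fall on the two columns adjacent to the edge, because the probability map is zero elsewhere.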
2.2 Extensions on Bag-of-Words Mapping Level
As we used k-means for generating a set of visual words, the usual approach to
generating soft BoW mappings [5], which is adapted to radius-based clustering and
relies on one global width parameter, may become inappropriate when the density of
clusters varies strongly in the space of local features. K-means results in clusters of
varying size depending on the local density of the local features. To resolve this issue
we resorted to a rank-based BoW mapping where the vote of a local feature is 2.4
raised to the power of the negative rank. Let $RK_d(l)$ be the rank of the distance
between the local feature $l$ and the visual word corresponding to BoW dimension $d$,
with distances sorted in increasing order. Then the BoW mapping $m_d$ for dimension
$d$ is defined as:
\[
m_d(l) =
\begin{cases}
2.4^{-RK_d(l)} & \text{if } RK_d(l) \le 8 \\
0 & \text{else.}
\end{cases}
\tag{1}
\]
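
The rank-based mapping, in which a local feature votes 2.4 to the power of the negative rank for its eight nearest visual words, can be sketched as follows. This is an illustration under assumptions, not our implementation: Euclidean distance, ranks starting at 1 for the nearest visual word, and summation of votes without normalization are choices made for the example, and the function names are hypothetical.

```python
import math


def rank_bow_vote(feature, vocabulary, base=2.4, max_rank=8):
    """Return the rank-based vote of one local feature for every visual word."""
    # Distance from the local feature to each visual word (Euclidean: assumption).
    dists = [math.dist(feature, word) for word in vocabulary]
    # Visual word indices sorted by increasing distance; rank 1 is the nearest.
    order = sorted(range(len(vocabulary)), key=lambda d: dists[d])
    votes = [0.0] * len(vocabulary)
    for rank, d in enumerate(order, start=1):
        if rank > max_rank:
            break  # all further visual words receive a zero vote
        votes[d] = base ** (-rank)
    return votes


def rank_bow_histogram(features, vocabulary):
    """Sum the votes of all local features into one BoW vector."""
    hist = [0.0] * len(vocabulary)
    for f in features:
        for d, v in enumerate(rank_bow_vote(f, vocabulary)):
            hist[d] += v
    return hist


if __name__ == "__main__":
    vocab = [(0.0,), (1.0,), (5.0,)]
    print(rank_bow_vote((0.9,), vocab))
```

Because the vote depends only on the rank, no width parameter has to be adapted to the local cluster density.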
Initially we performed experiments with several alternative soft mappings. In short,
these experiments revealed that the soft mapping weights must decay sufficiently fast
as a function of the distance of a local feature to distant visual words in order to
achieve a better performance than simple hard mapping.
Our second attempt after using the mapping from [5] was to introduce a cutoff
constant $K$. Only distances below rank $K+1$ are considered. Let $V$ be a visual
vocabulary, $w_d$ the visual word from it corresponding to BoW feature dimension $d$,
and $l$ a local feature. Then the cut-off mapping is given by:
\[
m_d(l) =
\begin{cases}
\dfrac{\exp(-\sigma_{w_d}\,\mathrm{dist}(l, w_d))}
      {\sum_{v \in V \,\mid\, \mathrm{Rank}(\mathrm{dist}(l, v)) \le K} \exp(-\sigma_v\,\mathrm{dist}(l, v))}
  & \text{if } \mathrm{Rank}(\mathrm{dist}(l, w_d)) \le K \\
0 & \text{otherwise}
\end{cases}
\tag{2}
\]
where the width parameter $\sigma$ was estimated for each visual word locally as the
inverse of quantile estimators of distances to all local features from an image which
had $w_d$ as the nearest visual word.
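
A minimal sketch of the cut-off mapping is given below. It is illustrative only: the per-word widths are passed in as a plain list rather than estimated from quantiles, Euclidean distance is assumed, and the name `cutoff_soft_vote` is hypothetical. The exponents are shifted by their maximum before exponentiation, which leaves the normalized weights unchanged but avoids a zero denominator when the widths are large.

```python
import math


def cutoff_soft_vote(feature, vocabulary, sigmas, K):
    """Soft BoW vote restricted to the K nearest visual words.

    sigmas[d] is the width of visual word d (assumed to be given; the
    quantile-based estimation is not part of this sketch).
    """
    dists = [math.dist(feature, w) for w in vocabulary]
    # Indices of the K visual words with the smallest distance.
    nearest = sorted(range(len(vocabulary)), key=lambda d: dists[d])[:K]
    exponents = {d: -sigmas[d] * dists[d] for d in nearest}
    # Shift by the maximum exponent for numerical stability; the ratio of
    # the weights is unaffected by this shift.
    shift = max(exponents.values())
    weights = {d: math.exp(e - shift) for d, e in exponents.items()}
    total = sum(weights.values())
    votes = [0.0] * len(vocabulary)
    for d, w in weights.items():
        votes[d] = w / total
    return votes


if __name__ == "__main__":
    vocab = [(0.0,), (1.0,), (10.0,)]
    print(cutoff_soft_vote((0.0,), vocab, sigmas=[1.0, 1.0, 1.0], K=2))
```

With larger widths the vote concentrates on the nearest visual word, so the mapping approaches hard assignment.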
This experiment led to the conclusion that quantiles leading to large values of $\sigma$,
and thus to a fast decay of weights, yielded better performance.
Note that the rank-based voting ensures exponential drop-off per se.
2.3 Kernels
We used χ² kernels. The width was set to be the mean of the inner distances.
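
For illustration, a χ² kernel with a data-dependent width could be computed as sketched below. The exact conventions are assumptions made for the example: the χ² distance is taken as the sum of (x_i − y_i)² / (x_i + y_i) without a factor 1/2, the kernel as exp(−distance / width), and "inner distances" is read as the pairwise distances between distinct samples of the same set. The function names are hypothetical.

```python
import math


def chi2_distance(x, y):
    """Chi-square distance between two non-negative histograms."""
    total = 0.0
    for a, b in zip(x, y):
        if a + b > 0:  # bins that are empty in both histograms contribute 0
            total += (a - b) ** 2 / (a + b)
    return total


def chi2_kernel_matrix(histograms):
    """Return (kernel matrix, width) with width = mean pairwise distance.

    Requires at least two histograms that are not all identical.
    """
    n = len(histograms)
    dist = [[chi2_distance(histograms[i], histograms[j]) for j in range(n)]
            for i in range(n)]
    # Mean over distinct pairs only (excluding the zero diagonal: assumption).
    off_diag = [dist[i][j] for i in range(n) for j in range(n) if i != j]
    width = sum(off_diag) / len(off_diag)
    kernel = [[math.exp(-dist[i][j] / width) for j in range(n)]
              for i in range(n)]
    return kernel, width


if __name__ == "__main__":
    K, width = chi2_kernel_matrix([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
    print(width)
    for row in K:
        print(row)
```

Setting the width from the data in this way keeps the kernel values in a comparable range across BoW features of different dimensionality.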
2.4 Used Resources
For feature and kernel computations we resorted to a cluster with 40 mostly AMD
Opteron 275 core units with up to 2.4 GHz, which had, according to cpubenchmark.net,
a speed rank of 134 in August 2011. The OS was 32-bit, which limited the usable memory
during feature computation, in particular during visual word generation, to 3
GByte.
3 Heuristically down-scaled non-sparse Multiple Kernel Learning
Due to limited resources on a 64-bit cluster which we employed for classifier
training, we decided to try out a down-scaled version of MKL based on 25 kernels which
are the averages of the 75 kernels over the spatial tilings. Instead of evaluating many
pairs of sparsity parameters and regularization constants, the idea was to run
non-sparse MKL [9] once for each class with just one sparsity parameter tuned towards
low kernel weight regularization (p = 1.2) and one choice of the regularization
constant tuned towards high SVM regularization (C = 0.1). The obtained kernel weights
can afterwards be used in SVMs with fixed-weighted kernels, combined with several
weaker SVM regularizations and with powers applied to the kernel weights to simulate
higher sparsity. This consumes substantially less memory and in practice allows more
cores to be used in parallel. For each class one can choose via cross-validation the
optimal regularization and the optimal power on the initially obtained MKL weights.
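
The post-processing of the MKL weights can be sketched as follows. This is a simplified illustration, not our training code: the renormalization of the powered weights to sum to one, the example powers and the function names `sharpen_weights` and `combine_kernels` are assumptions, and the SVM training and cross-validation over the regularization constant are omitted.

```python
def sharpen_weights(weights, power):
    """Raise kernel weights to a power and renormalize them to sum to one.

    Powers above 1 emphasize the largest weights and so mimic a sparser
    MKL solution (the renormalization is an assumption of this sketch).
    """
    powered = [w ** power for w in weights]
    total = sum(powered)
    return [w / total for w in powered]


def combine_kernels(kernels, weights):
    """Return the fixed-weight sum of kernel matrices, sum_m weights[m] * K_m."""
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels))
             for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    mkl_weights = [0.5, 0.25, 0.25]      # stand-in for weights from one MKL run
    for power in (1.0, 2.0, 4.0):        # hypothetical candidate powers
        print(power, sharpen_weights(mkl_weights, power))
```

Each candidate power yields one combined kernel, on which ordinary SVMs with different regularization constants can be trained independently and in parallel.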
4 Output Kernel based MKL/MTL
Considering the set of semantic concepts in the ImageCLEF Photo task, one can expect
weak relations between many of them. Some of them can be established
deterministically: for example, season labels like Spring necessarily require the
photo to be an outdoor shot. Others might be present in a statistical sense: photos
showing Park Garden tend to be rather calm instead of active, although the latter is
possible. The extent of activity might depend on the dataset. The total number of
concepts is, however, prohibitive for manual modeling of all relations. One principled
approach for exploiting such relations is multi-task learning [4,19], which attempts
to transfer information between concepts. Classical Multi-Task Learning (MTL) has two
shortcomings: firstly, it often scales poorly with the number of concepts and samples.
Secondly, kernel-based MTL leads to symmetric solutions, which implies that poorly
recognized concepts can spoil classification rates of better performing classes. The
work in [16] tackles both problems. It formulates a decomposable approximation which
can be solved as a set of separate MKL problems. Thus it shares the scalability limits
of MKL approaches. Secondly,
the formulation as an approximation permits asymmetric information transfer between