Exploiting Temporal and Inter-Concept Co-Occurrence Structure to Detect
High-Level Features in Broadcast Videos
Ville Viitaniemi, Mats Sjöberg, Markus Koskela, Jorma Laaksonen
Adaptive Informatics Research Centre
Helsinki University of Technology, Finland
{ville.viitaniemi,mats.sjoberg,markus.koskela,jorma.laaksonen}@tkk.fi
Abstract
In this paper the problem of detecting high-level fea-
tures from video shots is studied. In particular, we explore
the possibility of taking advantage of temporal and inter-
concept co-occurrence patterns that the high-level features
of a video sequence exhibit. Here we present two straight-
forward techniques for the task: N-gram models and clus-
tering of temporal neighbourhoods. We demonstrate the
usefulness of these techniques on data sets of the TRECVID
high-level feature detection tasks of the years 2005-2007.
1. Introduction
Extracting semantic concepts from multimedia data has
attracted a lot of research attention recently [3]. The main
aim is to facilitate semantic indexing and concept-based re-
trieval of multimedia content. The leading principle is to
build semantic representations by extracting intermediate
semantic levels from low-level features. Recently, the in-
troduction of large-scale multimedia ontologies (e.g. [2])
as well as large manually annotated datasets have enabled
generic analysis of multimedia content as well as an in-
crease in multimedia lexicon sizes by orders of magnitude.
One of the tasks in the annual TRECVID video re-
trieval evaluation [6] is to detect the presence of predefined
high-level features (HLF)—such as “sports”, “meeting” or
“urban”—in broadcast videos that are already partitioned
into shots. The predominant approach to detecting HLFs is
to treat the problem as a generic supervised learning prob-
lem. This automatic approach is scalable to large numbers
of features. The training data is used to learn independent
models of different concepts over low-level feature distribu-
tions.
It is almost self-evident that the HLFs in videos have
temporal structure. For example, the HLFs (also called
concepts) of subsequent shots are likely to be similar. The
alternation of concepts might also exhibit some characteris-
tic temporal patterns. The existence of such temporal struc-
ture is evidenced in the mutual information between the de-
tections of the same concept at close-by time instants, eval-
uated in [7] for a time lag of one shot and by ourselves also
for longer time spans (not described here).
However, it is not self-evident that concept detection
accuracy can be improved by using the temporal
characteristics of the video streams. There might be several
hurdles. When detecting features in novel video material,
only detector outputs of a probabilistic nature are available,
not binary oracles indicating the actual presence of the
features. Often the detections of some features are very
inaccurate. In their earlier study, Yang and Hauptmann [7]
found that temporal models conditioned on oracle detections
significantly improved the detection result, but with real
detections the temporal model brought very little
improvement. They pointed out the possibility that
temporally close shots might usually be similar in their
low-level features, and the corresponding detector outputs
thus highly correlated. The resulting “miss-one-miss-all”
phenomenon would make temporal smoothing less effective.
Another nearly obvious characteristic of the high-level
features is that the features exhibited by a video shot are
mutually dependent. For instance, the feature “snow” almost
always implies “outdoor”, whereas concepts like “sports”
and “weather forecast” are practically mutually exclusive.
It has been experimentally found very beneficial to exploit
concept co-occurrence for HLF detection [4, 5].
Despite the potential obstacles, it is reasonable to believe
that with an adequate amount of training material the esti-
mation of an accurate generative model behind the obser-
vations should lead to optimal results. However, the train-
ing material used in TRECVID campaigns in years 2005-
2007 is limited when it comes to modelling the temporal
and co-occurrence structure of the videos. Even though the
total number of shots is rather large, within a given video
the temporal and co-occurrence structures are likely to be
similar. Thus the number of independent training samples
Ninth International Workshop on Image Analysis for Multimedia Interactive Services
978-0-7695-3130-4/08 $25.00 © 2008 IEEE
DOI 10.1109/WIAMIS.2008.50
for the temporal and co-occurrence models would be much
smaller, perhaps of the order of the number of videos in
the training material. This in turn is of order 100 in the
TRECVID data, certainly not a large number of training
samples to train a complex model.
In this paper we propose a set of simple post-processing
techniques to model the temporal and inter-concept co-
occurrences as an add-on, when the concepts have already
been detected without considering these issues. We com-
bine the N-gram technique for intra-concept temporal mod-
elling, and a simple clustering technique that takes ad-
vantage of inter-concept temporal and instantaneous co-
occurrences. These techniques can be seen as a superset
of the intra-concept bigram technique of [7]. We demon-
strate the usefulness of the proposed techniques within two
data sets: the TRECVID 2005/2006 HLF task development
videos and all of TRECVID 2007 HLF task videos.
2. Techniques for exploiting temporal and
inter-concept co-occurrences
In this section we describe techniques that take into
account the temporal and inter-concept co-occurrence as a
post-processing step. The techniques operate on a stream
of K-tuples corresponding to the concept detector outputs
for the sequential video shots, where K is the number of
concepts detected. The presented methods thus ignore the
absolute timing and duration of the video shots, preserving
just the ordering of the shots. Methodologically, we propose
two types of approaches: N-gram models and clusterings of
temporal neighbourhoods.
2.1. N-gram models
The N-gram model was applied to each concept
individually. In the following, $c_n \in \{0,1\}$ is the
indicator of the occurrence of the concept to be detected at
time instant $n$, and $s_n \in \mathbb{R}$ is the output of
the corresponding concept detector. $H_n(N)$ denotes the
recursive prediction history known at time instant $n$,
extending $N-1$ steps backwards in time:
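$$H_n(N) = \{\hat{p}(c_{n-i} \mid s_{n-i}, H_{n-i}(N))\}_{i=1}^{N-1} \qquad (1)$$
Using this notation, we can write the recursive N-gram
model as
$$\hat{p}(c_n \mid s_n, H_n(N)) \propto \hat{p}(s_n \mid c_n)\,\hat{p}(c_n \mid H_n(N)). \qquad (2)$$
In this recursive model
$$\hat{p}(c_n \mid H_n(N)) = \sum_{c_{n-1}} \cdots \sum_{c_{n-N+1}} p_0(c_n \mid c_{n-1} \cdots c_{n-N+1}) \prod_{i=1}^{N-1} \hat{p}(c_{n-i} \mid s_{n-i}, H_{n-i}(N)) \qquad (3)$$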
Here $p_0$ is the marginalised N-gram probability estimated
from the training data. The N-gram model was initialised at
the beginning of each video by using models of lower order,
e.g. a bigram model at the second time instant. The
conditional distributions of detector outputs
$\hat{p}(s_n \mid c_n)$ were modelled as exponential
distributions whose parameters were estimated from the
training data by means of maximum likelihood.
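As an illustration, the recursion can be sketched for the bigram case (N = 2), where the sum over the history reduces to a sum over the single previous indicator. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, detector outputs are assumed to be non-negative so that an exponential density applies directly, the bigram table is estimated by counting with add-one smoothing, and the first shot of a video falls back to a unigram prior.

```python
import math


def fit_exponential_rate(samples):
    """ML estimate of an exponential rate: the reciprocal of the sample mean."""
    return len(samples) / sum(samples)


def exp_pdf(x, rate):
    """Density of an exponential distribution (assumes x >= 0)."""
    return rate * math.exp(-rate * x)


def estimate_bigram(labels):
    """p0[a][b] = P(c_n = b | c_{n-1} = a) counted from a 0/1 label sequence.

    Add-one smoothing is an assumption of this sketch.
    """
    counts = [[1.0, 1.0], [1.0, 1.0]]
    for prev, cur in zip(labels, labels[1:]):
        counts[prev][cur] += 1.0
    return [[c / sum(row) for c in row] for row in counts]


def bigram_filter(scores, p0, prior1, rates):
    """Causal recursion for N = 2; returns the estimate of P(c_n = 1) per shot.

    rates = (rate for c = 0, rate for c = 1) of the exponential score models.
    """
    posteriors = []
    prev = None
    for s in scores:
        if prev is None:
            pred1 = prior1  # lower-order (unigram) model at the first shot
        else:
            # Eq. (3) for N = 2: marginalise over the previous indicator.
            pred1 = (1.0 - prev) * p0[0][1] + prev * p0[1][1]
        # Eq. (2): multiply by the score likelihood and normalise.
        num = exp_pdf(s, rates[1]) * pred1
        den = num + exp_pdf(s, rates[0]) * (1.0 - pred1)
        prev = num / den
        posteriors.append(prev)
    return posteriors
```

With a persistent transition table, the same detector score yields a higher posterior when the preceding shot was already judged likely to contain the concept, which is the smoothing effect the model is meant to provide.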
In addition to this causal model, we also formed the cor-
responding anticausal model that is obtained by reversing
the time flow. The causal and anticausal models were com-
bined by logarithmic averaging of the model outcomes.
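The combination step can be sketched as follows. Logarithmic averaging is read here as the element-wise geometric mean of the probability estimates; whether the result was renormalised afterwards is not stated, so this sketch leaves it unnormalised. The helper names are hypothetical, and `toy_filter` is only a stand-in for a causal temporal model so that the example runs on its own.

```python
import math


def log_average(prob_lists):
    """Element-wise geometric mean, i.e. the exponential of the mean log."""
    eps = 1e-12  # guards against log(0); an assumption of this sketch
    n = len(prob_lists)
    return [
        math.exp(sum(math.log(max(p, eps)) for p in ps) / n)
        for ps in zip(*prob_lists)
    ]


def causal_anticausal(scores, run_filter):
    """Run a causal model forwards and on the time-reversed stream, then combine."""
    forward = run_filter(scores)
    backward = run_filter(scores[::-1])[::-1]
    return log_average([forward, backward])


def toy_filter(scores, alpha=0.5):
    """Stand-in causal model (exponential smoothing), for illustration only."""
    out, state = [], scores[0]
    for s in scores:
        state = alpha * state + (1.0 - alpha) * s
        out.append(state)
    return out


combined = causal_anticausal([0.2, 0.8, 0.2], toy_filter)
```

The same `log_average` helper applies unchanged when more than two estimate streams are to be combined.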
2.2. Clusterings of temporal neighbourhoods
The N-gram model was augmented with information $C_n$
that was obtained by clustering the baseline detector
outputs within temporal neighbourhoods around the
prediction time instant $n$. The clustering was based
simultaneously on all the K concepts, i.e. 36 or 39 in the
experiments reported here. The LBG algorithm with 16
clusters was used. The cluster information was combined
with the N-gram model by estimating the N-gram model
separately for each cluster, resulting in models for
$p_0(c_n \mid C_n, c_{n-1} \cdots c_{n-N+1})$. The
cluster-specific detector outcome distribution
$\hat{p}(s_n \mid C_n, c_n)$ was modelled as a linear
interpolation between the global logistic model and a
logistic model estimated for each cluster separately.
Several different clusterings were combined by taking
logarithmic averages of the detection probability estimates
based on each clustering. The different clusterings resulted
from neighbourhoods of different time spans.
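A possible shape of the clustering step is sketched below. The text specifies only that the LBG algorithm with 16 clusters was run on the K-dimensional detector outputs within temporal neighbourhoods; everything else here is an assumption made for illustration: the neighbourhood is summarised by averaging the K-tuples within a fixed radius, distances are squared Euclidean, codevectors are split by a small additive perturbation, and the function names are hypothetical.

```python
def neighbourhood_features(stream, radius):
    """Average the K-tuples within +/- radius shots (assumed representation)."""
    feats = []
    for n in range(len(stream)):
        window = stream[max(0, n - radius): n + radius + 1]
        feats.append([sum(col) / len(window) for col in zip(*window)])
    return feats


def nearest(point, codebook):
    """Index of the codevector closest to point in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(point, codebook[j])),
    )


def lbg(points, n_clusters, n_iter=10, eps=1e-3):
    """LBG codebook design: split every codevector, then refine by Lloyd steps.

    n_clusters is assumed to be a power of two (e.g. 16).
    """
    dim = len(points[0])
    codebook = [[sum(p[d] for p in points) / len(points) for d in range(dim)]]
    while len(codebook) < n_clusters:
        # Split each codevector into a perturbed pair.
        codebook = [
            [c + sign * eps for c in cv] for cv in codebook for sign in (1, -1)
        ]
        for _ in range(n_iter):
            groups = [[] for _ in codebook]
            for p in points:
                groups[nearest(p, codebook)].append(p)
            # Move each codevector to its group mean; keep it if the group is empty.
            codebook = [
                [sum(col) / len(g) for col in zip(*g)] if g else cv
                for g, cv in zip(groups, codebook)
            ]
    return codebook
```

A cluster label for shot n would then be `nearest(feats[n], codebook)`, and varying `radius` gives the different clusterings whose estimates are combined.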
3. Experiments
A set of experiments was performed with two data sets:
the development videos of the high-level feature (HLF)
detection task of TRECVID 2005/2006, and both the
development and test data of the TRECVID 2007 HLF task. The
results on these two data sets seem to point in the same
direction: there is some advantage in using the methods
outlined in Section 2. However, for different data sets
somewhat different techniques turned out to be the most
beneficial.
For both sets of experiments, we post-processed the de-
tector outputs of our PicSOM video analysis system that
were used when participating in the TRECVID HLF de-
tection tasks [5, 1]. For this post-processing we employed
the N-gram and clustering techniques of Section 2 in vari-
ous combinations and with varying model parameters. The
post-processors as well as the original detectors operate in
a supervised manner, i.e. both data sets were partitioned
into test and training subsets. The estimates of the detector
outputs for the training set required by the post-processing
techniques were obtained by using cross-validation.
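The idea of obtaining training-set detector outputs by cross-validation can be sketched as follows: each training item is scored by a detector that never saw that item during training, so the post-processor is trained on outputs that resemble those seen on novel material. The fold assignment by index and the `train`/`predict` callables are hypothetical placeholders, not the PicSOM interface; in practice folds might instead be formed from whole videos.

```python
def out_of_fold_scores(items, labels, n_folds, train, predict):
    """Score every item with a model trained on the other folds only."""
    scores = [None] * len(items)
    for fold in range(n_folds):
        held_out = [i for i in range(len(items)) if i % n_folds == fold]
        kept = [i for i in range(len(items)) if i % n_folds != fold]
        model = train([items[i] for i in kept], [labels[i] for i in kept])
        for i in held_out:
            scores[i] = predict(model, items[i])
    return scores


def train_prior(items, labels):
    """Toy detector for illustration: the model is just the positive rate."""
    return sum(labels) / len(labels)


def predict_prior(model, item):
    """Toy detector for illustration: returns the stored positive rate."""
    return model
```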
Each post-processed detection stream was evaluated for
the individual concepts using the same metric that was
used in the TRECVID evaluation of the corresponding
year: (non-interpolated) average precision (AP) for the
2005/2006 data, and inferred average precision (infAP) [8]
for the 2007 data. For the 2005/2006 development set all
the 39 detected concepts were evaluated. For the 2007 data
we detected 36 concepts, but only 20 were evaluated for the
test data by the TRECVID organisers. In the following, we
report some highlights of the detection results. Many
interesting properties appear in the concept-wise detection
results, but due to space limitations we mainly concentrate
here on the performances averaged over all of the evaluated
concepts.
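For reference, non-interpolated average precision can be computed as in the sketch below: shots are ranked by detector score, and the precision at the rank of each relevant shot is averaged. This is a simplified version over the full ranking; any truncation depth used in the official evaluation and the sampling-based infAP estimator of [8] are not reproduced here.

```python
def average_precision(scores, labels):
    """Non-interpolated AP of a ranking induced by scores (labels are 0/1)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant shot
    return precision_sum / hits if hits else 0.0
```

For example, a ranking with relevant shots at ranks 1 and 3 gives (1/1 + 2/3) / 2, i.e. about 0.83.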
3.1. The 2005/2006 development data
The first set of experiments was performed on TRECVID
2005/2006 HLF development data (the same data was used
in both years). The data consisted of 137 videos segmented
into approximately 44 000 video shots, consisting mostly
of news broadcasts in English, Arabic, and Chinese. The
data was split approximately evenly into training and test
halves. The detector outputs for the training half of the data
were estimated by 7-fold cross validation. This data was
completely annotated with 39 concepts.
Table 1 exemplifies some concept-wise detection results
in the test set. Each row of the table contains the baseline
detection result for that concept and lists the improvement
percentages for the various alternative post-processors
(not all post-processors are tabulated). The methods “2-
gram” and “4-gram” use solely the temporal structure
within detections of individual concepts, “co” utilises clus-
tering based on the instantaneous co-occurrence of con-
cepts, whereas in “cl” the clustering is based on a larger
temporal neighbourhood. Next, “cl+co” combines the pre-
vious two techniques and “cl+co+2-gram” adds a bigram
model to this. The last two columns show the performances
of the methods that were best in the training and test sets
(“oracle”), respectively. On the last row of the table we
report the average performance (MAP) over all the 39 con-
cepts.
In many cases, the proposed post-processing techniques
are beneficial. Often the combination of techniques is more
useful than one single technique. The relative ordering of
the techniques varies from concept to concept. Within indi-
vidual concepts the improvement percentages show larger
variation than in the average performance. Both of these
two characteristics are partly explained by the statistical
fluctuations in a small sample. However, they also partly
reflect genuine differences between concepts, as the
performances in training and test sets are rather well correlated.
This facilitates method selection based on training set per-
formance. However, this correlation is not perfect, as evi-
denced by the difference between the last two columns of
the table.
3.2. TRECVID 2007 experiments
In the 2007 TRECVID experiments [1] a different video
data set consisting of a wide variety of Dutch television
programming, such as news magazines, documentaries and
educational shows was used. The full data set contains
219 videos segmented into approximately 36 000 video
shots. In this experiment we generated 15 separate post-
processors. Each post-processor was trained using 6-fold
cross-validation on the training set.
For each concept we tried two different methods of selecting
the best post-processor. Method 1 selects the one with
maximum performance in the training set. Method 2 uses a
separate validation experiment, training with one half of
the original training set and validating on the other.
In Fig. 1, the mean infAP results over all 20 evaluated
concepts are summarised. The first four bars (B1 through
O1) show the results of using visual features as the
baseline; the last four bars (B2 through O2) use both visual
and textual features. Our textual features did not work
well with the 2007 data. The first bar in both groups (B1,
B2) shows the baseline performance of our PicSOM system,
without taking advantage of temporal and inter-concept
co-occurrences. Bars T1 and T2 show the results when using
post-processors selected with method 1 in addition to the
baseline PicSOM features. Bars V1 and V2 correspond to
validation by splitting the training set, i.e. method 2.
Finally, bars O1 and O2 show the performance that would
be achieved with an optimal (“oracle”) selection of
post-processors. The median of all submitted runs from all
groups is also shown for comparison.
From Fig. 1 it is quite evident that the temporal and
inter-concept co-occurrence methods strongly improve upon
the baseline results of PicSOM. Of the two post-processor
selection methods, the method employing a validation set
seemed to perform slightly better.
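The per-concept selection of a post-processor, and its comparison with an oracle choice, can be sketched as follows. The dictionaries map concept names to per-method scores; `selection_ap` would hold training-set scores for method 1 or validation-half scores for method 2. The function names and the numbers in the usage example are hypothetical and serve only to show the mechanics.

```python
def mean_of(values):
    """Arithmetic mean of a non-empty collection of numbers."""
    values = list(values)
    return sum(values) / len(values)


def select_per_concept(selection_ap, test_ap):
    """Pick, per concept, the method scoring best on the selection data.

    Returns the chosen methods, the mean test score they achieve, and the
    mean test score of an oracle that picks the best method on the test data.
    """
    chosen = {
        concept: max(methods, key=methods.get)
        for concept, methods in selection_ap.items()
    }
    achieved = mean_of(test_ap[c][chosen[c]] for c in chosen)
    oracle = mean_of(max(test_ap[c].values()) for c in chosen)
    return chosen, achieved, oracle


# Hypothetical numbers, for illustration only.
selection = {"a": {"2-gram": 0.20, "cl": 0.30}, "b": {"2-gram": 0.50, "cl": 0.40}}
test = {"a": {"2-gram": 0.25, "cl": 0.20}, "b": {"2-gram": 0.45, "cl": 0.50}}
chosen, achieved, oracle = select_per_concept(selection, test)
```

The gap between `achieved` and `oracle` is the quantity that the comparison of selected and oracle bars is intended to expose.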
4. Conclusions
In this paper we reported a set of experiments that clearly
indicate that using both temporal and inter-concept co-
occurrences can be beneficial for detection of concepts from
news videos. Temporal information can be found both in
the occurrence history of a single concept and the occur-
rence history of other concepts. These temporal informa-
tion sources and the inter-concept co-occurrence informa-
tion seem to be partially complementary for many of the
Table 1. Examples of concept detection results using the TRECVID 2005/2006 data.
concept          baseline AP  2-gram   4-gram   co       cl       cl+co    cl+co+2-gram  best in train  oracle
face             0.790        -0.6%    -1.4%    +12.1%   +5.8%    +11.1%   +9.0%         +12.1%         +12.1%
maps             0.241        +1.7%    +1.5%    +39.8%   +0.9%    +35.8%   +32.0%        +32.0%         +39.8%
people-marching  0.024        +38.7%   +81.5%   +57.1%   +49.4%   +64.7%   +208%         +208%          +208%
court            0.067        +1.8%    +1.9%    -47.1%   +27.0%   -11.5%   -21.6%        -21.6%         +27.0%
outdoor          0.385        -4.4%    -6.5%    +12.7%   +20.0%   +22.3%   +21.2%        +22.3%         +22.3%
crowd            0.151        -0.7%    -2.9%    +30.7%   +28.9%   +36.0%   +50.3%        +36.0%         +55.4%
military         0.067        +69.0%   +96.5%   +10.0%   +47.1%   +41.1%   +123%         +123%          +123%
average          0.176        0.184    0.185    0.185    0.188    0.191    0.202         0.203          0.211
[Figure 1: bar chart of mean infAP (vertical axis ticks from 0.04 to 0.09) for bars B1, T1, V1, O1, B2, T2, V2, O2, with the median of submitted runs marked.]
Figure 1. Mean InfAP values for the experiments on the TRECVID 2007 data.
detected concepts. For simplicity, we regarded here the
original detections as an ordered symbol stream. Taking
the detailed temporal structure—e.g. shot durations—into
account could provide still more useful information.
The presented experiments point in a different direction
than the experiments by Yang and Hauptmann [7] with the
TRECVID 2005/2006 development data. They did not find
the temporal smoothing to be useful in their experiments
with TRECVID data. However, our results are in line with
the improvement reported in [4].
The usefulness of the outlined individual technical alter-
natives varies strongly between concepts. Some techniques
are even harmful for some concepts. The two data sets of
the experiments were somewhat different in this respect.
We presented two methods of selecting between the tech-
niques. Even if these methods seem to work to some extent,
the results are not fully satisfactory, as evidenced by com-
parison with oracle selections of techniques. It is expected
that the results could be improved by using more rigorous
cross-validation.
The employed techniques are rather heuristically chosen
and rudimentary. It seems likely that the information could
be better exploited by using more rigorous and principled
techniques. For example, temporal techniques routinely ap-
plied in speech recognition could be worth investigating
also in this context. However, one has to keep in mind that
the small number of temporal co-occurrence training sam-
ples in our experiments (and in the TRECVID tasks) seri-
ously limits the complexity of models that can be estimated
from the data.
It would be rather straightforward to refine also the tech-
niques presented here. One obvious direction to look at
is selection of variables on which the clusterings are per-
formed. This could be done separately for each target con-
cept based on the observed dependencies between the con-
cept detectors. Currently the clusterings are dominated by
reliably detected concepts that do not necessarily give much
information about the target concepts.
References
[1] M. Koskela, M. Sjöberg, V. Viitaniemi, J. Laaksonen, and
P. Prentis. PicSOM experiments in TRECVID 2007. In Proc.
of the TRECVID 2007 Workshop, Gaithersburg, MD, USA,
Nov. 2007.
[2] M. Naphade, J. R. Smith, J. Tešić, S.-F. Chang, W. Hsu,
L. Kennedy, A. Hauptmann, and J. Curtis. Large-scale
concept ontology for multimedia. IEEE MultiMedia, 13(3):86–
91, 2006.
[3] M. R. Naphade and T. S. Huang. Extracting semantics from
audiovisual content: The final frontier in multimedia retrieval.
IEEE Transactions on Neural Networks, 13(4):793–810, July
2002.
[4] S. Petrov, A. Faria, P. Michaillat, A. Stolcke, D. Klein, and
J. Malik. Detecting categories in news video using acoustic,
speech and image features. In TRECVID Online Proceedings.
TRECVID, Nov. 2006.
[5] M. Sjöberg, H. Muurinen, J. Laaksonen, and M. Koskela.
PicSOM experiments in TRECVID 2006. In Proc. of the
TRECVID 2006 Workshop, Gaithersburg, MD, USA, Nov.
2006.
[6] A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns
and TRECVid. In MIR ’06: Proc. of ACM MIR ’06, pages
321–330, New York, NY, USA, 2006. ACM Press.
[7] J. Yang and A. Hauptmann. Exploring temporal consistency
for video retrieval and analysis. In Proc. of ACM SIGMM Int.
Workshop on MIR, Santa Barbara, CA, Oct. 2006.
[8] E. Yilmaz and J. A. Aslam. Estimating average precision with
incomplete and imperfect judgments. In Proc. of CIKM2006,
Arlington, VA, USA, Nov. 2006.