The Moving Query Window for Shot Boundary Detection at
trec-12
Timo Volkmer, S.M.M. Tahaghoghi, Hugh E. Williams, James A. Thom
School of Computer Science and Information Technology
RMIT University, GPO Box 2476V
Melbourne, Australia, 3001
{tvolkmer,saied,hugh,jat}@cs.rmit.edu.au
Abstract

Digital video is widely used in multimedia databases and requires effective retrieval techniques. Shot boundary detection is a common first step in analysing video content. The effective detection of gradual transitions is an especially difficult task. Building upon our past research work, we have designed a novel decision stage for the detection of gradual transitions. Its strength lies particularly in the accurate detection of gradual transition boundaries. In this paper, we describe our moving query window method and discuss its performance in the context of the trec-12 shot boundary detection task. We believe this approach is a valuable contribution to video retrieval and worth pursuing in the future.
1 Introduction
Humans perceive their environment almost entirely by
audio-visual means. Video captures both acoustic and
visual information, and with increasing computational
power and network bandwidth, digital video applica-
tions have become widespread. We believe that video
will continue to gain importance as an information car-
rier.
Most video applications share the need for efficient re-
trieval of archived video data. To retrieve video footage
effectively, we must know its content and index it. This
is commonly performed by annotating sections sequen-
tially with textual information [12], a tedious and
expensive process. Moreover, human observation
of video is subjective and prone to error. Automatic
techniques for video content analysis are required.
The basic semantic element in video is the shot [7],
formed by a sequence of often similar frames. Selected
frames (key-frames) can be indexed to represent each
shot and allow retrieval [6]. A query using example
frames may then return all shots containing similar
key-frames. The first step in this process is often shot
boundary detection, where the video content is sepa-
rated into distinct shots.
We distinguish between different types of transitions
that delimit shots. These are classified as abrupt
transitions or cuts, and gradual transitions, which in-
clude fades, dissolves and spatial edits [9]. Informa-
tional video tends to contain more cuts, whereas en-
tertainment material is more likely to be edited using
fades, dissolves, and other gradual transitions.
According to Lienhart [14], cuts, dissolves, and fades
account for more than 99% of all transitions across all
types of video. The trec-10 [25] and trec-11 [24]
video collections support this observation.
The cut detection quality of existing systems is compa-
rable to the quality of human detection [2]. However,
gradual transitions are more difficult to detect using
automated systems [17].
In this paper we describe our moving query window
method that we apply to the problem of shot boundary
detection in the trec-12 Video Retrieval Evaluation
(trecvid).
1.1 Related work
Research on transition detection in digital video can
be categorised into methods that use compressed video
and methods that use uncompressed video. Koprinska
et al. [13] provide an overview of existing approaches.
Techniques in the compressed domain use one or more
features of the encoded footage, such as Discrete Co-
sine Transform (dct) coefficients, Macro Blocks (mb),
or Motion Vectors (mv) [4, 19, 30]. These algorithms
are often efficient because the video does not need to
be fully decoded. However, using the encoded features
directly can result in lower precision [5]. The exact
transition boundaries may not be identifiable, or grad-
ual transitions may not be distinguishable from object
movement [13].
Most approaches working on uncompressed video use
frame difference as a measure for shot boundary detec-
tion. Within one shot the difference between successive
frames is usually small. When a sufficient dissimilarity
between neighbouring frames is detected, this is inter-
preted as a cut. The same scheme is applied cumula-
tively for gradual transition detection.
There are several methods to measure the difference
between frames. Pixel-by-pixel comparison is an ob-
vious approach. Here, the number of changing pixels
and often the level of change is measured. While this
method shows good results [5], it is computationally in-
tensive and sensitive to camera motion, camera zoom,
and noise.
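As a minimal sketch of such a comparison — the greyscale input and the per-pixel change threshold are assumptions of ours, not values from the surveyed work:

    import numpy as np

    def pixel_change_fraction(frame_a, frame_b, level_threshold=20):
        # Fraction of pixels whose grey level changes by more than
        # level_threshold between two frames (illustrative values only).
        diff = np.abs(frame_a.astype(int) - frame_b.astype(int))
        return float((diff > level_threshold).mean())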
The majority of research groups use histograms to rep-
resent frame content. Differences between frames are
calculated using vector-distance measures [32]. Global
histograms suffer from their lack of spatial information.
Several researchers try to overcome this by introduc-
ing local histograms [18, 26] or adding other techniques
such as edge detection [15, 23].
Approaches that use clustering algorithms [8, 16] mon-
itor frame similarity, and identify frames that belong
to a scene change. Adjacent such frames are marked
as gradual transitions, while the remaining frames are
detected as cuts.
Methods based on transition modelling employ math-
ematical models of video data to represent different
types of transitions, and often work without the need
for thresholds [10, 31]. Transitions are identified based
on similarity to the underlying mathematical model.
Koprinska et al. [13] report that these approaches are
often sensitive to object and camera motion.
Quénot et al. [21, 22] use direct image comparison for
cut detection. To reduce false positives, motion com-
pensation is applied before image comparison. A sepa-
rate flash detection module is used to further reduce
false positives. Gradual transitions are detected by
checking whether the pixel intensity in adjacent frames
approximately follows a linear, non-constant function.
Recent work in trecvid indicates that histograms
seem to be the favoured way to represent feature
data. Adams et al. [1] propose a video retrieval sys-
tem which employs a combination of three-dimensional
rgb colour histograms and localised edge gradient his-
tograms for shot boundary detection. Recent frames
are held in memory to compute adaptive thresholds.
The system proposed by Hua et al. [11] uses global his-
tograms in the rgb colour space. Pickering et al. [20]
use a detection algorithm which employs localised rgb
colour histograms. Each frame is divided into nine
blocks and the median of the nine block distances
is computed. A transition is detected when the median
distance exceeds a fixed threshold.
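A sketch of this block-median scheme might look as below; greyscale histograms and an L1 block distance stand in here for the localised rgb histograms used in the original system, and the block and bin counts are our own assumptions:

    import numpy as np

    def median_block_distance(frame_a, frame_b, grid=3, bins=32):
        # Split both greyscale frames into a grid of blocks, compute a
        # histogram distance per block, and return the median of the
        # nine block distances.
        h, w = frame_a.shape[:2]
        dists = []
        for i in range(grid):
            for j in range(grid):
                ba = frame_a[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
                bb = frame_b[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
                ha, _ = np.histogram(ba, bins=bins, range=(0, 255))
                hb, _ = np.histogram(bb, bins=bins, range=(0, 255))
                dists.append(np.abs(ha - hb).sum())
        return float(np.median(dists))

A transition would then be reported when this median exceeds the fixed threshold, whose value is not specified in the description above.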
Wu et al. [29] propose a shot boundary detection algo-
rithm which calculates frame-to-frame difference based
on luminance information and histogram similarity in
the rgb colour space. Flash and motion detectors are
used to reduce false positives.
In the next section, we explain our approach to video
segmentation. This is an extension of our work first
presented at trec-11 [24]. In Section 3, we discuss
features and parameters used. Section 4 reviews the
results of our algorithm on the trec-12 shot bound-
ary evaluation task. We conclude with Section 5, dis-
cussing possible improvements and future work.
2 The moving query window technique
Our algorithm applies the concept of query-by-example
(qbe), popular in content-based image retrieval [27],
to shot boundary detection. The observation that all
transitions except cuts stretch over several adjacent
frames suggests that we ought to evaluate a set of
frames together. To cater for this, we employ a moving
query window, consisting of two equal-sized half win-
dows on either side of the current frame.
Figure 1: Moving query window with a half window size (hws) of 5; the five frames before and the five frames after the current frame form a collection on which the current frame is used as a query example. Figure reproduced from [28].
Pre-frames    Current frame   Post-frames   NumPreFrames
AAAAAAAAAA    A               AAAAAAAAAA    5
AAAAAAAAAA    A               AAAAABBBBB    7
AAAAAAAAAA    A               BBBBBBBBBB    10
AAAAAAAAAA    B               BBBBBBBBBB    0
AAAAAABBBB    B               BBBBBBBBBB    2

Figure 2: As the moving window traverses an abrupt transition, the number of pre-frames in the N/2 frames most similar to the current frame varies significantly. This number (NumPreFrames) rises to a maximum just before an abrupt transition, and drops to a minimum immediately afterwards. Figure reproduced from [28].
As shown in Figure 1, the current frame is not part
of the actual window. It is used as the query example
against which the other frames of the query window
can be compared. We refer to the frames forming the
preceding half window as pre-frames, and to the frames
following the current frame as the post-frames.
We evaluate frame similarity by employing one-
dimensional global colour histograms to represent
frame content. We calculate inter-frame distances us-
ing the Manhattan—also called city block—distance
measure [3]. The difference between the current frame
and its surrounding frames is usually small. This
changes when a transition is passed.
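A minimal sketch of these two building blocks — assuming frames arrive as H x W x 3 arrays, with the bin count and the normalisation step being choices of ours rather than values stated here:

    import numpy as np

    def global_histogram(frame, bins_per_channel=32):
        # One-dimensional global colour histogram: one histogram per
        # channel, concatenated and normalised by the pixel count.
        hists = [np.histogram(frame[:, :, c], bins=bins_per_channel,
                              range=(0, 255))[0] for c in range(frame.shape[2])]
        hist = np.concatenate(hists).astype(float)
        return hist / hist.sum()

    def manhattan_distance(h1, h2):
        # City-block (L1) distance between two histogram vectors.
        return float(np.abs(h1 - h2).sum())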
2.1 Abrupt transitions
To detect cuts, we rank the frames of the moving
query window based on their similarity to the current
frame [27]. The frame most similar to the current frame
is ranked highest.
Figure 2 shows how a cut can be detected using similar-
ity ranking. Shortly before the current frame passes a
cut, the half window holding the pre-frames is entirely
filled with frames of the previous shot (Shot A). Some
of the post-frames belong to the second shot (Shot B).
Since the current frame still belongs to Shot A, frames
of Shot B will be ranked lower than those of Shot A.
When the last frame of Shot A becomes the current
frame, all pre-frames will be Shot A frames, whereas
all post-frames will be from Shot B. As a result, the
number of pre-frames ranked in the top half window
reaches a maximum.
This effect is reversed when the query window advances
by one frame and the current frame is the first frame
of Shot B. Here, the number of pre-frames ranked in
the top half window drops significantly to near zero.
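The ranking step can be sketched as follows; the function assumes the distances from every window frame to the current frame have already been computed:

    def num_pre_frames(pre_distances, post_distances):
        # Rank all window frames by similarity to the current frame and
        # count how many of the hws most similar frames are pre-frames.
        labelled = [(d, True) for d in pre_distances] + \
                   [(d, False) for d in post_distances]
        labelled.sort(key=lambda pair: pair[0])    # most similar first
        top_half = labelled[:len(pre_distances)]   # top hws of 2*hws frames
        return sum(1 for _, is_pre in top_half if is_pre)

Just before a cut this count reaches its maximum of hws; immediately afterwards it drops towards zero, which is the pattern exploited for detection.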
The graph in Figure 3 shows the variation in the num-
ber of pre-frames ranked in the top half of the query
window. The diagram shows a 200-frame interval. The
four known cuts and one known gradual transition are
marked above the graph. Cuts are clearly indicated
by a rapid decrease in the number of pre-frames from
above the Upper Bound (ub) to a value below the
Lower Bound (lb).

Figure 3: Plot of the number of pre-frames in the top half of the ranked results for a 200-frame interval. The five transitions present in this interval are indicated above the plot. The parameters used were hws = 10, upper threshold (ub) = 8, and lower threshold (lb) = 2. Figure reproduced from [28].

Figure 4: Ratio between the average frame distances to the query example for pre-frames and post-frames (pre/post ratio), plotted for a 200-frame interval. We apply a dynamic threshold, calculated using a moving average and standard deviation.
2.2 Gradual transitions
As can be seen from Figure 3, a gradual transition
cannot be as clearly identified by the ranking approach
as a cut. This is mainly because gradual transitions
stretch over several adjacent frames. This observation
led us to develop a different approach for detecting
gradual transitions.
We monitor the average distance of frames within the
query window on either side of the current frame.
These values are used to build the ratio of differ-
ences between pre-frames and post-frames (pre/post
ratio). Figure 4 shows the pre/post ratio for the same
200-frame interval as previously used in Figure 3.
Gradual transitions are indicated by a peak in the
pre/post ratio, usually at the end of the transition.
The slopes of these peaks are often moderately steep,
as opposed to the very quick rise found for cuts.
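A sketch of this ratio; the epsilon guard against division by zero is our own addition:

    def pre_post_ratio(pre_distances, post_distances, eps=1e-9):
        # Average query distance of the pre-frames divided by that of
        # the post-frames; peaks near the end of a gradual transition.
        avg_pre = sum(pre_distances) / len(pre_distances)
        avg_post = sum(post_distances) / len(post_distances)
        return avg_pre / (avg_post + eps)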
We also calculate the average sum of distances over the
entire query window, which we refer to as the average
frame distance. Figure 5 shows the average frame dis-
tance. The curve has a maximum in the middle of the
gradual transition, which is typical for a dissolve. We
can identify gradual transitions by monitoring for these
patterns using peak detection and plateau detection.
The dashed lines in Figure 4 and Figure 5 mark the
adaptive lower and upper thresholds we use for decision
making.

Figure 5: Average frame distances to the query example, computed over the entire query window, shown for the same 200-frame interval as in Figure 4. The threshold is calculated based on the moving average and standard deviation.
Our algorithm allows a number of frames bordering
the current frame to be omitted from the collection as
shown in Figure 6. This results in a gap on either side
of the current frame, which we refer to as the Demil-
itarised Zone (dmz). This helps to reduce sensitivity
to camera motion and to noise that may be caused by
compression artifacts.
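The window layout with a dmz reduces to simple index arithmetic, as in this sketch (frame numbering and parameter names are ours):

    def window_indices(current, hws, dmz):
        # Pre-frames end dmz+1 frames before the current frame;
        # post-frames start dmz+1 frames after it.
        pre = list(range(current - dmz - hws, current - dmz))
        post = list(range(current + dmz + 1, current + dmz + hws + 1))
        return pre, post

    # With hws=8 and dmz=3 as in Figure 6, frame 100 yields
    # pre-frames 89-96 and post-frames 104-111.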
2.3 Algorithm details
In this section we explain parameters of our algorithm
in detail and discuss the steps for detection of cuts and
gradual transitions. The decision-making process for
gradual transition detection differs from the process
used for cut detection. This reflects the different na-
ture of gradual transitions and abrupt transitions. We
can employ a smaller part of the moving query window
for detecting cuts and perform detection of both cuts
and gradual transitions within a single pass. The be-
haviour of our algorithm can be controlled through the
following parameters:
Half Window Size (hws): This is the number of
frames on one side of the query window. This
does not include the current frame or the frames
in the dmz.
Demilitarised Zone depth (dmz): This specifies
the number of frames on each side of the current
frame which are not evaluated as part of the
query window. Figure 6 shows an example.
Lower Bound (lb): This is the lower threshold used
for cut detection. As illustrated in Figure 3, a
possible cut is indicated when the number of pre-
frames falls below this threshold.
Upper Bound (ub): This upper threshold is used for
cut detection in connection with lb. Whenever
the number of pre-frames rises above ub, a possi-
ble cut is detected.
The parameters ub and lb only affect cut detection.
hws and dmz are independently set in each decision
stage.
Detection of cuts
As we advance through the video, we monitor the num-
ber of pre-frames that are ranked in the top half of all
frames in the moving query window. We refer to this
number as NumPreFrames. We also calculate the slope
of the NumPreFrames curve. When we near an abrupt
transition, NumPreFrames rises above the upper thresh-
old (ub). After we pass a cut, NumPreFrames generally
drops below the lower bound (lb). This is also re-
flected in the slope of the NumPreFrames curve. This
slope takes on a large positive value before a cut, and
a large negative value just after passing it.

Figure 6: Moving query window with a half window size (hws) of 8, and a demilitarised zone (dmz) of three frames on either side of the current frame; the eight frames preceding and the eight frames following the current frame form a collection, against which the current frame is used as a query example. Figure reproduced from [28].
In many video clips, variations in lighting conditions
lead to false cut detection. This may occur, for ex-
ample, when the camera focus follows an object from
the shade into bright sunlight. To avoid such false
positives, we evaluate the average distances of the
top-ranked half-window and the bottom-ranked half-
window to the current frame. The average distance of
the top-ranked frames to the current frame must be
less than half that of the bottom-ranked frames.
Training experiments using the trec-10 and trec-11
shot boundary evaluation sets have shown that the
ranking criteria may sometimes be satisfied even when
only small changes between frames occur. We have
observed this, for example, in close-up shots where all
frames are nearly identical and differ only in small de-
tails. In such cases, we can further reduce the number
of false positives by requiring a significant difference
between the last pre-frame and the first post-frame.
We express this with an absolute threshold at 25% of
the maximum possible inter-frame distance.
In accordance with the trecvid decision that a cut
may stretch over up to six frames [25], we allow a fuzzi-
ness of four consecutive frames for the above criteria to
be satisfied. In summary, our algorithm reports an
abrupt transition when the following criteria are ful-
filled at any point within four adjacent frames:
1. The slope of the NumPreFrames curve has a large
negative value;
2. The average distance of the top-ranked half-window
frames to the query frame is less than half that of
the bottom-ranked frames; and
3. The last pre-frame differs from the first post-frame
by more than 25% of the maximum possible inter-
frame distance.
If all these conditions are satisfied, we report a cut with
the current frame being the first frame of the new shot.
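Taken together, the decision might be sketched as follows; the slope threshold is a placeholder of our own, since the text does not state a numeric value for it:

    SLOPE_THRESHOLD = 3.0  # hypothetical value; not specified above

    def is_cut(slope, avg_top, avg_bottom, boundary_dist, max_dist):
        # slope: slope of the NumPreFrames curve at the current frame.
        # avg_top / avg_bottom: average query distances of the top- and
        # bottom-ranked half windows.
        # boundary_dist: distance between last pre-frame and first
        # post-frame; max_dist: maximum possible inter-frame distance.
        steep_drop = slope <= -SLOPE_THRESHOLD            # criterion 1
        ranked_close = avg_top < 0.5 * avg_bottom         # criterion 2
        distinct_shots = boundary_dist > 0.25 * max_dist  # criterion 3
        return steep_drop and ranked_close and distinct_shots

In the full algorithm, these checks may be satisfied at any point within a four-frame window, reflecting the fuzziness described above.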
Detection of gradual transitions
Here, we monitor the pre/post ratio, as described in
Section 2.2, and use a peak detection algorithm to find
the local maximum in the curve whenever the upper
threshold is exceeded, as shown in Figure 4. We hold
the pre/post ratios of the past 60 frames in a history
to detect the local minimum preceding the peak. We
record this minimum as the start of a possible transi-
tion, and the local maximum as its end.
Gradual transition detection is performed after cut de-
tection within a single pass. We check that no cut has
previously been detected within the suspected grad-
ual transition. Two heuristics are employed to further
reduce false hits. We compute the area between the
pre/post ratio curve and its upper threshold, as well as
the area between the average frame distance curve and
the upper threshold. Both values must exceed a certain
fixed threshold. We empirically determined a suitable
value for this threshold by training on the trec-10 and
trec-11 test sets. Peaks in the curves covering smaller
areas are usually caused by normal scene activity.
We compute a dynamic threshold for peak detection
using a moving average, calculated over a number of
frames that are held in a buffer. The actual threshold
is computed as the standard deviation plus the moving
average.
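A minimal sketch of this adaptive threshold; the buffer length is a parameter we assume rather than one stated here:

    from collections import deque
    import statistics

    class DynamicThreshold:
        # Adaptive threshold: moving average plus standard deviation
        # over a buffer of the most recent values.
        def __init__(self, buffer_size=60):
            self.buffer = deque(maxlen=buffer_size)

        def update(self, value):
            self.buffer.append(value)

        def current(self):
            if len(self.buffer) < 2:
                return float('inf')  # no decision until history exists
            return statistics.mean(self.buffer) + statistics.stdev(self.buffer)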
Run   Vector length   hws (cuts/gradual)   lb   ub   dmz (cuts/gradual)
1     48              6/20                 2    5    0/2
2     48              6/26                 2    5    0/2
3     96              6/20                 2    5    0/2
4     96              6/26                 2    5    0/2
5     192             6/20                 2    5    0/2
6     192             6/26                 2    5    0/2

Table 1: Parameters used for each submitted run. Global colour histograms in hsv colour space have been used for all runs. We employ only a subset of the entire query window for cut detection.
3 Selection of features and parameters
We have tested our algorithm on the shot boundary
evaluation sets of trec-10 [25] and trec-11 [24]. For
the runs submitted to trec-12, we used the parame-
ters shown in Table 1.
The effectiveness of the segmentation process is evalu-
ated using the standard information retrieval measures
of recall and precision. Recall measures the fraction of
all reference transitions that are correctly detected:
    R = (Transitions correctly reported) / (Total reference transitions)

Precision represents the fraction of detected transitions
that match the reference data:

    P = (Transitions correctly reported) / (Total transitions reported)
These measures can be used for both abrupt and grad-
ual transitions. With trec-11, two additional mea-
sures were introduced to evaluate how well reported
gradual transitions overlap with reference transitions.
These are Frame Recall (FR) and Frame Precision
(FP), defined as:
    FR = (Frames correctly reported in detected transition) / (Frames in reference data for detected transition)

    FP = (Frames correctly reported in detected transition) / (Frames reported in detected transition)
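All four measures reduce to simple ratios; the numbers in this sketch are hypothetical:

    def recall_precision(correct, reference_total, reported_total):
        # R and P as defined above; the same two formulas, applied to
        # frame counts, yield frame recall (FR) and frame precision (FP).
        return correct / reference_total, correct / reported_total

    # Hypothetical example: 180 of 200 reference transitions detected,
    # with 220 transitions reported in total.
    r, p = recall_precision(180, 200, 220)   # r = 0.90, p ~= 0.82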
3.1 Features and Parameters
In the runs submitted to trec-12, we have used one-
dimensional global hsv colour histograms to represent
frames.
The best results for cut detection we have achieved so
far use a feature derived from the Daubechies wavelet
transform [27]. Despite this, we decided to further in-
vestigate using the hsv feature, as hsv feature data
can be extracted at relatively small computational cost.
We found that histograms of 16 to 64 bins per compo-
nent (48–192 bins total) perform well.
Table 1 shows all relevant parameter combinations used
for the six runs submitted to trec-12. Working to-
wards our goal of a universal shot boundary detection
algorithm, we have tried to keep parameter variations
as small as possible.
The main parameter settings used for the six submitted
runs are shown in Table 1. The lower bound and the
upper bound are used only for cut detection. We have
found that our approach detects cuts best when setting
hws to 6, lb to 2 and ub to 5. The demilitarised zone
was set to 0 for cut detection in all runs.
4 Results
Results for cut detection are very good, considering
that we have used relatively simple histograms.
Our results for gradual transition detection in trec-12
are promising but they are not competitive enough to
score among the top performing groups. We have good
control over the choice of parameters, and an accept-
able number of false positives, resulting in good preci-
sion.
However, recall for gradual transitions is average and
too low for practical use. Frame recall and frame pre-
cision of our system are among the best.
Figure 7: Performance of the moving query window for cuts and gradual transitions on the trec-12 shot boundary detection task, measured by Recall and Precision.

Figure 8: Performance of the moving query window for gradual transitions on the trec-12 shot boundary detection task, as measured by Frame Recall and Frame Precision.
5 Conclusion and Future Work
In this paper, we have presented our enhanced moving
query window method and applied it to the trec-12
shot boundary detection task. Separate decision stages
for abrupt and gradual transitions are applied during
a single pass through the video clip.
Recall and precision for all transitions are within reach
of the best-performing groups. The ranking approach
works well for cut detection, but we see much room
for improvement in gradual transition detection.
When applying our algorithm to the trec-11 shot
boundary detection task, we experienced many false
positives. We believe that this is partially due to the
lower video quality which resulted in a very noisy slope
of the average frame distance curve. It might be rea-
sonable to employ pre-filtering stages, or a second fea-
ture, such as edge-tracking.
We plan to focus on improvements for gradual transi-
tion detection and replace all fixed thresholds by adap-
tive methods to increase recall and make the system
more applicable to different types of video.
We will explore using three-dimensional histograms
and localised histograms to consider spatial informa-
tion. We aim to experiment with different feature
spaces, and to investigate the application of wavelet
transform features for gradual transition detection.
We believe that our approach can be developed further
and that, despite its current limitations, it constitutes
a useful contribution to video retrieval.
References
[1] B. Adams, A. Amir, C. Dorai, S. Ghosal, G. Iyen-
gar, A. Jaimes, C. Lang, C.-Y. Lin, A. Natsev,
M. Naphade, C. Neti, H. J. Nock, H. H. Per-
muter, R. Singh, J. R. Smith, S. Srinivasan, B. L.
Tseng, T. V. Ashwin, and D. Zhang. IBM Re-
search TREC-2002 video retrieval system. In
E. M. Voorhees and L. P. Buckland, editors,
NIST Special Publication 500-251: Proceedings of
the Eleventh Text REtrieval Conference (TREC
2002), Gaithersburg, MD, USA, 19–22 November
2002.
[2] P. Aigrain, H. J. Zhang, and D. Petkovic. Content-
based representation and retrieval of visual media:
A state-of-the-art review. Multimedia Tools and
Applications, 3(3):179–202, 1996.
[3] D. Androutsos, K. N. Plataniotis, and A. N.
Venetsanopoulos. Distance measures for color im-
age retrieval. In Proceedings of the IEEE Interna-
tional Conference on Image Processing (ICIP’98),
volume 2, pages 770–774, Chicago, IL, USA, 4–7
October 1998.
[4] F. Arman, A. Hsu, and M.-Y. Chiu. Image pro-
cessing on encoded video sequences. Multimedia
Systems Journal, 1(5):211–219, March 1994.
[5] J. S. Boreczky and L. A. Rowe. Comparison of
video shot boundary detection techniques. Journal
of Electronic Imaging, 5(2):122–128, April 1996.
[6] R. Brunelli, O. Mich, and C. M. Modena. A survey
of the automatic indexing of video data. Journal
of Visual Communication and Image Representa-
tion, 10(2):78–112, 1999.
[7] A. Del Bimbo. Visual Information Retrieval. Mor-
gan Kaufmann Publishers Inc., San Francisco,
CA, USA, 2001.
[8] B. Günsel, A. M. Ferman, and A. M. Tekalp. Tem-
poral video segmentation using unsupervised clus-
tering and semantic object tracking. Journal of
Electronic Imaging, 7(3):592–604, 1998.
[9] A. Hampapur, R. Jain, and T. Weymouth. Digital
video segmentation. In Proceedings of the ACM
International Conference on Multimedia, pages
357–364, San Francisco, CA, USA, October 1994.
[10] A. Hampapur, R. Jain, and T. E. Weymouth.
Production model based digital video segmenta-
tion. Journal of Multimedia Tools and Applica-
tions, 1(1):9–46, 1995.
[11] X.-S. Hua, P. Yin, H. Wang, J. Chen, L. Lu, M. Li,
and H.-J. Zhang. MSR-Asia at TREC-11 video
track. Technical report, Media Computing Group,
Microsoft Research Asia, 2002.
[12] F. M. Idris and S. Panchanathan. Review of im-
age and video indexing techniques. Journal of Vi-
sual Communication and Image Representation,
8(2):146–166, June 1997.
[13] I. Koprinska and S. Carrato. Temporal video seg-
mentation: A survey. Journal of Signal Process-
ing: Image Communication, 16(5):477–500, 2001.
[14] R. W. Lienhart. Comparison of automatic shot
boundary detection algorithms. Proceedings of the
SPIE; Storage and Retrieval for Image and Video
Databases VII, 3656:290–301, December 1998.
[15] R. W. Lienhart. Reliable transition detection in
videos: A survey and practitioner’s guide. Inter-
national Journal of Image and Graphics (IJIG),
1(3):469–486, July 2001.
[16] C.-C. Lo and S.-J. Wang. Video segmentation us-
ing a histogram-based fuzzy C-Means clustering
algorithm. Journal of Computer Standards and
Interfaces, 23:429–438, 2001.
[17] S. Marchand-Maillet. Content-based video re-
trieval: An overview. Technical Report 00.06,
CUI - University of Geneva, Geneva, Switzerland,
2000.
[18] A. Nagasaka and Y. Tanaka. Automatic Video In-
dexing and Full-Video Search for Object Appear-
ances. Visual Database Systems, 2:113–127, 1992.
[19] J. Nam and A. H. Tewfik. Dissolve transi-
tion detection using B-Splines interpolation. In
A. Del Bimbo, editor, IEEE International Confer-
ence on Multimedia and Expo (ICME), volume 3,
pages 1349–1352, New York, NY, USA, 30 July –
2 August 2000.
[20] M. J. Pickering, D. Heesch, R. O’Callaghan,
S. Rüger, and D. Bull. Video retrieval using global
features in keyframes. Technical report, Depart-
ment of Computing, Imperial College, 180 Queen’s
Gate, London, UK, 2002.
[21] G. M. Quénot. TREC-10 shot boundary detec-
tion task: CLIPS system description and evalu-
ation. In E. M. Voorhees and D. K. Harman,
editors, NIST Special Publication 500-250: Pro-
ceedings of the Tenth Text REtrieval Conference
(TREC 2001), pages 142–151, Gaithersburg, MD,
USA, 13–16 November 2001.
[22] G. M. Quénot, D. Moraru, L. Besacier, and
P. Mulhem. CLIPS at TREC-11: Experiments in
video retrieval. In E. M. Voorhees and L. P. Buck-
land, editors, NIST Special Publication 500-251:
Proceedings of the Eleventh Text REtrieval Con-
ference (TREC 2002), Gaithersburg, MD, USA,
19–22 November 2002.
[23] G. M. Quénot and P. Mulhem. Two systems
for temporal video segmentation. In Proceed-
ings of the European Workshop on Content Based
Multimedia Indexing (CBMI’99), pages 187–194,
Toulouse, France, 25–27 October 1999.
[24] A. F. Smeaton and P. Over. The TREC-2002
video track report. In E. M. Voorhees and L. P.
Buckland, editors, NIST Special Publication 500-
251: Proceedings of the Eleventh Text REtrieval
Conference (TREC 2002), pages 69–85, Gaithers-
burg, MD, USA, 19–22 November 2002.
[25] A. F. Smeaton, P. Over, and R. Taban. The
TREC-2001 video track report. In E. M. Voorhees
and D. K. Harman, editors, NIST Special Pub-
lication 500-250: Proceedings of the Tenth Text
REtrieval Conference (TREC 2001), pages 52–60,
Gaithersburg, MD, USA, 13–16 November 2001.
[26] D. Swanberg, C.-F. Shu, and R. Jain. Knowledge
guided parsing in video databases. Proceedings of
the SPIE; Image Storage and Retrieval Systems,
1908:13–21, February 1993.
[27] S. M. M. Tahaghoghi, J. A. Thom, and H. E.
Williams. Multiple-example queries in content-
based image retrieval. In Proceedings of the
Ninth International Symposium on String Pro-
cessing and Information Retrieval (SPIRE’2002),
pages 227–240, Lisbon, Portugal, September 2002.
[28] S. M. M. Tahaghoghi, J. A. Thom, and H. E.
Williams. Shot boundary detection using the
moving query window. In E. M. Voorhees and
L. P. Buckland, editors, NIST Special Publication
500-251: Proceedings of the Eleventh Text RE-
trieval Conference (TREC 2002), pages 529–538,
Gaithersburg, MD, USA, 19–22 November 2002.
[29] L. Wu, X. Huang, J. Niu, Y. Xia, Z. Feng, and
Y. Zhou. FDU at TREC2002: Filtering, Q&A,
web and video tracks. Technical report, Fudan
University, Shanghai, China, 2002.
[30] B. L. Yeo and B. Liu. Rapid scene analysis on
compressed video. IEEE Transactions on Circuits
and Systems for Video Technology, 5(6):533–544,
December 1995.
[31] D. Zhang, W. Qi, and H. J. Zhang. A new shot
boundary detection algorithm. Lecture Notes in
Computer Science, 2195:63–70, 2001.
[32] H. J. Zhang, A. Kankanhalli, and S. W. Smoliar.
Automatic partitioning of full-motion video. Mul-
timedia Systems Journal, 1(1):10–28, June 1993.