Shot Boundary Detection by a Hierarchical Supervised Approach
G. Cámara-Chávez∗†, F. Precioso∗, M. Cord∗, S. Phillip-Foliguet∗, A. de A. Araújo†
∗Equipe Traitement des Images et du Signal-ENSEA / CNRS UMR 8051
6 avenue du Ponceau 95014 Cergy-Pontoise - France
†Federal University of Minas Gerais - DCC
Av. Antônio Carlos 6627 31270-010 - MG - Brazil
Keywords: shot boundary detection, cut, gradual transition, dissolve, fade.
Abstract – Video shot boundary detection plays an important
role in video processing. It is the first step toward video-
content analysis and content-based video retrieval. We develop
a hierarchical approach for shot boundary detection based
on the assumption that hierarchy helps decision-making
by reducing the amount of indeterminate transitions. Our
method consists of first detecting abrupt transitions using
a learning-based approach; then non-abrupt transitions are
split into gradual transitions and normal frames. We describe
in this paper a machine learning system for shot boundary
detection. The core of this system is a kernel-based SVM
classifier. We present results obtained for the TRECVID
2006 shot extraction task.
1. INTRODUCTION

The first step for video-content analysis, content-based
video browsing and retrieval is the partitioning of a video
sequence into shots. A shot is defined as an image sequence
that presents continuous action, captured in a single
operation of a single camera. The joining of two
shots can be of two types: abrupt transitions (ATs) and
gradual transitions (GTs). According to the editing process
of transitions, 99% of all edits fall into one of the following
three categories: cuts, fades, or dissolves. An AT is an
instantaneous change from one shot to another. In a fade-in
a shot gradually appears from a constant image, while in a
fade-out a shot gradually disappears into a constant image.
In a dissolve, the current shot fades out while the next
shot fades in.
A common approach to detect GTs is to compute the
difference between two adjacent frames (with color, motion,
edge and/or texture features) and to compare this difference
to a preset threshold. The main drawback of these
approaches lies in selecting an appropriate threshold for
different kinds of videos. Another way is to see this problem
as a categorization task. Most of the works based on this
approach consider a low number of features because of
computational and classifier limitations. Then, to compensate
for this lack of information, they need pre-processing
steps, like motion compensation or illumination correction.
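The threshold-based baseline criticized above can be sketched as follows. This is a minimal illustration, not any cited system: the histogram quantization, the L1 distance and the threshold value are all assumptions made for the example.

```python
def color_histogram(frame, bins=8):
    """Quantize RGB pixels into a normalized color histogram.

    frame: iterable of (r, g, b) tuples with channels in 0..255.
    """
    hist = [0.0] * (bins ** 3)
    for r, g, b in frame:
        idx = ((r * bins) // 256) * bins * bins \
            + ((g * bins) // 256) * bins \
            + ((b * bins) // 256)
        hist[idx] += 1.0
    n = max(1, len(frame))
    return [h / n for h in hist]

def detect_cuts_by_threshold(frames, threshold=0.5):
    """Flag a transition wherever the L1 distance between adjacent
    frame histograms exceeds a preset (hand-tuned) threshold."""
    hists = [color_histogram(f) for f in frames]
    cuts = []
    for t in range(1, len(hists)):
        d = sum(abs(a - b) for a, b in zip(hists[t - 1], hists[t]))
        if d > threshold:
            cuts.append(t)
    return cuts
```

The weakness the text points out is visible here: a single `threshold` must be chosen for all videos, and motion or illumination changes can push `d` over it just as a real transition would.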
In this work, we limit our attention to ATs, fades and
dissolves. We propose to use a supervised
kernel-based SVM (Support Vector Machine) technique
with a hierarchical classification procedure composed of
two stages. In the first stage, we extract features computed
on frame differences in order to detect ATs. The features
are classified by a kernel-based SVM classifier, because of
its well-known performance in statistical learning and
information retrieval. Kernel functions can efficiently deal with
a large number of features. With many features it is possible
to better describe the frame information and to better handle
illumination changes and fast movement problems. Thus
the pre-processing steps are not necessary anymore. Once
the video sequence is segmented into AT-free segments
we can execute the second stage. We extract new features
in each segment in order to detect possible GTs without
using any sliding window, unlike most authors. Then we
classify the candidate frames with the same
kernel-based SVM classifier. We present satisfying results
obtained at TRECVID 2006 shot boundary extraction task.
The paper is organized as follows: Section 2 describes
the features and the learning scheme for sharp cut
extraction, and Section 3 is devoted to gradual transition
extraction. We then present our TRECVID 2006 evaluation
and conclude.
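The two-stage hierarchy described above can be sketched as the following skeleton. The classifier objects and feature extractors are placeholders for the SVM stages described later; this is an illustrative outline, not the authors' implementation.

```python
def detect_shot_boundaries(frames, cut_clf, gt_clf, pair_features, seg_features):
    """Two-stage hierarchy: stage 1 classifies each adjacent frame pair
    as cut / no-cut; stage 2 looks for gradual transitions only inside
    the resulting cut-free segments.

    cut_clf / gt_clf: placeholder binary classifiers (True = transition).
    pair_features / seg_features: placeholder feature extractors.
    """
    # Stage 1: abrupt transitions from pairwise dissimilarity features.
    cuts = [t for t in range(1, len(frames))
            if cut_clf(pair_features(frames[t - 1], frames[t]))]

    # Split the sequence into AT-free segments delimited by the cuts.
    segments, prev = [], 0
    for c in cuts + [len(frames)]:
        if c > prev:
            segments.append((prev, c - 1))
        prev = c

    # Stage 2: classify each cut-free segment as gradual / normal.
    graduals = [(s, e) for (s, e) in segments
                if gt_clf(seg_features(frames, s, e))]
    return cuts, graduals
```

The point of the hierarchy is that stage 2 never has to reconsider frames already explained by an abrupt transition, which reduces the amount of indeterminate material it must classify.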
2. ABRUPT TRANSITION SEGMENTATION
Cuts generally correspond to an abrupt change between
two consecutive frames in the sequence. Automatic detec-
tion is based on the information extracted from the video
(brightness, color distribution, motion, edges, etc.).
Cut detection between shots with little motion and con-
stant illumination is usually done by looking for sharp
brightness changes. However, brightness changes cannot be
easily related to transition between two shots, in the pres-
ence of continuous object motion, or camera movements,
or change of illumination.
Thus, we need to combine different and more complex
visual features to avoid such problems. We extract,
for every frame in the video stream, one feature vector
concatenating the following visual information: several
color histograms (RGB, HSV and opponent color), Zernike
moments (Zm), Fourier-Mellin moments (Fm), projection
histograms (horizontal and vertical) and phase correlation.
Then a pairwise similarity measure is calculated. We test
different distance metrics: L1 norm, cosine dissimilarity,
histogram intersection and χ² distance. Finally, the
dissimilarity feature vector (one distance for each type of
feature: color histogram, moments and projection histograms)
of each pair of frames is used as input to the classifier.
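The four distance metrics listed above can be written compactly as follows; this is a minimal pure-Python illustration, with the final helper showing (under the simplifying assumption of one L1 distance per feature type) how a per-feature dissimilarity vector is assembled for the classifier.

```python
import math

def l1(p, q):
    """L1 norm of the difference between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))

def cosine_dissimilarity(p, q):
    """1 - cosine similarity; 1.0 if either vector is null."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q) if norm_p and norm_q else 1.0

def histogram_intersection_dissimilarity(p, q):
    """Intersection is a similarity; for normalized histograms,
    1 - intersection is a dissimilarity in [0, 1]."""
    return 1.0 - sum(min(a, b) for a, b in zip(p, q))

def chi2(p, q):
    """Chi-squared distance between two histograms."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)

def dissimilarity_vector(features_t, features_t1):
    """One distance per feature type (histograms, moments, projections),
    concatenated into the vector fed to the classifier.  The choice of
    L1 for every feature type here is illustrative."""
    return [l1(f, g) for f, g in zip(features_t, features_t1)]
```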
3. GRADUAL TRANSITION SEGMENTATION
The visual patterns of many gradual transitions are not
as clearly or uniquely defined as those of abrupt transitions.
978-961-248-029-5/07 © 2007 UM FERI2007 IWSSIP & EC-SIPMCS, Slovenia
Among the numerous types of gradual transitions, dissolve
is considered as the most common one, but also the most
difficult to detect, particularly on real videos.
3.1 Dissolve detection
Once the video sequence is segmented into cut-free
segments, we perform a three-step dissolve detection: a
coarse gradual-transition pattern detection based on curve
matching, a refinement based on an improved gradual-
transition modeling error, and a learning stage for separating
gradual transitions from non-gradual transitions.
Since gradual transitions have very variable lengths,
going from 6 to almost 100 frames, the number of frames
to be considered for the detection is difficult to specify.
Our machine learning approach allows us to avoid
using a sliding window. Furthermore, gradual transitions
are not as clearly, or uniquely defined, as ATs. The dissolve
modeling error, proposed by Won et al., is the difference
between an ideal dissolve shape and the shape of temporal
variation curve of the illumination variance.
For the first step, we extend the dissolve modeling error
to the Effective Average Gradient (EAG). Indeed, in the
presence of a dissolve, both the illumination variance and
the EAG curve can be matched with a parabola. However,
not all the video segments exhibiting a parabola shape
along these curves are dissolves.
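The coarse parabola-matching idea can be illustrated as follows: fit a quadratic to the curve over a candidate segment and check that it opens upward (the variance/EAG dip characteristic of a dissolve) with a small fitting residual. This is a sketch; the paper's exact curve-matching procedure and any residual threshold are not specified here.

```python
import numpy as np

def parabola_match(curve):
    """Fit a quadratic a*t^2 + b*t + c to a 1-D curve segment.

    Returns (opens_upward, residual): a dissolve makes the
    illumination-variance and EAG curves dip, i.e. the best-fit
    parabola opens upward (a > 0) with a small residual.
    """
    curve = np.asarray(curve, dtype=float)
    t = np.arange(len(curve))
    a, b, c = np.polyfit(t, curve, 2)
    fit = np.polyval([a, b, c], t)
    residual = float(np.mean((fit - curve) ** 2))
    return a > 0, residual
```

A candidate segment is kept only when both conditions hold; as the text notes, this coarse test alone over-generates, which is why the DCD verification step follows.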
Thus, in the second step, we verify all the video segments
validated by the dissolve-modeling-error step, using a
modification of the Double Chromatic Difference (DCD)
test. Proposed by Yu et al., the DCD feature can
differentiate a dissolve from zoom, pan and wipe:
DCD(t) = Σ_{x,y} F( | (f(x,y,S) + f(x,y,E)) / 2 − f(x,y,t) | )    (1)
where t represents the time, S ≤ t ≤ E, f(x,y,S) is the
starting frame and f(x,y,E) the ending one of a possible
dissolve interval. F(.) is a fixed threshold function.
We propose a modification of this well-known descriptor,
greatly reducing its computational complexity. Indeed,
we use projection histograms (1D) instead of the frame
content (2D). Projection histograms allow us not only to
reduce the size of data concerned with DCD test but also
to preserve color and spatial information. For our modified
DCD, the formulation Eq. (1) remains the same if f(x,y,t)
represents the projection histogram at frame t. Furthermore,
we already extracted projection histograms for each frame
in the first step of our shot boundary detector.
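Our modified DCD can be sketched as follows, applying Eq. (1) to 1-D projection histograms instead of 2-D frames. The threshold function F and its value `tau` are illustrative assumptions; the paper does not report the threshold it uses.

```python
def dcd_1d(proj, S, E, tau=4.0):
    """Modified DCD over 1-D projection histograms.

    proj[t] is the projection histogram of frame t; S and E bound a
    candidate dissolve interval.  Per Eq. (1), each bin difference
    |(f(S)+f(E))/2 - f(t)| is passed through a fixed threshold
    function F (tau is an illustrative value) and summed.
    """
    def F(x):
        return x if x > tau else 0.0

    curve = []
    for t in range(S, E + 1):
        curve.append(sum(F(abs((a + b) / 2.0 - c))
                         for a, b, c in zip(proj[S], proj[E], proj[t])))
    return curve
```

For an ideal linear dissolve, each histogram bin interpolates linearly between frames S and E, so the DCD curve dips to its minimum at the interval center, which is exactly the point C = argmin_t DCD(t) used below.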
Then, in the last step, we classify the remaining video
intervals with our machine learning core, using the same
kernel-based SVM classifier and the following features. Let
C = argmin_t DCD(t), t ∈ [S,E], be the position of the
minimum of the downward-parabolic DCD shape (normally
its center).
1) Ddata: different information extracted from the
dissolve region; the features used are:
a) 2 correlation values, one between S and C, the
other between C and E of the candidate interval;
b) 2 color histogram differences, one between S
and C, the other between C and E of the candidate
interval;
c) correlation by blocks of interest in the sequence:
this feature is computed only on the target intervals
and uses the dissolve descriptor.
2) DCD features:
a) the quadratic coefficient of the parabola that
best approximates the DCD curve;
b) the "depth" of this parabola, defined as the height
difference between C and S (or E);
3) SCD features: the modified DCD; we use the
same features as for the DCD (previous item).
4) VarProj: difference of the projection histograms
extracted in the first step (cut detection).
5) Motion: motion vectors are also extracted in the
first step, when the phase correlation method is com-
puted; for each block we compute the magnitude of the
motion vector.
We concatenate them into one feature vector given as input
to our kernel-based SVM classifier in order to separate
"dissolve" from "non-dissolve" video segments.
3.2 Fade detection
Fade-in and fade-out often occur together as a fade group,
i.e., a fade group starts with a shot fading out to a color
C which is then followed by a sequence of monochrome
frames of the same color, and it ends with a shot fading
in from color C. Fade groups formed this way are
often referred to as a single fade. Alattar detects fades by
recording all negative spikes in the second derivative of the
frame luminance variance curve. The drawback of this
approach is that motion would also cause such spikes.
Lienhart proposes detecting fades by fitting a regression
line to the frame standard deviation curve. Truong
et al. observe the mean difference curve, examining the
constancy of its sign within a potential fade region.
We present further extensions to these techniques.
Since a fade is a special case of a dissolve we can explore
some of the features used for dissolve detection. The salient
features of our two-step fade detection algorithm are the
following:
1) The existence of monochrome frames is a very
good clue for detecting all potential fades, and these are
used in our algorithm. In a quick fade, the monochrome
sequence may consist of a single frame, while in
a slower fade it may last up to 100 frames. There-
fore, detecting monochrome frames (candidate regions) is
the first step of our algorithm.
2) In the second step, we use a descriptor that char-
acterizes a dissolve, our improved DCD. The variance
curves of fade-out and fade-in frame sequences have
a half-parabolic shape independent of C. Therefore, if
we compute the DCD feature in the region where the
fade-out occurs, we obtain a parabola shape; the same
principle is applied to detect the fade-in.
3) We also constrain the variance of the starting frame
of a fade-out and of the ending frame of a fade-in to be
above a threshold, to eliminate false positives caused by
dark scenes, thus preventing them from being considered
as fades.
Some of the techniques used for detecting fades are not
tolerant to fast motion, which produces the same effect
as a fade. The DCD feature is more tolerant to motion and
other editing effects or combinations of them. Our modified
DCD feature preserves all the characteristics of the original
feature, with the advantage that we reduce the dimension
of the features from 2D to 1D.
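The first step of the fade detector, finding runs of near-monochrome frames by thresholding the luminance variance, can be sketched as follows. The default threshold of 200 is the value quoted in the experiments; the frame representation (a flat list of pixel luminances) is a simplifying assumption.

```python
def variance(frame):
    """Luminance variance of a frame given as a list of pixel luminances."""
    n = len(frame)
    mean = sum(frame) / n
    return sum((p - mean) ** 2 for p in frame) / n

def find_monochrome_runs(frames, var_threshold=200.0):
    """Step 1: candidate fade regions are maximal runs of frames whose
    luminance variance falls below the threshold (monochrome frames)."""
    runs, start = [], None
    for t, f in enumerate(frames):
        if variance(f) < var_threshold:
            if start is None:
                start = t
        elif start is not None:
            runs.append((start, t - 1))
            start = None
    if start is not None:
        runs.append((start, len(frames) - 1))
    return runs
```

Each run returned here is then checked (step 2) for the half-parabolic DCD shapes on either side, a fade-out ending at the run and a fade-in leaving it.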
4. EXPERIMENTAL RESULTS

We evaluated our shot extraction module by participating
in the Shot Boundary Detection (SBD) task of TRECVID
2006. Our training set for AT detection consisted of a single
video of 9078 frames (5 min 2 s). This video was
captured from a Brazilian TV station and is composed of
a segment of commercials. The training video was labeled
manually by ourselves. Our training set for GT consisted
of the TRECVID-2002 video data set corpus. We used
an SVM classifier and trained it with a Gaussian kernel.
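The Gaussian (RBF) kernel and the resulting SVM decision function can be written as follows. The kernel width `sigma` and the trained parameters (`support_vectors`, `alphas`, `labels`, `bias`) are placeholders; the paper does not report its trained kernel parameter.

```python
import math

def gaussian_kernel(u, v, sigma=1.0):
    """RBF kernel k(u, v) = exp(-||u - v||^2 / (2 sigma^2)).
    sigma here is illustrative, not the paper's trained value."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2.0 * sigma ** 2))

def svm_decision(x, support_vectors, alphas, labels, bias, sigma=1.0):
    """SVM decision sign(sum_i alpha_i y_i k(x_i, x) + b) evaluated on
    hypothetical trained parameters (labels y_i in {+1, -1})."""
    s = sum(a * y * gaussian_kernel(sv, x, sigma)
            for sv, a, y in zip(support_vectors, alphas, labels))
    return 1 if s + bias >= 0 else -1
```

Because the kernel depends only on distances between feature vectors, a large concatenated dissimilarity vector costs no more to classify than a small one per support vector, which is what lets the method absorb many feature types without pre-processing.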
We ran our algorithm on the TRECVID-2006 test set and
report the mean precision and mean recall obtained. A
good detector should have both high precision and high
recall. The test collection comprises about 7.5 hours,
including 13 videos for a total size of about 4.64 GB. The
total number of frames is 597,043 and the number of
transitions is 3785. The collection contains 1844 abrupt
transitions (48.7% of all transitions) and 1509 dissolves
(39.9% of all transitions). The reference data was created
by a student at the National Institute of Standards and
Technology (NIST) whose task was to identify all transitions.
In Table 1, we present the visual feature vectors for
dissolve detection used for the 10 runs of our experiment.
The runs for AT detection are the same as presented
previously.
mdata, DCD, SCD
mdata, DCD, VarProj
mdata, DCD, Motion
mdata, SCD, VarProj
mdata, SCD, Motion
mdata, SCD, VarProj, Motion
Table 1: 10 best combinations of visual features for dissolve
detection.
For fade detection we chose a threshold of 200 for the
variance of each frame: if the variance is lower than this
value, we consider the frame monochrome and a possible
fade. After that, it is necessary to check that the interval
presents two downward parabolas, one for the fade-in and
the other for the fade-out.
Fig. 1 provides the names of the teams that participated
in TRECVID 2006 and their respective markers in
Figs. 2, 3 and 4. Fig. 2 presents the results for
abrupt transitions, for 10 runs with different combinations
of features, different dissimilarity measures, as well as
various kernels for the SVM classification. Fig. 3 presents
the results for gradual transitions for 10 other runs. The
results show that our system performs better than
other learning-based systems that use pre-processing steps
like motion compensation and post-processing like
flashlight filtering.
Figure 1: Participants of TRECVID 2006 Evaluation
Figure 2: Abrupt transition extraction TRECVID 2006
In Table 2 we show the performance of two runs of
our system for GT detection, measured in recall and
precision; and frame-recall and frame-precision (measures
to evaluate how well reported gradual transitions overlap
with reference transitions). Here we compare the accuracy
of the DCD method of Yu et al. (Etis3) and our modified DCD
method (Etis7). We reduce the computational complexity,
from a 2D descriptor to a 1D descriptor, preserving the
performance of the DCD method.
Table 2: Results for DCD features and modified DCD
features.
Figure 3: Gradual transition extraction TRECVID 2006
Our learning approach proved to be quite robust, since
the database involves training sets from different cameras,
with different compression formats or codings, from
different countries, situations, qualities and lengths. The
selected features remained relevant and stable for detecting
shot transitions in different contexts and environments.
Fig.4 shows the frame-recall versus the frame-precision.
We can notice that all markers that correspond to our
system’s run are almost at the same place. This ensures that
our best GT detector is also very accurate on the transitions
it finds. All our runs have more or less the same accuracy.
Our results are among the best ones. The density of all our
run markers, regardless of the changes made in the system,
proves the generalization capacity of our kernel-based
SVM approach.
5. CONCLUSION

In this paper we presented our learning-based system for
shot boundary detection, with a hierarchical classification
procedure composed of two stages. The first stage is
dedicated to AT detection, based on a machine learning
approach. Then, we seek GTs inside the shots delimited
by the ATs and fades detected in the first stage. Our
hierarchical learning system lets us reduce GT detection
to a two-class problem: fast motion or dissolve.

Figure 4: Gradual transition Frame-Precision / Frame-
Recall TRECVID 2006

Even though TRECVID-2002 (the GT training set) is of
poor quality, the performance of our system is good: we
are among the best teams. We improve on existing
algorithms and detect shot boundaries automatically,
without setting any threshold or parameter.
Our method is not only parameter-free, but also robust
and stable with respect to the training data set. It allows
merging many different and very complex types of video
features in an efficient way, avoiding any tuning or pre-
processing, and providing robust and stable results whatever
the training data.
ACKNOWLEDGMENTS

We thank NIST and the TRECVID organization for allowing us
to present our algorithm evaluation here. The authors are grateful
to the MUSCLE Network of Excellence, CNPq and CAPES for the
financial support of this work.
REFERENCES

[1] R. Lienhart, C. Kuhmunch, and W. Effelsberg, "On the detection
and recognition of television commercials," in IEEE Int. Conf. on
Multimedia Computing and Systems (ICMC '97), 1997, pp. 509–516.
[2] Rui Jesus, João Magalhães, Alexei Yavlinsky, and Stefan Rüger,
"Imperial College at TRECVID," in TREC Video Retrieval Evaluation
Online Proceedings, 2005.
[3] Ralph Ewerth, Christian Beringer, Tobias Kopp, Michael Niebergall,
Thilo Stadelmann, and Bernd Freisleben, "University of Marburg at
TRECVID 2005: Shot boundary detection and camera motion estimation
results," in TREC Video Retrieval Evaluation Online Proceedings, 2005.
[4] P.-H. Gosselin and M. Cord, "Precision-oriented active selection for
interactive image retrieval," in International Conference on Image
Processing (ICIP'06), October 2006, pp. 3127–3200.
[5] G. Cámara-Chávez, M. Cord, F. Precioso, S. Philipp-Foliguet, and
Arnaldo de A. Araújo, "Video segmentation by supervised learning,"
in 19th Brazilian Symposium on Computer Graphics and Image
Processing (SIBGRAPI'06), Oct. 2006, pp. 365–372.
[6] Jing-Un Won, Yun-Su Chung, In-Soo Kim, Jae-Gark Choi, and
Kil-Houm Park, "Correlation based video-dissolve detection," in
International Conference on Information Technology: Research and
Education (ITRE 2003), 2003, pp. 104–107.
[7] H. Yu, G. Bozdagi, and S. Harrington, "Feature-based hierarchical
video segmentation," in International Conference on Image
Processing (ICIP '97), 1997, vol. 2, pp. 498–501.
[8] R. Lienhart, "Reliable transition detection in videos: A survey and
practitioner's guide," IJIG, vol. 1, no. 3, pp. 469–486, 2001.
[9] Alan Hanjalic, "Shot boundary detection: Unraveled and resolved?,"
IEEE Trans. on Circuits and Systems for Video Technology, vol. 12,
no. 2, pp. 90–105, 2002.
[10] A.M. Alattar, "Detecting and compressing dissolve regions in video
sequences with a DVI multimedia image compression algorithm,"
IEEE International Symposium on Circuits and Systems (ISCAS),
vol. 1, pp. 13–16, 1993.
[11] Ba Tu Truong, Chitra Dorai, and Svetha Venkatesh, "New en-
hancements to cut, fade, and dissolve detection processes in video
segmentation," in MULTIMEDIA '00: Proceedings of the Eighth
ACM International Conference on Multimedia, 2000, pp. 219–227.
[12] Ralph Ewerth, Markus Mühling, Thilo Stadelmann, Ermir Qeli,
Björn Agel, Dominik Seiler, and Bernd Freisleben, "University of
Marburg at TRECVID 2006: Shot boundary detection and rushes task
results," in TREC Video Retrieval Evaluation Online Proceedings, 2006.
[13] Nithya Manickam, Neela Sawant, Aman Parnami, Srikanth
Lingamneni, and Sharat Chandran, "Indian Institute of Technology,
Bombay at TRECVID 2006," in TREC Video Retrieval Evaluation
Online Proceedings, 2006.
[14] M. Cooper, "Video segmentation combining similarity analysis
and classification," in Proc. of the 12th Annual ACM International
Conference on Multimedia (MULTIMEDIA '04), 2004, pp. 252–255.