Gradual Transition Detection Using Average Frame Similarity
Timo Volkmer S. M. M. Tahaghoghi Hugh E. Williams
School of Computer Science and Information Technology
RMIT University
GPO Box 2476V, Melbourne, Australia 3001
Segmenting digital video into its constituent basic se-
mantic entities, or shots, is an important step for ef-
fective management and retrieval of video data. Recent
automated techniques for detecting transitions between
shots are highly effective on abrupt transitions. How-
ever, automated detection of gradual transitions, and
the precise determination of the corresponding start
and end frames, remains problematic. In this pa-
per, we present a gradual transition detection approach
based on average frame similarity and adaptive thresholds. We report good detection results on the trec video track collections, particularly for dissolves and fades, and very high accuracy in identifying transition boundaries. Our technique is a valuable new tool for transition detection.
1 Introduction
The volume of video content produced daily is ex-
tremely large, and is likely to increase with the ever-
growing popularity of digital video consumer products.
For this content to be usable, it must be easily accessi-
ble. An important first step is to identify and annotate
sections of interest.
Historically, identification and annotation of video
have been performed by human annotators [30, 31].
This is tedious, expensive, and susceptible to error.
Moreover, it relies on the judgement of the human ob-
server. This is inherently subjective, and often incon-
sistent. Automatic indexing methods have the poten-
tial to avoid these problems.
Part of the analysis process is to identify and deter-
mine the boundaries of the basic semantic elements, the
shots [5]. The transition between adjacent shots can be abrupt (a cut) or gradual. The former category
describes a shot change where two consecutive frames
belong to different shots. The latter involves a progres-
sive changeover between two shots using video editing
techniques such as dissolves, fades, and wipes [6].
Gradual transitions are less frequent than cuts, but
are more complex. Lienhart [10] reports that together,
cuts, fades, and dissolves account for approximately
99% of all transitions in all types of video. In the
video collections we use, approximately 70% of the
annotated transitions are cuts, while 26.5% are fades
or dissolves. These collections are discussed in detail
later. It is likely that the proportion of rarer transition
effects will increase as powerful video editing tools en-
ter mainstream use. Nevertheless, fades and dissolves
remain the most common forms of gradual transition,
and their accurate identification is important for effec-
tive video retrieval.
Automatic cut detection approaches have been
shown to be highly effective [1, 18, 27]. Indeed, the
results are comparable to results obtained by human
observers [2]. However, gradual transitions are more
difficult to detect using automated systems [14, 22].
The often subtle changes between frames are hard to
discriminate from changes caused by normal scene ac-
tivity. In particular, camera motion and zoom opera-
tions often confuse detection algorithms.
In this paper we present our novel approach to grad-
ual transition detection in video. Our moving query
window technique [26, 27] caters for the fact that grad-
ual transitions usually extend over several frames by
evaluating the average inter-frame distance in a set of
frames, rather than examining only individual frames.
Moreover, we compute thresholds dynamically to increase effectiveness across different types of video content. Our results are promising across different test collections of the Text REtrieval Conference (trec) VIDeo Retrieval Evaluation (trecvid). We conclude that
Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW’04)
1063-6919/04 $ 20.00 IEEE
Figure 1. An equal number of frames on each side of the current frame (the pre-frames and the post-frames) constitute the moving query window. We can optionally specify a Demilitarised Zone (DMZ); frames falling within the DMZ for a particular current frame are omitted from the comparisons for that frame. The DMZ is explained in more detail in Section 5.
our approach constitutes a good basis for effective and
efficient video indexing.
2 Background
Popular shot boundary detection approaches rely on
the property that adjacent frames within one shot are
usually similar. By evaluating inter-frame differences
and searching for significant dissimilarities, transitions
between shots can be detected.
Digitised video is commonly stored compressed in one of the mpeg (Moving Picture Experts Group) formats. Many automatic techniques for determining shot boundaries use aspects of
the compressed data directly. Koprinska et al. [9] give
an overview of such methods. In this paper, we focus on
techniques that are applied to uncompressed footage.
The majority of approaches to shot boundary de-
tection compute inter-frame distances from the decom-
pressed video. Shot transitions can be detected by
monitoring this distance for significant changes. In
direct image comparison, changes between adjacent
frames are determined on a pixel-by-pixel basis. While
this approach shows generally good results [3], it is
computationally intensive, and also sensitive to camera
motion, camera zoom, and noise. Additional filtering
may be used to address some of these problems [18].
An alternative and more common approach is
to use histograms of frame feature data. Approaches
using global histograms [15, 28, 31] represent each
frame as a single vector, while those using localised
histograms [24] generate separate histograms for sub-
sections of each frame. Inter-frame distances are cal-
culated using often simple vector-distance measures to
compare corresponding histograms [31]. Localised his-
tograms, used in conjunction with additional features
such as edge-detection, perform well when applied in
the trecvid environment [1, 8].
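As an illustration, the inter-frame distance computation underlying these histogram approaches can be sketched as follows. This is a generic sketch rather than any cited system's implementation; the L1 (Manhattan) vector distance is an assumption, as the choice of distance measure varies between approaches.

```python
def l1_distance(h1, h2):
    """Manhattan distance between two histogram vectors."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def adjacent_frame_distances(histograms):
    """Distance between each pair of adjacent frame histograms;
    a large value suggests a possible shot boundary."""
    return [l1_distance(histograms[i], histograms[i + 1])
            for i in range(len(histograms) - 1)]
```

Localised variants apply the same computation to per-region histograms and combine the results.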
The twin-comparison algorithm first proposed by
Zhang et al. [31] is the basis of several proposed ap-
proaches for detection of gradual transitions [8, 25, 30].
Here, a low threshold is applied to detect groups of
frames that belong to a possible gradual transition.
The accumulative inter-frame distance is calculated
for these frames. A gradual transition is reported if
the accumulated inter-frame distance exceeds a second,
higher threshold.
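The twin-comparison scheme described above can be sketched as follows; the threshold values and the exact bookkeeping are assumptions for illustration.

```python
def twin_comparison(dists, t_low, t_high):
    """Sketch of the twin-comparison scheme.

    dists[i] is the inter-frame distance between frames i and i+1.
    A candidate transition opens when a distance exceeds the low
    threshold; it is reported as a gradual transition if the
    distances accumulated while above that threshold exceed the
    second, higher threshold."""
    transitions, start, acc = [], None, 0.0
    for i, d in enumerate(dists):
        if d >= t_low:
            if start is None:
                start, acc = i, 0.0
            acc += d
        else:
            if start is not None and acc >= t_high:
                transitions.append((start, i))  # candidate frame interval
            start, acc = None, 0.0
    if start is not None and acc >= t_high:  # candidate still open at end
        transitions.append((start, len(dists)))
    return transitions
```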
Several approaches have been proposed that are
based on the video production model. These employ
internal transition models based on the operation of
video editing systems. One or more features of the
video are monitored for patterns very similar to those
predicted by the internal models [11, 12, 13, 16]. These
approaches show promising results, but we are not
aware of any large-scale evaluation.
Some video segmentation systems also consider fea-
tures such as audio information or captions [7, 17].
These are usually designed for a particular task on
specific types of content, for example the detection of
commercial breaks in television footage.
We have previously proposed a technique for effec-
tive cut detection. The moving query window tech-
nique [26] performs comparisons on a set of frames
to detect abrupt transitions. As we proceed through
the video, we take each frame in turn as a pivot, and
consider a fixed-size window of frames encompassing
each pivot or current frame. This moving window is
comprised of two equal-sized sets of frames preceding
and following the current frame, as illustrated in Fig-
ure 1. All frames in the moving window are ranked
on their histogram similarity to the current frame; the
most similar frame is ranked highest. The number of
frames from the preceding half window that are ranked
in the top half is monitored while advancing through
the video. A cut is reported when this number exceeds
an upper threshold and falls below a lower threshold
within four consecutive frames.
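The ranking step at the heart of this cut-detection scheme can be sketched as below. The feature representation and distance function are left abstract, and the tie-breaking behaviour is an assumption.

```python
def pre_frames_in_top_half(pre, post, current, dist):
    """Count pre-frames ranked in the most-similar half of the moving
    query window relative to the current frame. Within one shot the
    count hovers near half the window; just after a cut, almost all
    pre-frames outrank the post-frames."""
    hws = len(pre)
    window = pre + post  # indices < hws are pre-frames
    ranked = sorted(range(2 * hws), key=lambda i: dist(window[i], current))
    return sum(1 for i in ranked[:hws] if i < hws)
```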
Figure 2. An example of a dissolve. Before the transition, the PrePostRatio is minimal. It rises to a
maximum as we proceed through the transition, before falling again afterwards.
The effectiveness of this approach for cut detection
has been demonstrated with the collections of the trec
Video Retrieval Evaluation [26, 27, 29]. However, with-
out modification, this scheme is less effective on grad-
ual transitions. For example, after training on the
trec-10 [23] test collection, we obtain cut detection
quality index values of 91% for blind runs on both the
trec-11 [22] and trec-12 [21] collections. However,
the corresponding values for detection of gradual tran-
sitions are 40% and 35% respectively.
Aiming for a simpler and more effective detection scheme for gradual transitions, we have developed
an alternative technique for application in our moving
query window.
3 Gradual Transition Detection with
the Moving Query Window
In this section, we propose a novel extension of our
moving query window approach that permits effective
detection of gradual transitions.
Our method of ranking frames in the query win-
dow works well for abrupt transitions because these
usually show significant inter-frame distances within a
few consecutive frames. Our observations have shown
that this is not usually the case for gradual transitions,
where inter-frame distances are typically smaller. This
results in our approach being far less effective in de-
tecting gradual transitions.
To address this problem, we propose that the frames
in the moving window not be examined individually.
Instead, we define two sets of frames, one each from
either side of the current frame; we refer to the frames
of these two sets as pre-frames and post-frames respec-
tively. For each of the two sets, we determine the dis-
tance between each frame in that set and the current
frame. We then average these intra-set distances, giv-
ing a final value that is the average distance between
that set and the current frame. This computation re-
sults in two values, one each for the pre- and post-frame
sets, and we use the ratio of these values, referred to as the PrePostRatio, to detect gradual transitions.
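This computation can be sketched as below. The orientation of the ratio is an assumption: taking the larger average over the smaller reproduces the worked example that follows (averages of 2 against 10 giving 5) on both sides of a transition.

```python
def pre_post_ratio(pre, post, current, dist):
    """Ratio of the average distances of the pre- and post-frame
    sets to the current frame; high values suggest a gradual
    transition in progress."""
    pre_avg = sum(dist(f, current) for f in pre) / len(pre)
    post_avg = sum(dist(f, current) for f in post) / len(post)
    lo, hi = sorted((pre_avg, post_avg))
    return hi / max(lo, 1e-9)  # guard against a zero average
```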
Consider an example: Figure 2 shows a dissolve be-
tween the neighbouring shots A and B. We assume that
the dissolve starts at frame 12 and ends with frame 22.
In the top row, frame 11 is the current frame; it be-
longs to shot A and is the last frame before the tran-
sition starts. Frames 1 to 10 form the pre-frames and
are also from shot A. They are similar to frame 11,
and therefore, their inter-frame distance to the current
frame is relatively low. For this example, let us assume
the average inter-frame distance of the pre-frames to
the current frame has the value 2.
Frames 12 to 21 (the post-frames in Figure 2) are
mostly dissolve frames, and therefore relatively dissim-
ilar to the current frame. Hence, the average inter-
frame distance for the post-frames is comparatively
high; let us assume it has the value 10. Given a pre-
frame average of 2 and a post-frame average of 10, the
PrePostRatio of the top row in Figure 2 is 10/2 = 5.
As the current frame moves further into the dissolve,
the ratio rises. This is illustrated in rows two and three
of Figure 2. In the fourth row, frame 22 is the current
frame and also the last frame of the transition. This
frame is likely to be very similar to frames 23 to 32
that belong to shot B, producing a low average inter-
frame distance. For our example, let us take this value
to be 2.
The pre-frames that are formed by frames 12 to 21
are the frames of the dissolve. As we have established
earlier, their average inter-frame distance is high, we
again assume a value of 10. We can now calculate the
PrePostRatio for row four as 10/2 = 5. Once the window exits the transition completely, the ratio usually
reverts to a relatively low value.
We have observed that this behaviour is common for
Figure 3. Plot of the PrePostRatio over a 200-frame interval of video. The dynamic threshold is calculated from a moving average of the PrePostRatio.
dissolves and fades. By monitoring the PrePostRatio
as we advance through a video clip, we can detect the
minima and maxima that accompany the start and end
of such transitions. Other effects, such as wipes and
page translations, are more complex, and often include
intense motion. Such transitions can also be detected
using our approach, but with reduced effectiveness.
We maintain a history of PrePostRatio values, and
calculate a moving average and standard deviation that
we use to compute a threshold. Detailed analysis of
the PrePostRatio curves indicates that application
of this threshold works well, and caters for varying
levels of the computed ratio across different types of
footage. However, it is sometimes necessary to adjust
the level of this threshold. For example, poor quality
and noisy footage produces many smaller peaks in the
PrePostRatio curve. To reduce false detections caused
by these peaks, we multiply the calculated threshold by
a factor we call the Upper Threshold Factor (utf).
Figure 3 shows the PrePostRatio curve for a 200-
frame segment of a video, along with the correspond-
ing moving average and threshold. A possible gradual
transition is indicated if the PrePostRatio crosses the
threshold. In this case, we determine the position of
the local minimum within the preceding frames. If this
is sufficiently small, a gradual transition is reported
over the interval between these two points.
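A minimal sketch of this adaptive thresholding follows. The exact combination of moving average and standard deviation is not given in the text, so mean plus one standard deviation, scaled by the utf, is an assumption, as is the history length.

```python
from statistics import mean, stdev

def adaptive_threshold(history, utf=1.0):
    """Dynamic threshold over a history of PrePostRatio values;
    the Upper Threshold Factor (utf) scales the result."""
    spread = stdev(history) if len(history) > 1 else 0.0
    return (mean(history) + spread) * utf

def crossings(ratios, window=50, utf=1.0):
    """Frames where the PrePostRatio first rises above its adaptive
    threshold, marking possible gradual transitions."""
    hits, above = [], False
    for i in range(window, len(ratios)):
        t = adaptive_threshold(ratios[i - window:i], utf)
        if ratios[i] > t and not above:
            hits.append(i)
        above = ratios[i] > t
    return hits
```

A full detector would then search backwards from each crossing for the local minimum that marks the transition start, as described above.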
The most important algorithm parameter influenc-
ing the results is the number of pre- and post-frames
on either side of the current frame, which we refer to as
the Half-Window Size (hws). The number of frames
in the entire query window is then 2 × hws, as shown in Figure 1. The current frame is not part of the query window.
We discuss the effect of the parameters on system
performance further in Section 5 along with detailed
results. In the next section we discuss the environment
used to train and test our algorithm.
4 Evaluation Environment
We apply the common Information Retrieval mea-
sures of recall and precision to evaluate effective-
ness [20]. Recall is the fraction of all known transi-
tions that are correctly detected, while precision is the
fraction of reported transitions that match the known
transitions recorded in the reference data.
Additional effectiveness measures designed specifi-
cally for gradual transitions are Frame Recall (fr) and Frame Precision (fp) [22]. These are defined as follows:

  FR = (Frames correctly reported in detected transition) /
       (Frames in reference data for detected transition)

  FP = (Frames correctly reported in detected transition) /
       (Frames reported in detected transition)
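For a single detected transition, these two measures can be computed from the reported and reference frame intervals, sketched here with inclusive (start, end) intervals as an assumed representation:

```python
def frame_recall_precision(reported, reference):
    """Frame recall and frame precision for one detected transition,
    with transitions given as inclusive (start, end) frame intervals."""
    rs, re = reported
    fs, fe = reference
    overlap = max(0, min(re, fe) - max(rs, fs) + 1)
    fr = overlap / (fe - fs + 1)   # fraction of reference frames found
    fp = overlap / (re - rs + 1)   # fraction of reported frames correct
    return fr, fp
```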
We also calculate a quality index (Q) that penalises false negatives more heavily than false positives. False detections are regarded as less problematic, since they can be filtered out in later processing steps [19]. The index is computed from the number of correctly reported transitions, the number of false detections, and the number of transitions in the reference data.
We developed our algorithm using the shot bound-
ary detection task subset of the trec-10 video col-
lection. Detailed results of blind runs of trec-12,
including comparison with other approaches, appear
Collection   Clips   Frames    Abrupt   Gradual
trec-10      18      594 179   2 066    1 037
trec-11      18      545 068   1 466      591
trec-12      13      596 054   2 364    1 012

Table 1. Details of our test collections.
elsewhere [29]. We have since improved our technique
through further training on the trec-11 and trec-12
test sets. In this paper, we discuss results obtained
with the current approach. Details of all test sets that
we use are shown in Table 1.
The trec-10 and trec-11 collections contain a
variety of documentary and educational cinema and
television footage, some more than fifty years old.
The trec-11 collection also includes amateur video.
The collection used in trec-12 comprises more recent
footage, mostly television news and entertainment pro-
gramming from the period 1998-2002. All three col-
lections contain a large number of annotated abrupt
transitions which we do not consider in this paper, but
have explored in detail elsewhere [27].
The reference data for the collections categorises
gradual transitions into three classes:
Dissolve: One shot is replaced with another by grad-
ually dimming the first shot and gradually increas-
ing the brightness of the second.
Fade-out or fade-in: A fade-in or fade-out can be
considered to be a special case of a dissolve, with
the first or second shot consisting of frames of only
one colour, usually black. A common transition is
a fade-out of one shot, followed by a fade-in of the next.
Other: This category comprises all other transition ef-
fects that stretch over more than two frames, such
as wipes and pixelation effects, and also artifacts
of imperfect splicing of the original cine film.
Many gradual transitions extend only over a handful
of frames, and are effectively observed as cuts by hu-
man viewers at the standard replay speed of 24 or 30
frames per second. The two known dissolves marked in
Figure 3 are examples of such short transitions, with
an effective length of only three frames each. In ac-
cordance with the trecvid guidelines that a cut may
stretch over up to six consecutive frames [23], we con-
sider such short transitions to be abrupt, rather than
gradual, transitions.
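This reclassification rule can be stated compactly; the six-frame limit follows the trecvid guideline cited above, while the function name is illustrative.

```python
def classify_transition(start, end, cut_limit=6):
    """Label a transition interval (inclusive frame numbers) as a
    cut or a gradual transition, treating transitions of up to six
    consecutive frames as abrupt."""
    return "cut" if end - start + 1 <= cut_limit else "gradual"
```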
We have experimented with one-dimensional and
three-dimensional histograms using the rgb and hsv
colour spaces [24], and also with a feature based on
the Daubechies wavelet coefficients of the transformed
frame data [4]. We have found that gradual transitions
are best detected with one-dimensional hsv colour histograms using 32 bins per colour component. All results reported in this paper are for this feature representation.
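A sketch of this feature extraction follows, assuming HSV components already normalised to [0, 1); the paper does not describe its quantisation in detail, so the binning shown here is illustrative.

```python
def hsv_histogram(pixels, bins=32):
    """One-dimensional colour histogram: a separate bins-bin
    histogram per HSV component, concatenated (96 values for
    bins=32) and normalised by the pixel count.

    pixels: iterable of (h, s, v) tuples, each component in [0, 1)."""
    hist = [0.0] * (3 * bins)
    count = 0
    for h, s, v in pixels:
        for c, value in enumerate((h, s, v)):
            idx = min(int(value * bins), bins - 1)  # clamp value close to 1.0
            hist[c * bins + idx] += 1.0
        count += 1
    return [x / count for x in hist] if count else hist
```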
5 Results

Table 2 shows results for detecting gradual transitions for each of the three trec collections, using algorithm parameters that produce the best performance.
As with most applications, it is generally possible to
trade precision for higher recall. These results show
that the technique performs best for the trec-10 test
collection. We observe that for the two newer trec
collections, and especially for trec-12, recall drops
considerably. This is caused in large part by the ap-
pearance of transitions in rapid succession, which our
algorithm tends to report as a single transition.
The trec-11 collection also contains a large pro-
portion of older footage with low quality, and is the
most challenging for our system. The high number of
false positives has a negative effect on precision and
quality. We cater for this by raising the threshold level
(utf) by 20%, and by applying a dmz of one frame on
either side of the current frame to reduce the effects
of low video quality, camera motion, and compression
artifacts. The demilitarised zone excludes the frames immediately adjacent to the current frame from the pre- and post-frame sets, reducing sensitivity to noise in lower-quality footage. This produces
the best compromise between recall and precision for
this collection. Although recall on the trec-12 col-
lection is rather low, precision remains reasonable, and
at 31.9%, the rate of false positives is not unacceptable.
More detailed results are provided in Table 3. Since
our system does not yet distinguish between transi-
tion types, we cannot calculate the individual insertion
rates. As expected, our approach performs better for
dissolves and fades than for other, less common, grad-
ual transition types.
Table 4 shows the frame recall and frame precision
obtained for each collection. We observe very good re-
sults for all types of gradual transitions. Frame recall
for fades in the trec-10 and the trec-11 collections is
relatively low. We find that in these collections, the av-
erage length of fades is 80 and 89 frames respectively,
while the corresponding average in the trec-12 col-
lection is 29 frames. Our implementation is currently
limited to detection of gradual transitions spanning less
than 60 frames.
The values used for the algorithm parameters play
an important part in determining effectiveness. Fig-
ure 4 illustrates the effect of varying the half-window
Collection  hws  dmz  utf  Recall  Precision  Quality  Deletions  Insertions
trec-10     18   1    1.0  83.5%   75.0%      72.4%    16.4%       33.4%
trec-10     18   1    1.2  76.4%   80.7%      69.4%    23.5%       21.1%
trec-11     18   1    1.0  81.7%   56.8%      33.8%    18.2%      144.0%
trec-11     18   1    1.2  64.5%   77.0%      53.4%    22.9%       70.8%
trec-12     14   0    1.0  65.9%   76.4%      56.0%    34.0%       29.6%
trec-12     14   0    1.2  58.9%   82.5%      53.7%    41.0%       15.8%

Table 2. Results of the best runs for gradual transitions for the trec video collections, and the parameters used in these runs.
            Reference transitions      Recall                    Deletions                 Insertions
Collection  Dissolve  Fade  Other      Dissolve  Fade   Other    Dissolve  Fade   Other    All
trec-10     942       54    41         86.2%     68.5%  70.7%    13.8%     42.6%  24.4%    35.8%
trec-11     510       63    18         78.6%     76.2%  61.1%    21.6%     31.8%  38.9%    67.2%
trec-12     684       116   212        76.9%     47.4%  44.8%    25.1%     52.6%  55.7%    31.9%

Table 3. Results grouped by transition type for the best run on each test collection. We observe much better performance for dissolves and fades than for other types of gradual transition.
size (hws) on recall, precision, and quality for the
trec-12 collection. With a larger hws, precision in-
creases but recall drops considerably for half-window
sizes larger than 16 frames. For this collection, opti-
mum quality is achieved with a half-window size of 14.
The upper threshold factor (utf) also affects the
trade-off between recall and precision, with quality
peaking when utf=1. Figure 5 shows that while preci-
sion improves with a larger utf, there is an associated
drop in recall. The best parameter values over all three
collections are hws=14, dmz=0, and utf=1.
Our algorithm performs well relative to compara-
ble systems. In the trec-12 shot boundary detection
task, an earlier implementation was among the better-
performing systems, and obtained the highest preci-
sion of all participants for gradual transitions [29]. It
achieved average recall, above-average frame precision,
and the best results for frame recall. The results pre-
sented here reflect performance after the trec-12 data
was included in the training set, and indicate perfor-
mance in the top four systems of trec-12.
6 Conclusion
Gradual transitions comprise a significant propor-
tion of all shot transitions. The relatively small inter-
frame differences during gradual transitions are often
indistinguishable from normal levels of inter-frame dis-
tance. This makes gradual transitions much harder to
detect than cuts. However, effective identification of
gradual transitions is important for complete video in-
dexing and retrieval.
In this paper, we have proposed a novel approach
to gradual transition detection, based on our moving
query window technique. This monitors the accumu-
lated inter-frame distance of frame collections for de-
tection of gradual shot changes. We have shown that
it is effective on large video collections, with recall and
precision of approximately 83% and 75% respectively.
A particular strength of our approach is the accurate
detection of the start and end of gradual transitions.
We plan to address the high false detection rate
through the use of localised histograms and an edge-
tracking feature. We also intend to explore automatic
parameter selection to allow the system to automati-
cally adapt to different types of footage.
Despite its relative simplicity, our technique shows
good results when tested on a large video collection
comprising a variety of content, and has the potential
to be the basis for more effective video segmentation.

References

[1] …gar, A. Jaimes, C. Lang, C.-Y. Lin, A. Natsev,
M. Naphade, C. Neti, H. J. Nock, H. H. Permuter,
R. Singh, J. R. Smith, S. Srinivasan, B. L. Tseng,
T. V. Ashwin, and D. Zhang. IBM Research TREC-
2002 video retrieval system. In E. M. Voorhees and
L. P. Buckland, editors, NIST Special Publication 500-
251: Proceedings of the Eleventh Text REtrieval Con-
ference (TREC 2002), pages 289–298, Gaithersburg,
MD, USA, 19–22 November 2002.
            Dissolve               Fade                   Other
Collection  F-Recall  F-Precision  F-Recall  F-Precision  F-Recall  F-Precision
trec-10     94.5%     81.1%        44.5%     83.1%        63.2%     78.9%
trec-11     94.6%     83.2%        49.0%     88.9%        57.1%     73.1%
trec-12     96.7%     76.6%        88.0%     83.8%        81.5%     87.6%

Table 4. Frame recall and frame precision grouped by transition type for the best runs on each test collection.
Figure 4. Variation of recall, precision and quality with the HWS for the TREC-12 test collection. The
best recall/precision trade-off (maximum quality) is seen for HWS=14.
[2] P. Aigrain, H. J. Zhang, and D. Petkovic. Content-
based representation and retrieval of visual media: A
state-of-the-art review. Multimedia Tools and Appli-
cations, 3(3):179–202, September 1996.
[3] J. S. Boreczky and L. A. Rowe. Comparison of video
shot boundary detection techniques. Journal of Elec-
tronic Imaging, 5(2):122–128, April 1996.
[4] I. Daubechies. Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1992.
[5] A. Del Bimbo. Visual Information Retrieval. Morgan
Kaufmann Publishers Inc., San Francisco, CA, USA, 1999.
[6] A. Hampapur, R. Jain, and T. Weymouth. Digital
video segmentation. In Proceedings of the ACM In-
ternational Conference on Multimedia, pages 357–364,
San Francisco, CA, USA, 15–20 October 1994.
[7] A. G. Hauptmann and M. J. Witbrock. Story segmen-
tation and detection of commercials in broadcast news
video. In Proceedings of the IEEE International Fo-
rum on Research and Technology Advances in Digital
Libraries (ADL’98), pages 168–179, Santa Barbara,
CA, USA, 22–24 April 1998.
[8] D. Heesch, M. J. Pickering, S. Rüger, and A. Yavlinsky. Video retrieval within a browsing framework using
key frames. In E. M. Voorhees and L. P. Buckland, edi-
tors, NIST Special Publication 500-252: Proceedings of
the Twelfth Text REtrieval Conference (TREC 2003),
Gaithersburg, MD, USA, 18–21 November 2003. To appear.
[9] I. Koprinska and S. Carrato. Temporal video segmen-
tation: A survey. Signal Processing: Image
Communication, 16(5):477–500, 2001.
[10] R. W. Lienhart. Comparison of automatic shot bound-
ary detection algorithms. Proceedings of the SPIE;
Storage and Retrieval for Image and Video Databases
VII, 3656:290–301, December 1998.
[11] R. W. Lienhart. Reliable Dissolve Detection. Pro-
ceedings of the SPIE; Storage and Retrieval for Media
Databases, 4315:545–552, December 2001.
[12] R. W. Lienhart. Reliable transition detection in
videos: A survey and practitioner’s guide. In-
ternational Journal of Image and Graphics (IJIG),
1(3):469–486, July 2001.
[13] X. Liu and T. Chen. Shot boundary detection using
temporal statistics modeling. In Proceedings of the
IEEE International Conference on Acoustics Speech
and Signal Processing, volume 4, pages 3389–3392, Or-
lando, FL, USA, 13–17 May 2002.
[14] S. Marchand-Maillet. Content-based video retrieval:
An overview. Technical Report 00.06, CUI - University
of Geneva, Geneva, Switzerland, 2000.
[15] A. Nagasaka and Y. Tanaka. Automatic Video In-
dexing and Full-Video Search for Object Appearances.
Visual Database Systems, 2:113–127, 1992.
[16] J. Nam and A. H. Tewfik. Dissolve transition detec-
tion using B-Splines interpolation. In A. Del Bimbo,
editor, IEEE International Conference on Multimedia
and Expo (ICME), volume 3, pages 1349–1352, New
York, NY, USA, 30 July – 2 August 2000.
Figure 5. Variation of recall, precision and quality with different values of UTF for the TREC-12 test
collection. The highest quality index is observed at UTF=1.
[17] S. Pfeiffer, R. W. Lienhart, and W. Effelsberg. Scene
determination based on video and audio features.
Technical Report TR-98-020, University of Mannheim,
Germany, January 1998.
[18] G. M. Quénot, D. Moraru, and L. Besacier. CLIPS
at TRECVID: Shot boundary detection and feature
detection. In E. M. Voorhees and L. P. Buckland, edi-
tors, NIST Special Publication 500-252: Proceedings of
the Twelfth Text REtrieval Conference (TREC 2003),
Gaithersburg, MD, USA, 18–21 November 2003. To appear.
[19] G. M. Quénot and P. Mulhem. Two systems for tem-
poral video segmentation. In Proceedings of the Eu-
ropean Workshop on Content Based Multimedia In-
dexing (CBMI’99), pages 187–194, Toulouse, France,
25–27 October 1999.
[20] R. Ruiloba, P. Joly, S. Marchand-Maillet, and G. M.
Quénot. Towards a standard protocol for the evalu-
ation of video-to-shots segmentation algorithms. In
Proceedings of the European Workshop on Content
Based Multimedia Indexing (CBMI’99), pages 41–48,
Toulouse, France, 25–27 October 1999.
[21] A. F. Smeaton, W. Kraaij, and P. Over. TRECVID-2003: An introduction. In E. M. Voorhees and
L. P. Buckland, editors, NIST Special Publication 500-
252: Proceedings of the Twelfth Text REtrieval Con-
ference (TREC 2003), Gaithersburg, MD, USA, 18–21
November 2003. To appear.
[22] A. F. Smeaton and P. Over. The TREC-2002 video track report. In E. M. Voorhees and L. P. Buckland, editors, NIST Special Publication 500-251: Proceedings of the Eleventh Text REtrieval Conference (TREC 2002), pages 69–85, Gaithersburg, MD, USA, 19–22 November 2002.
[23] A. F. Smeaton, P. Over, and R. Taban. The TREC-
2001 video track report. In E. M. Voorhees and D. K.
Harman, editors, NIST Special Publication 500-250:
Proceedings of the Tenth Text REtrieval Conference
(TREC 2001), pages 52–60, Gaithersburg, MD, USA,
13–16 November 2001.
[24] J. R. Smith. Content-based access of image and video
libraries. Encyclopedia of Library and Information
Science, 1:40–61, 2001.
[25] J. Sun, S. Cui, X. Xu, and Y. Luo. Automatic video
shot detection and characterization for content-based
video retrieval. Proceedings of the SPIE; Visualization
and Optimisation Techniques, 4553:313–320, Septem-
ber 2001.
[26] S. M. M. Tahaghoghi, J. A. Thom, and H. E. Williams.
Shot boundary detection using the moving query win-
dow. In E. M. Voorhees and L. P. Buckland, edi-
tors, NIST Special Publication 500-251: Proceedings
of the Eleventh Text REtrieval Conference (TREC
2002), pages 529–538, Gaithersburg, MD, USA, 19–
22 November 2002.
[27] S. M. M. Tahaghoghi, J. A. Thom, H. E. Williams, and
T. Volkmer. Video cut detection using frame windows.
In submission.
[28] B. T. Truong, C. Dorai, and S. Venkatesh. New en-
hancements to cut, fade, and dissolve detection pro-
cesses in video segmentation. In R. Price, editor, Pro-
ceedings of the ACM International Conference on Mul-
timedia 2000, pages 219–227, Los Angeles, CA, USA,
30 October – 4 November 2000.
[29] T. Volkmer, S. M. M. Tahaghoghi, H. E. Williams, and J. A. Thom. The moving query window for shot boundary detection at TREC-12. In E. M. Voorhees and L. P. Buckland, editors, NIST Special Publication 500-252: Proceedings of the Twelfth Text REtrieval Conference (TREC 2003), Gaithersburg, MD, USA, 18–21 November 2003. To appear.
[30] J. Yu and M. D. Srinath. An efficient method for
scene cut detection. Pattern Recognition Letters,
22(13):1379–1391, January 2001.
[31] H. J. Zhang, A. Kankanhalli, and S. W. Smoliar. Au-
tomatic partitioning of full-motion video. Multimedia
Systems Journal, 1(1):10–28, June 1993.