Gradual Transition Detection Using Average Frame Similarity
Timo Volkmer S. M. M. Tahaghoghi Hugh E. Williams
School of Computer Science and Information Technology
RMIT University
GPO Box 2476V, Melbourne, Australia 3001
Segmenting digital video into its constituent basic se-
mantic entities, or shots, is an important step for ef-
fective management and retrieval of video data. Recent
automated techniques for detecting transitions between
shots are highly effective on abrupt transitions. How-
ever, automated detection of gradual transitions, and
the precise determination of the corresponding start
and end frames, remains problematic. In this pa-
per, we present a gradual transition detection approach
based on average frame similarity and adaptive thresholds. We report good detection results on the trec video track collections, particularly for dissolves and fades, and very high accuracy in identifying transition boundaries. Our technique is a valuable new tool for transition detection.
1 Introduction
The volume of video content produced daily is ex-
tremely large, and is likely to increase with the ever-
growing popularity of digital video consumer products.
For this content to be usable, it must be easily accessi-
ble. An important first step is to identify and annotate
sections of interest.
Historically, identification and annotation of video
have been performed by human annotators [30, 31].
This is tedious, expensive, and susceptible to error.
Moreover, it relies on the judgement of the human ob-
server. This is inherently subjective, and often incon-
sistent. Automatic indexing methods have the poten-
tial to avoid these problems.
Part of the analysis process is to identify and deter-
mine the boundaries of the basic semantic elements, the
shots [5]. The transition between adjacent shots can be abrupt (a cut) or gradual. The former category
describes a shot change where two consecutive frames
belong to different shots. The latter involves a progres-
sive changeover between two shots using video editing
techniques such as dissolves, fades, and wipes [6].
Gradual transitions are less frequent than cuts, but
are more complex. Lienhart [10] reports that together,
cuts, fades, and dissolves account for approximately
99% of all transitions in all types of video. In the
video collections we use, approximately 70% of the
annotated transitions are cuts, while 26.5% are fades
or dissolves. These collections are discussed in detail
later. It is likely that the proportion of rarer transition
effects will increase as powerful video editing tools en-
ter mainstream use. Nevertheless, fades and dissolves
remain the most common forms of gradual transition,
and their accurate identification is important for effec-
tive video retrieval.
Automatic cut detection approaches have been
shown to be highly effective [1, 18, 27]. Indeed, the
results are comparable to results obtained by human
observers [2]. However, gradual transitions are more
difficult to detect using automated systems [14, 22].
The often subtle changes between frames are hard to
discriminate from changes caused by normal scene ac-
tivity. In particular, camera motion and zoom opera-
tions often confuse detection algorithms.
In this paper we present our novel approach to grad-
ual transition detection in video. Our moving query
window technique [26, 27] caters for the fact that grad-
ual transitions usually extend over several frames by
evaluating the average inter-frame distance in a set of
frames, rather than examining only individual frames.
Moreover, we compute thresholds dynamically to increase effectiveness across different types of video content. Our results are promising across different test collections of the Text REtrieval Conference (trec) VIDeo Retrieval Evaluation (trecvid). We conclude that
Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW’04)
1063-6919/04 $ 20.00 IEEE
Figure 1. An equal number of frames on each side of the current frame (the pre-frames and the post-frames) constitute the moving query window. We can optionally specify a Demilitarised Zone (DMZ); frames falling within the DMZ for a particular current frame are omitted from the comparisons for that frame. The DMZ is explained in more detail in Section 5.
our approach constitutes a good basis for effective and
efficient video indexing.
2 Background
Popular shot boundary detection approaches rely on
the property that adjacent frames within one shot are
usually similar. By evaluating inter-frame differences
and searching for significant dissimilarities, transitions
between shots can be detected.
Digitised video is commonly stored compressed in one of the mpeg (Moving Picture Experts Group) formats. Many automatic techniques for determining shot boundaries use aspects of
the compressed data directly. Koprinska et al. [9] give
an overview of such methods. In this paper, we focus on
techniques that are applied to uncompressed footage.
The majority of approaches to shot boundary de-
tection compute inter-frame distances from the decom-
pressed video. Shot transitions can be detected by
monitoring this distance for significant changes. In
direct image comparison, changes between adjacent
frames are determined on a pixel-by-pixel basis. While
this approach shows generally good results [3], it is
computationally intensive, and also sensitive to camera
motion, camera zoom, and noise. Additional filtering
may be used to address some of these problems [18].
An alternative and more common approach is
to use histograms of frame feature data. Approaches
using global histograms [15, 28, 31] represent each
frame as a single vector, while those using localised
histograms [24] generate separate histograms for sub-
sections of each frame. Inter-frame distances are cal-
culated using often simple vector-distance measures to
compare corresponding histograms [31]. Localised his-
tograms, used in conjunction with additional features
such as edge-detection, perform well when applied in
the trecvid environment [1, 8].
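As an illustration, the inter-frame distance computation underlying these histogram approaches can be sketched as follows. This is a generic sketch rather than any cited system's implementation; the L1 (Manhattan) vector distance is an assumption, as the choice of distance measure varies between approaches.

```python
def l1_distance(h1, h2):
    """Manhattan distance between two histogram vectors."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def adjacent_frame_distances(histograms):
    """Distance between each pair of adjacent frame histograms;
    a large value suggests a possible shot boundary."""
    return [l1_distance(histograms[i], histograms[i + 1])
            for i in range(len(histograms) - 1)]
```

Localised variants apply the same computation to per-region histograms and combine the results.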
The twin-comparison algorithm first proposed by
Zhang et al. [31] is the basis of several proposed ap-
proaches for detection of gradual transitions [8, 25, 30].
Here, a low threshold is applied to detect groups of
frames that belong to a possible gradual transition.
The accumulative inter-frame distance is calculated
for these frames. A gradual transition is reported if
the accumulated inter-frame distance exceeds a second,
higher threshold.
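The twin-comparison scheme described above can be sketched as follows; the threshold values and the exact bookkeeping are assumptions for illustration.

```python
def twin_comparison(dists, t_low, t_high):
    """Sketch of the twin-comparison scheme.

    dists[i] is the inter-frame distance between frames i and i+1.
    A candidate transition opens when a distance exceeds the low
    threshold; it is reported as a gradual transition if the
    distances accumulated while above that threshold exceed the
    second, higher threshold."""
    transitions, start, acc = [], None, 0.0
    for i, d in enumerate(dists):
        if d >= t_low:
            if start is None:
                start, acc = i, 0.0
            acc += d
        else:
            if start is not None and acc >= t_high:
                transitions.append((start, i))  # candidate frame interval
            start, acc = None, 0.0
    if start is not None and acc >= t_high:  # candidate still open at end
        transitions.append((start, len(dists)))
    return transitions
```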
Several approaches have been proposed that are
based on the video production model. These employ
internal transition models based on the operation of
video editing systems. One or more features of the
video are monitored for patterns very similar to those
predicted by the internal models [11, 12, 13, 16]. These
approaches show promising results, but we are not
aware of any large-scale evaluation.
Some video segmentation systems also consider fea-
tures such as audio information or captions [7, 17].
These are usually designed for a particular task on
specific types of content, for example the detection of
commercial breaks in television footage.
We have previously proposed a technique for effec-
tive cut detection. The moving query window tech-
nique [26] performs comparisons on a set of frames
to detect abrupt transitions. As we proceed through
the video, we take each frame in turn as a pivot, and
consider a fixed-size window of frames encompassing
each pivot or current frame. This moving window is
comprised of two equal-sized sets of frames preceding
and following the current frame, as illustrated in Fig-
ure 1. All frames in the moving window are ranked
on their histogram similarity to the current frame; the
most similar frame is ranked highest. The number of
frames from the preceding half window that are ranked
in the top half is monitored while advancing through
the video. A cut is reported when this number exceeds
an upper threshold and falls below a lower threshold
within four consecutive frames.
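The ranking step at the heart of this cut-detection scheme can be sketched as below. The feature representation and distance function are left abstract, and the tie-breaking behaviour is an assumption.

```python
def pre_frames_in_top_half(pre, post, current, dist):
    """Count pre-frames ranked in the most-similar half of the moving
    query window relative to the current frame. Within one shot the
    count hovers near half the window; just after a cut, almost all
    pre-frames outrank the post-frames."""
    hws = len(pre)
    window = pre + post  # indices < hws are pre-frames
    ranked = sorted(range(2 * hws), key=lambda i: dist(window[i], current))
    return sum(1 for i in ranked[:hws] if i < hws)
```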
Figure 2. An example of a dissolve. Before the transition, the PrePostRatio is minimal. It rises to a
maximum as we proceed through the transition, before falling again afterwards.
The effectiveness of this approach for cut detection
has been demonstrated with the collections of the trec
Video Retrieval Evaluation [26, 27, 29]. However, with-
out modification, this scheme is less effective on grad-
ual transitions. For example, after training on the
trec-10 [23] test collection, we obtain cut detection
quality index values of 91% for blind runs on both the
trec-11 [22] and trec-12 [21] collections. However,
the corresponding values for detection of gradual tran-
sitions are 40% and 35% respectively.
Aiming for a simpler and more effective detection scheme for gradual transitions, we have developed
an alternative technique for application in our moving
query window.
3 Gradual Transition Detection with
the Moving Query Window
In this section, we propose a novel extension of our
moving query window approach that permits effective
detection of gradual transitions.
Our method of ranking frames in the query win-
dow works well for abrupt transitions because these
usually show significant inter-frame distances within a
few consecutive frames. Our observations have shown
that this is not usually the case for gradual transitions,
where inter-frame distances are typically smaller. This
results in our approach being far less effective in de-
tecting gradual transitions.
To address this problem, we propose that the frames
in the moving window not be examined individually.
Instead, we define two sets of frames, one each from
either side of the current frame; we refer to the frames
of these two sets as pre-frames and post-frames respec-
tively. For each of the two sets, we determine the dis-
tance between each frame in that set and the current
frame. We then average these intra-set distances, giv-
ing a final value that is the average distance between
that set and the current frame. This computation re-
sults in two values, one each for the pre- and post-frame
sets, and we use the ratio of these values, referred to as the PrePostRatio, to detect gradual transitions.
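This computation can be sketched as below. The orientation of the ratio is an assumption: taking the larger average over the smaller reproduces the worked example that follows (averages of 2 against 10 giving 5) on both sides of a transition.

```python
def pre_post_ratio(pre, post, current, dist):
    """Ratio of the average distances of the pre- and post-frame
    sets to the current frame; high values suggest a gradual
    transition in progress."""
    pre_avg = sum(dist(f, current) for f in pre) / len(pre)
    post_avg = sum(dist(f, current) for f in post) / len(post)
    lo, hi = sorted((pre_avg, post_avg))
    return hi / max(lo, 1e-9)  # guard against a zero average
```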
Consider an example: Figure 2 shows a dissolve be-
tween the neighbouring shots A and B. We assume that
the dissolve starts at frame 12 and ends with frame 22.
In the top row, frame 11 is the current frame; it be-
longs to shot A and is the last frame before the tran-
sition starts. Frames 1 to 10 form the pre-frames and
are also from shot A. They are similar to frame 11,
and therefore, their inter-frame distance to the current
frame is relatively low. For this example, let us assume
the average inter-frame distance of the pre-frames to
the current frame has the value 2.
Frames 12 to 21 (the post-frames in Figure 2) are
mostly dissolve frames, and therefore relatively dissim-
ilar to the current frame. Hence, the average inter-
frame distance for the post-frames is comparatively
high; let us assume it has the value 10. Given a pre-
frame average of 2 and a post-frame average of 10, the
PrePostRatio of the top row in Figure 2 is 10/2 = 5.
As the current frame moves further into the dissolve,
the ratio rises. This is illustrated in rows two and three
of Figure 2. In the fourth row, frame 22 is the current
frame and also the last frame of the transition. This
frame is likely to be very similar to frames 23 to 32
that belong to shot B, producing a low average inter-
frame distance. For our example, let us take this value
to be 2.
The pre-frames that are formed by frames 12 to 21
are the frames of the dissolve. As we have established
earlier, their average inter-frame distance is high, we
again assume a value of 10. We can now calculate the
PrePostRatio for row four as 10/2 = 5. Once the window exits the transition completely, the ratio usually
reverts to a relatively low value.
We have observed that this behaviour is common for
Figure 3. Plot of the PrePostRatio over a 200-frame interval of video. The dynamic threshold is calculated from a moving average of the PrePostRatio.
dissolves and fades. By monitoring the PrePostRatio
as we advance through a video clip, we can detect the
minima and maxima that accompany the start and end
of such transitions. Other effects, such as wipes and
page translations, are more complex, and often include
intense motion. Such transitions can also be detected
using our approach, but with reduced effectiveness.
We maintain a history of PrePostRatio values, and
calculate a moving average and standard deviation that
we use to compute a threshold. Detailed analysis of
the PrePostRatio curves indicates that application
of this threshold works well, and caters for varying
levels of the computed ratio across different types of
footage. However, it is sometimes necessary to adjust
the level of this threshold. For example, poor quality
and noisy footage produces many smaller peaks in the
PrePostRatio curve. To reduce false detections caused
by these peaks, we multiply the calculated threshold by
a factor we call the Upper Threshold Factor (utf).
Figure 3 shows the PrePostRatio curve for a 200-
frame segment of a video, along with the correspond-
ing moving average and threshold. A possible gradual
transition is indicated if the PrePostRatio crosses the
threshold. In this case, we determine the position of
the local minimum within the preceding frames. If this
is sufficiently small, a gradual transition is reported
over the interval between these two points.
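A minimal sketch of this adaptive thresholding follows. The exact combination of moving average and standard deviation is not given in the text, so mean plus one standard deviation, scaled by the utf, is an assumption, as is the history length.

```python
from statistics import mean, stdev

def adaptive_threshold(history, utf=1.0):
    """Dynamic threshold over a history of PrePostRatio values;
    the Upper Threshold Factor (utf) scales the result."""
    spread = stdev(history) if len(history) > 1 else 0.0
    return (mean(history) + spread) * utf

def crossings(ratios, window=50, utf=1.0):
    """Frames where the PrePostRatio first rises above its adaptive
    threshold, marking possible gradual transitions."""
    hits, above = [], False
    for i in range(window, len(ratios)):
        t = adaptive_threshold(ratios[i - window:i], utf)
        if ratios[i] > t and not above:
            hits.append(i)
        above = ratios[i] > t
    return hits
```

A full detector would then search backwards from each crossing for the local minimum that marks the transition start, as described above.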
The most important algorithm parameter influenc-
ing the results is the number of pre- and post-frames
on either side of the current frame, which we refer to as
the Half-Window Size (hws). The number of frames
in the entire query window is then 2 × hws, as shown in Figure 1. The current frame is not part of the query window.
We discuss the effect of the parameters on system
performance further in Section 5 along with detailed
results. In the next section we discuss the environment
used to train and test our algorithm.
4 Evaluation Environment
We apply the common Information Retrieval mea-
sures of recall and precision to evaluate effective-
ness [20]. Recall is the fraction of all known transi-
tions that are correctly detected, while precision is the
fraction of reported transitions that match the known
transitions recorded in the reference data.
Additional effectiveness measures designed specifi-
cally for gradual transitions are Frame Recall (fr) and Frame Precision (fp) [22]. These are defined as follows:

  FR = (Frames correctly reported in detected transition) /
       (Frames in reference data for detected transition)

  FP = (Frames correctly reported in detected transition) /
       (Frames reported in detected transition)
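For a single detected transition, these two measures can be computed from the reported and reference frame intervals, sketched here with inclusive (start, end) intervals as an assumed representation:

```python
def frame_recall_precision(reported, reference):
    """Frame recall and frame precision for one detected transition,
    with transitions given as inclusive (start, end) frame intervals."""
    rs, re = reported
    fs, fe = reference
    overlap = max(0, min(re, fe) - max(rs, fs) + 1)
    fr = overlap / (fe - fs + 1)   # fraction of reference frames found
    fp = overlap / (re - rs + 1)   # fraction of reported frames correct
    return fr, fp
```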
We also calculate a quality index (Q) that penalises false negatives more heavily than false positives. False detections are regarded as less problematic, since they can be filtered out in later processing steps [19]. The index is computed from the number of correctly reported transitions, the number of false detections, and the number of transitions in the reference data.
We developed our algorithm using the shot bound-
ary detection task subset of the trec-10 video col-
lection. Detailed results of blind runs of trec-12,
including comparison with other approaches, appear
Collection   Clips   Frames    Abrupt   Gradual
trec-10      18      594 179   2 066    1 037
trec-11      18      545 068   1 466      591
trec-12      13      596 054   2 364    1 012

Table 1. Details of our test collections.
elsewhere [29]. We have since improved our technique
through further training on the trec-11 and trec-12
test sets. In this paper, we discuss results obtained
with the current approach. Details of all test sets that
we use are shown in Table 1.
The trec-10 and trec-11 collections contain a
variety of documentary and educational cinema and
television footage, some more than fifty years old.
The trec-11 collection also includes amateur video.
The collection used in trec-12 comprises more recent
footage, mostly television news and entertainment pro-
gramming from the period 1998-2002. All three col-
lections contain a large number of annotated abrupt
transitions which we do not consider in this paper, but
have explored in detail elsewhere [27].
The reference data for the collections categorises
gradual transitions into three classes:
Dissolve: One shot is replaced with another by grad-
ually dimming the first shot and gradually increas-
ing the brightness of the second.
Fade-out or fade-in: A fade-in or fade-out can be
considered to be a special case of a dissolve, with
the first or second shot consisting of frames of only
one colour, usually black. A common transition is
a fade-out of one shot, followed by a fade-in of the next.
Other: This category comprises all other transition ef-
fects that stretch over more than two frames, such
as wipes and pixelation effects, and also artifacts
of imperfect splicing of the original cine film.
Many gradual transitions extend only over a handful
of frames, and are effectively observed as cuts by hu-
man viewers at the standard replay speed of 24 or 30
frames per second. The two known dissolves marked in
Figure 3 are examples of such short transitions, with
an effective length of only three frames each. In ac-
cordance with the trecvid guidelines that a cut may
stretch over up to six consecutive frames [23], we con-
sider such short transitions to be abrupt, rather than
gradual, transitions.
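This reclassification rule can be stated compactly; the six-frame limit follows the trecvid guideline cited above, while the function name is illustrative.

```python
def classify_transition(start, end, cut_limit=6):
    """Label a transition interval (inclusive frame numbers) as a
    cut or a gradual transition, treating transitions of up to six
    consecutive frames as abrupt."""
    return "cut" if end - start + 1 <= cut_limit else "gradual"
```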
We have experimented with one-dimensional and
three-dimensional histograms using the rgb and hsv
colour spaces [24], and also with a feature based on
the Daubechies wavelet coefficients of the transformed
frame data [4]. We have found that gradual transitions
are best detected with one-dimensional hsv colour histograms using 32 bins per colour component. All results reported in this paper are for this feature representation.
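A sketch of this feature extraction follows, assuming HSV components already normalised to [0, 1); the paper does not describe its quantisation in detail, so the binning shown here is illustrative.

```python
def hsv_histogram(pixels, bins=32):
    """One-dimensional colour histogram: a separate bins-bin
    histogram per HSV component, concatenated (96 values for
    bins=32) and normalised by the pixel count.

    pixels: iterable of (h, s, v) tuples, each component in [0, 1)."""
    hist = [0.0] * (3 * bins)
    count = 0
    for h, s, v in pixels:
        for c, value in enumerate((h, s, v)):
            idx = min(int(value * bins), bins - 1)  # clamp value close to 1.0
            hist[c * bins + idx] += 1.0
        count += 1
    return [x / count for x in hist] if count else hist
```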
5 Results

Table 2 shows results for detecting gradual transitions for each of the three trec collections, using algorithm parameters that produce the best performance.
As with most applications, it is generally possible to
trade precision for higher recall. These results show
that the technique performs best for the trec-10 test
collection. We observe that for the two newer trec
collections, and especially for trec-12, recall drops
considerably. This is caused in large part by the ap-
pearance of transitions in rapid succession, which our
algorithm tends to report as a single transition.
The trec-11 collection also contains a large pro-
portion of older footage with low quality, and is the
most challenging for our system. The high number of
false positives has a negative effect on precision and
quality. We cater for this by raising the threshold level
(utf) by 20%, and by applying a dmz of one frame on
either side of the current frame to reduce the effects
of low video quality, camera motion, and compression
artifacts. The demilitarised zone excludes the frames immediately adjacent to the current frame from the pre- and post-frame sets, reducing sensitivity to noise in lower-quality footage. This produces
the best compromise between recall and precision for
this collection. Although recall on the trec-12 col-
lection is rather low, precision remains reasonable, and
at 31.9%, the rate of false positives is not unacceptable.
More detailed results are provided in Table 3. Since
our system does not yet distinguish between transi-
tion types, we cannot calculate the individual insertion
rates. As expected, our approach performs better for
dissolves and fades than for other, less common, grad-
ual transition types.
Table 4 shows the frame recall and frame precision
obtained for each collection. We observe very good re-
sults for all types of gradual transitions. Frame recall
for fades in the trec-10 and the trec-11 collections is
relatively low. We find that in these collections, the av-
erage length of fades is 80 and 89 frames respectively,
while the corresponding average in the trec-12 col-
lection is 29 frames. Our implementation is currently
limited to detection of gradual transitions spanning less
than 60 frames.
The values used for the algorithm parameters play
an important part in determining effectiveness. Fig-
ure 4 illustrates the effect of varying the half-window
Collection  hws  dmz  utf  Recall  Precision  Quality  Deletions  Insertions
trec-10     18   1    1.0  83.5%   75.0%      72.4%    16.4%       33.4%
trec-10     18   1    1.2  76.4%   80.7%      69.4%    23.5%       21.1%
trec-11     18   1    1.0  81.7%   56.8%      33.8%    18.2%      144.0%
trec-11     18   1    1.2  64.5%   77.0%      53.4%    22.9%       70.8%
trec-12     14   0    1.0  65.9%   76.4%      56.0%    34.0%       29.6%
trec-12     14   0    1.2  58.9%   82.5%      53.7%    41.0%       15.8%

Table 2. Results of the best runs for gradual transitions for the trec video collections, and the parameters used in these runs.
            Reference transitions      Recall                    Deletions                 Insertions
Collection  Dissolve  Fade  Other      Dissolve  Fade   Other    Dissolve  Fade   Other    All
trec-10     942       54    41         86.2%     68.5%  70.7%    13.8%     42.6%  24.4%    35.8%
trec-11     510       63    18         78.6%     76.2%  61.1%    21.6%     31.8%  38.9%    67.2%
trec-12     684       116   212        76.9%     47.4%  44.8%    25.1%     52.6%  55.7%    31.9%

Table 3. Results grouped by transition type for the best run on each test collection. We observe much better performance for dissolves and fades than for other types of gradual transition.
size (hws) on recall, precision, and quality for the
trec-12 collection. With a larger hws, precision in-
creases but recall drops considerably for half-window
sizes larger than 16 frames. For this collection, opti-
mum quality is achieved with a half-window size of 14.
The upper threshold factor (utf) also affects the
trade-off between recall and precision, with quality
peaking when utf=1. Figure 5 shows that while preci-
sion improves with a larger utf, there is an associated
drop in recall. The best parameter values over all three
collections are hws=14, dmz=0, and utf=1.
Our algorithm performs well relative to compara-
ble systems. In the trec-12 shot boundary detection
task, an earlier implementation was among the better-
performing systems, and obtained the highest preci-
sion of all participants for gradual transitions [29]. It
achieved average recall, above-average frame precision,
and the best results for frame recall. The results pre-
sented here reflect performance after the trec-12 data
was included in the training set, and indicate perfor-
mance in the top four systems of trec-12.
6 Conclusion
Gradual transitions comprise a significant propor-
tion of all shot transitions. The relatively small inter-
frame differences during gradual transitions are often
indistinguishable from normal levels of inter-frame dis-
tance. This makes gradual transitions much harder to
detect than cuts. However, effective identification of
gradual transitions is important for complete video in-
dexing and retrieval.
In this paper, we have proposed a novel approach
to gradual transition detection, based on our moving
query window technique. This monitors the accumu-
lated inter-frame distance of frame collections for de-
tection of gradual shot changes. We have shown that
it is effective on large video collections, with recall and
precision of approximately 83% and 75% respectively.
A particular strength of our approach is the accurate
detection of the start and end of gradual transitions.
We plan to address the high false detection rate
through the use of localised histograms and an edge-
tracking feature. We also intend to explore automatic
parameter selection to allow the system to automati-
cally adapt to different types of footage.
Despite its relative simplicity, our technique shows
good results when tested on a large video collection
comprising a variety of content, and has the potential
to be the basis for more effective video segmentation.

References

[1] …gar, A. Jaimes, C. Lang, C.-Y. Lin, A. Natsev,
M. Naphade, C. Neti, H. J. Nock, H. H. Permuter,
R. Singh, J. R. Smith, S. Srinivasan, B. L. Tseng,
T. V. Ashwin, and D. Zhang. IBM Research TREC-
2002 video retrieval system. In E. M. Voorhees and
L. P. Buckland, editors, NIST Special Publication 500-
251: Proceedings of the Eleventh Text REtrieval Con-
ference (TREC 2002), pages 289–298, Gaithersburg,
MD, USA, 19–22 November 2002.
            Dissolve               Fade                   Other
Collection  F-Recall  F-Precision  F-Recall  F-Precision  F-Recall  F-Precision
trec-10     94.5%     81.1%        44.5%     83.1%        63.2%     78.9%
trec-11     94.6%     83.2%        49.0%     88.9%        57.1%     73.1%
trec-12     96.7%     76.6%        88.0%     83.8%        81.5%     87.6%

Table 4. Frame recall and frame precision grouped by transition type for the best runs on each test collection.
Figure 4. Variation of recall, precision and quality with the HWS for the TREC-12 test collection. The
best recall/precision trade-off (maximum quality) is seen for HWS=14.
[2] P. Aigrain, H. J. Zhang, and D. Petkovic. Content-
based representation and retrieval of visual media: A
state-of-the-art review. Multimedia Tools and Appli-
cations, 3(3):179–202, September 1996.
[3] J. S. Boreczky and L. A. Rowe. Comparison of video
shot boundary detection techniques. Journal of Elec-
tronic Imaging, 5(2):122–128, April 1996.
[4] I. Daubechies. Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1992.
[5] A. Del Bimbo. Visual Information Retrieval. Morgan
Kaufmann Publishers Inc., San Francisco, CA, USA, 1999.
[6] A. Hampapur, R. Jain, and T. Weymouth. Digital
video segmentation. In Proceedings of the ACM In-
ternational Conference on Multimedia, pages 357–364,
San Francisco, CA, USA, 15–20 October 1994.
[7] A. G. Hauptmann and M. J. Witbrock. Story segmen-
tation and detection of commercials in broadcast news
video. In Proceedings of the IEEE International Fo-
rum on Research and Technology Advances in Digital
Libraries (ADL’98), pages 168–179, Santa Barbara,
CA, USA, 22–24 April 1998.
[8] D. Heesch, M. J. Pickering, S. Rüger, and A. Yavlinsky. Video retrieval within a browsing framework using
key frames. In E. M. Voorhees and L. P. Buckland, edi-
tors, NIST Special Publication 500-252: Proceedings of
the Twelfth Text REtrieval Conference (TREC 2003),
Gaithersburg, MD, USA, 18–21 November 2003. To appear.
[9] I. Koprinska and S. Carrato. Temporal video segmen-
tation: A survey. Signal Processing: Image
Communication, 16(5):477–500, 2001.
[10] R. W. Lienhart. Comparison of automatic shot bound-
ary detection algorithms. Proceedings of the SPIE;
Storage and Retrieval for Image and Video Databases
VII, 3656:290–301, December 1998.
[11] R. W. Lienhart. Reliable Dissolve Detection. Pro-
ceedings of the SPIE; Storage and Retrieval for Media
Databases, 4315:545–552, December 2001.
[12] R. W. Lienhart. Reliable transition detection in
videos: A survey and practitioner’s guide. In-
ternational Journal of Image and Graphics (IJIG),
1(3):469–486, July 2001.
[13] X. Liu and T. Chen. Shot boundary detection using
temporal statistics modeling. In Proceedings of the
IEEE International Conference on Acoustics Speech
and Signal Processing, volume 4, pages 3389–3392, Or-
lando, FL, USA, 13–17 May 2002.
[14] S. Marchand-Maillet. Content-based video retrieval:
An overview. Technical Report 00.06, CUI - University
of Geneva, Geneva, Switzerland, 2000.
[15] A. Nagasaka and Y. Tanaka. Automatic Video In-
dexing and Full-Video Search for Object Appearances.
Visual Database Systems, 2:113–127, 1992.
[16] J. Nam and A. H. Tewfik. Dissolve transition detec-
tion using B-Splines interpolation. In A. Del Bimbo,
editor, IEEE International Conference on Multimedia
and Expo (ICME), volume 3, pages 1349–1352, New
York, NY, USA, 30 July – 2 August 2000.
Figure 5. Variation of recall, precision and quality with different values of UTF for the TREC-12 test
collection. The highest quality index is observed at UTF=1.
[17] S. Pfeiffer, R. W. Lienhart, and W. Effelsberg. Scene
determination based on video and audio features.
Technical Report TR-98-020, University of Mannheim,
Germany, January 1998.
[18] G. M. Quénot, D. Moraru, and L. Besacier. CLIPS
at TRECVID: Shot boundary detection and feature
detection. In E. M. Voorhees and L. P. Buckland, edi-
tors, NIST Special Publication 500-252: Proceedings of
the Twelfth Text REtrieval Conference (TREC 2003),
Gaithersburg, MD, USA, 18–21 November 2003. To appear.
[19] G. M. Quénot and P. Mulhem. Two systems for tem-
poral video segmentation. In Proceedings of the Eu-
ropean Workshop on Content Based Multimedia In-
dexing (CBMI’99), pages 187–194, Toulouse, France,
25–27 October 1999.
[20] R. Ruiloba, P. Joly, S. Marchand-Maillet, and G. M.
Quénot. Towards a standard protocol for the evalu-
ation of video-to-shots segmentation algorithms. In
Proceedings of the European Workshop on Content
Based Multimedia Indexing (CBMI’99), pages 41–48,
Toulouse, France, 25–27 October 1999.
[21] A. F. Smeaton, W. Kraaij, and P. Over. TRECVID-2003: An introduction. In E. M. Voorhees and
L. P. Buckland, editors, NIST Special Publication 500-
252: Proceedings of the Twelfth Text REtrieval Con-
ference (TREC 2003), Gaithersburg, MD, USA, 18–21
November 2003. To appear.
[22] A. F. Smeaton and P. Over. The TREC-2002 video track report. In E. M. Voorhees and L. P. Buckland, editors, NIST Special Publication 500-251: Proceedings of the Eleventh Text REtrieval Conference (TREC 2002), pages 69–85, Gaithersburg, MD, USA, 19–22 November 2002.
[23] A. F. Smeaton, P. Over, and R. Taban. The TREC-
2001 video track report. In E. M. Voorhees and D. K.
Harman, editors, NIST Special Publication 500-250:
Proceedings of the Tenth Text REtrieval Conference
(TREC 2001), pages 52–60, Gaithersburg, MD, USA,
13–16 November 2001.
[24] J. R. Smith. Content-based access of image and video
libraries. Encyclopedia of Library and Information
Science, 1:40–61, 2001.
[25] J. Sun, S. Cui, X. Xu, and Y. Luo. Automatic video
shot detection and characterization for content-based
video retrieval. Proceedings of the SPIE; Visualization
and Optimisation Techniques, 4553:313–320, Septem-
ber 2001.
[26] S. M. M. Tahaghoghi, J. A. Thom, and H. E. Williams.
Shot boundary detection using the moving query win-
dow. In E. M. Voorhees and L. P. Buckland, edi-
tors, NIST Special Publication 500-251: Proceedings
of the Eleventh Text REtrieval Conference (TREC
2002), pages 529–538, Gaithersburg, MD, USA, 19–
22 November 2002.
[27] S. M. M. Tahaghoghi, J. A. Thom, H. E. Williams, and
T. Volkmer. Video cut detection using frame windows.
In submission.
[28] B. T. Truong, C. Dorai, and S. Venkatesh. New en-
hancements to cut, fade, and dissolve detection pro-
cesses in video segmentation. In R. Price, editor, Pro-
ceedings of the ACM International Conference on Mul-
timedia 2000, pages 219–227, Los Angeles, CA, USA,
30 October – 4 November 2000.
[29] T. Volkmer, S. M. M. Tahaghoghi, H. E. Williams, and J. A. Thom. The moving query window for shot boundary detection at TREC-12. In E. M. Voorhees and L. P. Buckland, editors, NIST Special Publication 500-252: Proceedings of the Twelfth Text REtrieval Conference (TREC 2003), Gaithersburg, MD, USA, 18–21 November 2003. To appear.
[30] J. Yu and M. D. Srinath. An efficient method for
scene cut detection. Pattern Recognition Letters,
22(13):1379–1391, January 2001.
[31] H. J. Zhang, A. Kankanhalli, and S. W. Smoliar. Au-
tomatic partitioning of full-motion video. Multimedia
Systems Journal, 1(1):10–28, June 1993.