Video Cut Detection using Frame Windows
S. M. M. Tahaghoghi Hugh E. Williams James A. Thom Timo Volkmer
School of Computer Science and Information Technology
RMIT University, GPO Box 2476V, Melbourne 3001, Australia.
{saied,hugh,jat,tvolkmer}@cs.rmit.edu.au
Abstract
Segmentation is the first step in managing data for
many information retrieval tasks. Automatic audio
transcriptions and digital video footage are typically
continuous data sources that must be pre-processed
for segmentation into logical entities that can be
stored, queried, and retrieved. Shot boundary detec-
tion is a common low-level video segmentation tech-
nique, where a video stream is divided into shots
that are typically composed of similar frames. In
this paper, we propose a new technique for finding
cuts — abrupt transitions that delineate shots — that
combines evidence from a fixed size window of video
frames. We experimentally show that our techniques
are accurate using the well-known trec experimental
testbed.
Keywords: Shot boundary detection, cut detection,
video segmentation, video retrieval
1 Introduction
Video cameras, recorders, and editing suites are now
accessible to millions of consumers. This growth in
the availability and number of manipulation tools
has led to an explosion in the volume of data stored
in video archives, made available on the Web, and
broadcast through a wide range of media. However,
despite the urgent need for automatic techniques to
manage, store, and query video data, innovations in
video retrieval techniques have not kept pace.
For video data to be useful, its content must be
represented so that it can be stored, queried, and dis-
played in response to user information needs. How-
ever, this is a difficult problem: video has a time di-
mension, and must be reviewed sequentially to iden-
tify sections of interest. Moreover, when an interest-
ing section is identified, its content must then be rep-
resented for later search and display. Solving these
problems is both important and difficult. If video
footage is not indexed and remains inaccessible, it
will not be used. This is a problem we all face with
our home video collections; it is a far more pressing
issue for content providers and defence intelligence
organisations.
Copyright © 2005, Australian Computer Society, Inc. This paper appeared at the 28th Australasian Computer Science Conference (ACSC2005), The University of Newcastle, Australia. Conferences in Research and Practice in Information Technology, Vol. 38. V. Estivill-Castro, Ed. Reproduction for academic, not-for-profit purposes permitted provided this text is included.
This paper incorporates work from trec-11 (Tahaghoghi,
Thom & Williams 2002) and trec-12 (Volkmer, Tahaghoghi,
Thom & Williams 2003).
The expense of formal management of video re-
sources — generally through the use of textual an-
notations provided by human operators (Idris &
Panchanathan 1997) — has limited it to mostly com-
mercial environments. This process is tedious and
expensive; moreover, it is subjective and inconsistent.
In contrast, while automatic techniques may not be
as effective in understanding the semantics of video,
they are likely to be more scalable, cost-effective, and
consistent.
The basic semantic element of video footage is the
shot (Del Bimbo 1999), a sequence of frames that are
often very similar. Segmenting video into shots is of-
ten the first step in video management. The typical
second step is extracting key frames that represent
each shot, and storing these still images for subse-
quent retrieval tasks (Brunelli, Mich & Modena 1999).
For example, after extracting key frames, the system
may permit users to pose an example image as a query
and, using content-based image retrieval techniques,
show a ranked list of key frames in response to the
query. After this, the user might select a key frame
and be shown the original video content.
Two classes of transition define the boundaries be-
tween shots. Abrupt transitions or cuts are the sim-
plest and most common transition type: these are
generally used when advertisements are inserted into
television programmes, when a story is inserted into
a news programme, or in general information video.
Fades, dissolves, spatial edits, and other gradual tran-
sitions (Hampapur, Jain & Weymouth 1994) are more
complex but less frequent: these are much more com-
mon in entertainment footage such as movies and
television serials. Accurate detection of cuts, fades,
and dissolves is crucial to video segmentation; in-
deed, Lienhart (1998) reports that these transitions
account for more than 99% of all transitions across
all types of video. Video shot boundary detection is
a problem that has been extensively researched, but
achieving highly accurate results continues to be a
challenge (Smeaton, Kraaij & Over 2003). In par-
ticular, while cuts are generally easier to detect than
gradual transitions, cut detection is not a solved
problem, and scope for improvement remains.
In this paper, we propose a new technique for ac-
curately detecting cuts. Our technique makes use of
the intuition that frames preceding a cut are similar
to each other, and dissimilar to those following the
cut. In brief, the technique works as follows. First,
for each frame in a video — the current frame — we
extract from video footage a set or window of consec-
utive, ordered frames centred on that frame. Second,
we order the frames in the window by decreasing simi-
larity to the current frame. Last, we inspect the rank-
ing of the frames, and record the number of frames
preceding the current frame in the original video that
are now ranked in the first half of the list; we call this
the pre-frame count. We repeat this process for each
frame. Cuts are detected by identifying significant
changes in the pre-frame count between consecutive
frames.
Our results show that this approach is effective.
After training on a subset of the collection used
for trec-10 experiments in 2001, we find or recall
over 95% of all cuts on the trec-11 (2002) collection,
with precision of approximately 88%. Under the quality
index measure (Quénot & Mulhem 1999) — which
favours recall over precision — our technique achieves
around 91%. For the trec-12 (2003) collection, we
obtain recall, precision, and quality of 94%, 89%,
and 91% respectively. Importantly, our technique has
only a few parameters, and we believe these are ro-
bust across different video types and collections. We
have separately applied this principle to the detec-
tion of gradual transitions; this is discussed in detail
elsewhere (Volkmer, Tahaghoghi & Williams 2004a).
We participated in the trec-11 and trec-12
video evaluation shot boundary detection task using
an early implementation of the approach described in
this paper. This preliminary approach was highly ef-
fective: by quality index, our top run was ranked 1st
out of 54 participating runs for the cut detection
sub-task in 2002, and 26th of 76 participating runs
in 2003. Using our new approach we would have been
ranked higher in 2003, although this is perhaps an un-
fair comparison given that some time has passed since
the conference.
2 Background
Shot boundary detection techniques can be cate-
gorised as using compressed or uncompressed video.
The former consider features of the encoded footage
such as DCT coefficients, macro blocks, or motion
vectors. These techniques are efficient because the
video does not need to be fully decoded. How-
ever, using the encoded features directly can result
in lower accuracy (Boreczky & Rowe 1996, Koprinska
& Carrato 2001).
Most approaches to shot boundary detection use
uncompressed video, and typically compute differ-
ences between frames. There is generally little differ-
ence between adjacent frames that lie within the same
shot. However, when two adjacent frames span a cut,
that is, each is a member of a different shot, there is
often sufficient dissimilarity to enable cut detection.
The same technique can be applied to gradual tran-
sition detection, but this typically requires consider-
ation of the differences for many adjacent frames.
There are several methods to measure the dif-
ference between frames. In pixel-by-pixel compari-
son, the change in the values of corresponding pix-
els of adjacent frames is determined. While this
method shows good results (Boreczky & Rowe 1996),
it is computationally expensive and sensitive to cam-
era motion, camera zoom, intensity variation, and
noise. Techniques such as motion compensation and
adaptive thresholding can be used to improve the
accuracy of these comparisons (Quénot, Moraru &
Besacier 2003).
Most popular techniques on uncompressed video
summarise frame content using histograms. Such ap-
proaches represent a frame, or parts of a frame, by
the frequency distribution of features such as colour
or texture. For example, colour spaces are often sep-
arated into their component dimensions — such as
into the H, S, and V components of the HSV colour
space — which are then divided into discrete ranges
or bins. For each bin, a frequency is computed. The
difference between frames is computed from the dis-
tance between bin frequencies over each colour dimen-
sion using an appropriate distance metric.
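As a concrete illustration of this per-bin comparison, the following sketch computes the Manhattan (city-block) distance between two toy histograms. The function name, bin count, and values are our own illustration, not any particular system's implementation:

```python
def manhattan_distance(hist_a, hist_b):
    """City-block (L1) distance between two frame histograms,
    each a flat list of per-bin frequencies."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))

# Two toy 4-bin histograms for a single colour component:
frame1 = [10, 20, 5, 1]   # e.g. hue distribution of one frame
frame2 = [8, 22, 5, 3]
print(manhattan_distance(frame1, frame2))  # 6
```

In practice a frame's feature vector concatenates the bins of every colour dimension, and the distance is accumulated over all of them.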
Histograms have been widely used. Recent work
includes that of Heesch, Pickering, Rüger & Yavlinsky
(2003), who compare colour histograms across
multiple timescales. Pickering & Rüger (2001)
divide frames into nine blocks, and compare histograms
of corresponding blocks. The ibm CueVideo pro-
gram extracts sampled three-dimensional rgb colour
histograms from video frames (Smith, Srinivasan,
Amir, Basu, Iyengar, Lin, Naphade, Ponceleon &
Tseng 2001), and uses adaptive threshold levels and a
state machine to detect and classify transitions. Sun,
Cui, Xu & Luo (2001) compare the colour histograms
of adjacent frames within a moving window; a shot
boundary is reported if the distance between the cur-
rent frame and the immediately preceding one is the
largest in the window, and significantly larger than
the second largest inter-frame distance in the same
window. They state shot boundary detection results
for five feature films. It is unclear how well their
methods perform for other types of video, such as
news clips or sports footage. However, as we show
in Section 4, the use of windows can lead to effective
shot boundary detection.
Another approach involves applying transforms
to the frame data. Cooper, Foote, Adcock & Casi
(2003) represent frames by their low-order DCT co-
efficients, and calculate the similarity of each frame
to the frames surrounding it. The frames before and
after a cut would have high similarity to past and fu-
ture frames respectively, but low similarity across the
boundary. Miene, Hermes, Ioannidis & Herzog (2003)
use FFT coefficients calculated from a grayscale ver-
sion of the frame for their comparisons. A detailed
overview of existing techniques is provided by Ko-
prinska & Carrato (2001).
3 The Moving Query Window Method
In this section, we propose a novel technique for cut
detection. In a similar manner to most schemes de-
scribed in the previous section, we use differences be-
tween global frame feature data. The novelty in our
approach is the ranking-based method, which is in-
spired by our previous work in content-based image
retrieval (Tahaghoghi 2002). We describe our tech-
nique in detail in this section.
3.1 Basic Approach
The key property of a cut is that it is an abrupt
boundary between two distinct shots: a cut defines
the beginning of the new shot. In general, therefore,
a cut is indicated by a frame that is dissimilar to a
window of those that precede it, but similar to a win-
dow of those that follow it. Our aim in proposing
our technique is to make use of this observation to
accurately detect cuts.
Before we explain our approach, consider the ex-
ample video fragment shown in Figure 1 that contains
a cut between the 14th and 15th frames. We use this
example to define terminology used in the remainder
of this paper. After processing the first 12 frames
in the fragment, the 13th frame is the current frame
that is being considered as a possible cut. Five pre-
frames are shown marked before the current frame,
and five post-frames follow it. Together, the pre- and
post-frames are a moving window that is centred on
the current frame; we refer to the window as mov-
ing because it is used to sequentially consider each
frame in the video as possibly bordering a cut. The
number of pre- and post-frames is always equal. We
refer to this as the half-window size or hws; in this
example, hws=5.
Figure 1: Moving query window with a half-window size (hws) of 5. The five frames before and the five frames after the current frame form a collection on which the current frame is used as a query example.
Pre-frames             Current frame (fc)   Post-frames            Pre-frame count
A A A A A A A A A A    A                    A A A A A A A A A A    5
A A A A A A A A A A    A                    A A A A A B B B B B    7
A A A A A A A A A A    A                    B B B B B B B B B B    10
A A A A A A A A A A    B                    B B B B B B B B B B    0
A A A A A A B B B B    B                    B B B B B B B B B B    2

Figure 2: Moving query window with hws=10. As the window traverses a cut, the number of pre-frames in the N/2 frames most similar to the current frame varies significantly. This number (the pre-frame count) rises to a maximum just before an abrupt transition, and drops to a minimum immediately afterwards.
Consider now how to detect whether the current frame f_c borders a cut. First, a collection of frames C is created from the frames f_{c−hws} … f_{c−1} and f_{c+1} … f_{c+hws} that respectively precede and follow the current frame f_c; these frames are those from the moving window that is centred on frame f_c, but excluding f_c itself. Second, the global feature data of the frames in C is summarised, and the distance between the current frame f_c and each frame in C is computed. Third, the frames are ordered by increasing distance from the current frame to achieve a ranking. Last, we consider only the first |C|/2 top-ranked frames — which is equal to the hws — and record the number that are pre-frames; we refer to the number of pre-frames in the |C|/2 top-ranked frames as the pre-frame count. If the value of the pre-frame count is zero (or close to zero), it is likely that a cut has occurred. In practice, we consider the results of computing the pre-frame count for several adjacent frames to improve cut detection reliability.
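The ranking and counting steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the distances from the current frame to its pre- and post-frames have already been computed, and the function name is our own.

```python
def pre_frame_count(distances_pre, distances_post):
    """Rank the 2*hws window frames by increasing distance to the
    current frame and count how many of the top |C|/2 (= hws)
    are pre-frames."""
    labelled = [(d, "pre") for d in distances_pre] + \
               [(d, "post") for d in distances_post]
    labelled.sort(key=lambda pair: pair[0])   # increasing distance
    top_half = labelled[:len(labelled) // 2]
    return sum(1 for _, side in top_half if side == "pre")

# Last frame before a cut: pre-frames dominate the top half (count = hws).
print(pre_frame_count([1, 2, 3, 4, 5], [50, 60, 70, 80, 90]))  # 5
# First frame after a cut: post-frames dominate the top half (count = 0).
print(pre_frame_count([50, 60, 70, 80, 90], [1, 2, 3, 4, 5]))  # 0
```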
Consider again Figure 1. The current (13th) frame is not the first in a new shot and therefore does not define a cut. We expect that the five pre-frames and the first post-frame — all from the first shot — would be ranked as more similar to the current frame than the remaining post-frames. Therefore, when inspecting the first |C|/2 ranked frames, either four or five are pre-frames (the pre-frame count is four or five), and a cut is unlikely to be present.
3.2 Combining Results
In the previous section, we explained our simple ap-
proach to detecting cuts using ranking. In this sec-
tion, we explain how the rankings from the moving
window approach can be combined for effective cut
detection.
A representation of a video is shown in Figure 2.
The video contains two shots — labelled A and B,
where A occurs immediately before B — and we use
hws=10 for our moving query window; hence, our
frame collection contains 20 frames, 10 each from the
pre- and post-frames. The figure shows five differ-
ent situations that occur as the video is sequentially
processed with our algorithm:
The first row shows the situation where the mov-
ing window is entirely within shot A. On comput-
ing the distance between the current frame and
each of the pre- and post-frames, we find the pre-
frame count to be 5; this is because the pre- and
post-frames are approximately equally similar to
the current frame. As the pre-frame count is not
near-zero, our algorithm described in the previ-
ous section does not report a cut.
The second row shows where frames of shot B en-
ter the window. The ranking process determines
that 7 of the most similar 10 frames are pre-
frames. Compared to the first row, the pre-frame
count is larger because the frames from shot B
are less similar to the current frame, and are
therefore ranked below all frames from shot A.
Again, since the pre-frame count is not near-zero,
a cut is not reported.
In the third row, the current frame is the last
in the first shot. The ranking determines that
the pre-frame count is at the maximum value
of 10 — since all post-frames are ranked below
all pre-frames — and since this is not near-zero,
a cut is not reported.
The fourth row shows the case where the current
frame is the first in shot B. Here, the post-frames
are all more similar to the current frame than the
pre-frames are, and so the pre-frame count is 0.
Hence, our algorithm reports a cut.
Figure 3: The ratio of pre-frames to post-frames in the first half of the ranked results plotted for a 200-frame interval on a trec video using hws=10. The upper bound (8) and lower bound (2) are also marked.
The final row shows what happens as the frames
from shot B enter the pre-frame half-window.
Some of the pre-frames are now similar to the
current frame, and so the pre-frame count in-
creases to 2. Since we reported a cut for the pre-
vious frame, we do not report another one here.
This simple example illustrates a general trend of our ranking approach. When frames of only one shot are present in the moving window, the ratio of pre-frames to post-frames ranked in the top |C|/2 frames is typically 1. As a new shot enters the post-frames of the window, this ratio increases. When the first frame of the new shot becomes the current frame, the ratio rapidly decreases. Then, as the new shot enters the pre-frames, the ratio stabilises again near 1.
Consider a real-world example. Figure 3 shows a 200-frame interval of a video from the trec-10 collection (Smeaton, Over & Taban 2001). One dissolve and four cuts are known to have occurred, and are marked by a dashed line and crosses. The solid line shows the number of pre-frames present in the top-ranked |C|/2 frames, that is, the pre-frame count. Where a cut occurs, the pre-frame count rises just before the cut, and falls immediately afterwards.
We use the change in the pre-frame count over
adjacent frames to accurately detect cuts. We set an
upper threshold that the pre-frame count must reach
prior to a cut. When this occurs, we test whether
the pre-frame count falls below a minimum threshold
within a few frames. A cut is reported if the pre-frame
count traverses both thresholds. As we show later, we
have found that fixed threshold values perform well
across a wide range of video footage.
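A minimal sketch of this two-threshold test, assuming a precomputed sequence of pre-frame counts. The function and parameter names, and the choice of a four-frame window for "within a few frames", are our own illustration of the rule described above:

```python
def detect_cuts(pre_frame_counts, ub, lb, max_gap=4):
    """Report a cut when the pre-frame count reaches the upper
    bound ub and then falls to the lower bound lb or below
    within max_gap frames."""
    cuts = []
    armed_at = None  # index where the count last reached ub
    for i, count in enumerate(pre_frame_counts):
        if count >= ub:
            armed_at = i
        elif armed_at is not None and count <= lb:
            if i - armed_at <= max_gap:
                cuts.append(i)  # first frame of the new shot
            armed_at = None     # do not report the same cut twice
    return cuts

# Toy trace with hws=6, ub=5, lb=1: the count rises to a maximum
# just before a cut, then collapses at the cut itself.
counts = [3, 3, 4, 5, 6, 0, 3, 3, 3]
print(detect_cuts(counts, ub=5, lb=1))  # [5]
```

Resetting the armed state after a report mirrors the behaviour in the final row of Figure 2, where no second cut is reported as the new shot enters the pre-frames.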
We make several implicit assumptions. First, when adjacent frames span a cut, we expect the value of the pre-frame count to fall from near |C|/2 to 0 within a few frames; we capture this assumption by monitoring the pre-frame count slope for large negative values.
Second, there can be significant frame differences within a shot, and so we specify that the pre- and post-frames spanning a cut must be reasonably different. For this, we apply two empirically-determined thresholds. We require the last pre-frame and the first post-frame to have a difference of at least 25% of the maximum possible inter-frame difference. We also specify that the average difference between f_c and the top |C|/2 frames must be at least half the corresponding value for the lower |C|/2 frames.
Last, in accordance with the trecvid decision
that a cut may stretch over up to six frames (Smeaton
et al. 2001), we allow up to four consecutive frames
to satisfy our criteria. Some action feature films and
trailers contain sections of video with very short shots
of only a few frames (a fraction of a second) each. Our
scheme cannot separate shots shorter than six frames;
however, it can be argued that viewers often cannot
separate these short shots either, and see them as be-
ing part of a single sequence.
4 Results
In this section, we discuss the measurement tech-
niques and experimental environment we used to eval-
uate our approach. We then present overall results,
and discuss the effect of parameter choices on our
technique.
We measure effectiveness using the well-known
recall and precision measures (Witten, Moffat &
Bell 1999). Recall measures the fraction of all known
cuts that are correctly detected, and precision indi-
cates the fraction of detected cuts that match the an-
notated cuts.
We also report cut quality (Quénot & Mulhem
1999), a measure that combines the two into a single
indicator that captures the trade-off between recall
and precision, while favouring recall:
Quality = (Recall / 3) × (4 − 1 / Precision)
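With recall and precision expressed as fractions, the quality index is straightforward to compute; for example, the 95.7% recall and 88.6% precision reported in Section 4.1 give a quality of about 0.916, i.e. the roughly 91% quoted in the paper:

```python
def quality(recall, precision):
    """Quality index of Quénot & Mulhem (1999): combines recall and
    precision while favouring recall. Both arguments are fractions
    in (0, 1]."""
    return (recall / 3) * (4 - 1 / precision)

print(quality(0.957, 0.886))  # ≈ 0.916
```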
For our experiments, we used three trecvid col-
lections containing varied types of footage. We
developed our approach and tuned our thresholds
using only the trec-10 video collection (Smeaton
et al. 2001), and selected parameter settings that
maximise the quality index. We then carried out
blind runs on the trec-11 (Smeaton & Over 2002)
and trec-12 (Smeaton et al. 2003) test collections.
The trec-11 test collection contains 18 video clips
with an average length of 30 281 frames, and a total
of 1 466 annotated cuts, while the trec-12 test col-
lection contains 13 video clips with an average length
of 45 850 frames, and a total of 2 364 annotated cuts.
The collections also contain annotated gradual tran-
sitions that we do not use in the work reported here,
but explore elsewhere (Volkmer et al. 2004a).
4.1 Overall Results
Tables 1 and 2 show the effectiveness of our approach
on the trec-11 and trec-12 collections: recall is
typically 94%–96% and precision 87%–90% for the
three best parameter settings we determined from
the trec-10 collection. Importantly, because the
cut quality measure favours recall over precision, our
technique has a cut quality of 90%–91%. Overall,
therefore, our scheme finds around 19 out of 20 cuts,
and only around 1 in 10 cuts that are detected are
false alarms; as we discuss later, there are several sup-
plementary techniques that can be applied to improve
these results further.
Also listed in the tables are the average results
for our submissions to the corresponding trec work-
shops, and the average results for all other partici-
Description            hws   lb   ub   Recall   Precision   Quality   Rank
MVQ current              6    1    5    95.7%       88.6%       91%
                         6    2    5    95.7%       87.1%       90%
                         7    3    6    95.7%       88.2%       91%
trec-11 mean, MVQ                       85.8%       90.8%       83%     27
trec-11 mean, Others                    85.0%       81.9%       79%     34

Table 1: Results for blind runs of the current Moving Query Window (MVQ) implementation on the trec-11 video collection. The bottom two rows show actual workshop results averaged over all runs for the MVQ approach, and the average for runs submitted by other groups. The last column shows the comparative rank of the means among the 52 participating runs. The best MVQ run was ranked 1st.
Description            hws   lb   ub   Recall   Precision   Quality   Rank
MVQ current              6    1    5    93.6%       90.0%       90%
                         6    2    5    94.7%       89.1%       91%
                         7    3    6    94.2%       89.5%       91%
trec-12 mean, MVQ                       92.2%       85.7%       87%     30
trec-12 mean, Others                    85.2%       87.0%       81%     52

Table 2: Results for blind runs of the current Moving Query Window (MVQ) implementation on the trec-12 video collection. The bottom two rows show actual workshop results averaged over all runs for the MVQ approach, and the average for runs submitted by other groups. The last column shows the comparative rank of the means among the 76 participating runs. The best MVQ run was ranked 26th.
pating groups. Our approach performed better than
most other systems, and has since been considerably
improved. Results for the trecvid 2004 shot boundary
detection task became available very recently. A to-
tal of 141 runs were submitted for the shot boundary
detection task. All twenty of the runs we submitted
for the moving query window approach appeared in
the top twenty-two runs by cut quality. Details of
these runs appear elsewhere (Volkmer, Tahaghoghi &
Williams 2004b).
4.2 Features
We experimented with one-dimensional global histograms using the hsv, cielab, and cieluv colour spaces (Watt 1989), and a fourth feature formed from the coefficients of a 6-tap Daubechies wavelet transform (Daubechies 1992, Williams & Amaratunga 1994) of the frame ycbcr colour data. We used a range of feature detail settings, and employed the Manhattan (city-block) measure to compute the distance between frames.
Using the lowest five sub-bands of the wavelet data
produced the best detection results, although it is
slower to extract and process than the colour features,
and is less effective for detecting gradual transitions.
Global colour histograms summarise a frame by its
colour frequency distribution. This makes them rela-
tively robust to object and camera motion, although
the loss of spatial information can make transition
detection difficult in some cases. We found the best-
performing colour feature to be the hsv colour space
used in global histograms with 128 bins per compo-
nent.
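A global histogram of this kind might be built as follows. This is a sketch under our own assumptions, not the authors' implementation: it takes HSV components normalised to [0, 1), and with 128 bins per component the concatenated feature vector has 3 × 128 = 384 entries, matching the 384-bin hsv data mentioned in Section 4.3.

```python
def global_histogram(pixels, bins=128):
    """Build a one-dimensional global histogram for each HSV
    component. `pixels` is a list of (h, s, v) tuples with each
    component in [0, 1)."""
    hist = [[0] * bins for _ in range(3)]
    for pixel in pixels:
        for dim, value in enumerate(pixel):
            hist[dim][min(int(value * bins), bins - 1)] += 1
    # Concatenate the three component histograms into one feature vector.
    return hist[0] + hist[1] + hist[2]

feature = global_histogram([(0.0, 0.5, 0.99), (0.0, 0.5, 0.2)])
print(len(feature))  # 384
```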
4.3 Half-window Size
To determine the best size of the moving window,
we experimented with half-window sizes (hws) of be-
tween 6 and 20 frames, using appropriate lower and
upper bounds for each half-window size; these bounds
are discussed in the next section. Figure 4 shows that
cuts are accurately detected when the half-window
size is between 6 and 8 frames for the wavelet feature,
and between 8 and 10 frames for global hsv colour
histograms. For both features, we experimented with
different bin sizes and settings that we do not dis-
cuss in detail; the optimal settings for each feature
are reported in the previous section.
Small window sizes are preferable as they min-
imise the amount of computation required. How-
ever, a very small window size increases the sensitivity
to frame variations within a shot, thereby increasing
false alarms. Our results also show that the global
hsv histogram feature is more sensitive to this pa-
rameter than the wavelet feature.
Although our focus is on effectiveness rather than
efficiency, it is interesting to note that the process-
ing cost of our approach is largely dependent on the
number of coefficients used in the feature histograms,
and on the half-window size. Using the wavelet fea-
ture data with 1 176 coefficients and hws=6, our al-
gorithm processes previously-extracted frame feature
data at the rate of more than 3 700 frames per second
on a Pentium-III 733 personal computer. When us-
ing the 384-bin hsv colour data, the processing rate
is almost 9 400 frames per second. The correspond-
ing texture and colour feature extraction stages cur-
rently operate at 2.6 and 11 frames per second re-
spectively. Very little is published about the process-
ing speed of comparable approaches, although Smith
et al. (2001) note that their system runs at “about 2X
real time on a 800MHz P-III”, which translates to ap-
proximately 55 frames per second.
4.4 Upper and Lower Bounds
The lower bound (lb) and upper bound (ub) deter-
mine the relative priorities of recall and precision.
Varying lb has a relatively minor effect on cut detec-
tion, since the pre-frame count often actually reaches
zero at the cut boundary. As Figure 5(a) shows, de-
creasing lb generally increases precision but causes a
slight drop in recall; again, we show several different
settings for the hws and features to illustrate the gen-
eral trends for a range of settings. In contrast, raising
the level of ub towards hws tends to increase preci-
sion while decreasing recall. Figure 5(b) illustrates
this behaviour.
To maximise the results under the quality index,
we use parameters that afford high recall with mod-
erate precision. We have found that a lower bound of
between 1 and 3, and an upper bound of around one
Figure 4: Effect of varying the half-window size (hws) on detection performance for two features over a range of lower and upper bounds. Increasing hws generally lowers detection quality. Cuts are best detected with the wavelet feature and hws set to between 6 and 10 frames.
Figure 5: (a) Reducing lb increases precision at the cost of recall. (b) Increasing ub improves precision, but reduces recall.
less than the hws produce the best results. Overall,
the optimal parameters on the trec-10 collection us-
ing the wavelet feature are a half-window size hws=6,
a lower bound lb=2, and an upper bound ub=5. A
detailed discussion of these results is presented else-
where (Tahaghoghi 2002).
5 Conclusion
Video segmentation is a crucial first step in process-
ing video for retrieval applications. In this paper, we
have described our method for detecting transitions
in video, focusing here on the identification of abrupt
transitions or cuts in digital video. This technique
makes use of the observation that the frames compris-
ing the conclusion of one shot are typically dissimilar
to those that begin the next. The algorithm incorpo-
rates a moving window that considers each possible
cut in the context of the frames that surround it.
We have shown experimentally that our approach
is highly effective on very different test collections
from the trec video track. After tuning our scheme
on one collection, we have shown that it achieves a
cut quality index of around 90% on two other col-
lections. Importantly, our approach works well with-
out applying additional pre-filtering stages such as
motion compensation (Quénot et al. 2003), and has
only a few intuitive parameters that are robust across
very different collections. We believe that our tech-
nique is a valuable new tool for accurate cut detec-
tion. We have also applied a variant of this approach
to the detection of gradual transitions, with good re-
sults (Volkmer et al. 2004a).
We are currently investigating several improve-
ments to our algorithms. These include using dy-
namic thresholds for both abrupt and gradual transi-
tions, local histograms, and an edge-tracking feature.
We also plan to explore whether excluding selected
frame regions — specifically the area of camera fo-
cus — from the comparison stage can reduce the false
detection rate for difficult video clips, and in this way
help achieve even more effective automatic segmenta-
tion of video.
References
Boreczky, J. S. & Rowe, L. A. (1996), ‘Comparison of
video shot boundary detection techniques’, Jour-
nal of Electronic Imaging 5(2), 122–128.
Brunelli, R., Mich, O. & Modena, C. M. (1999),
‘A survey of the automatic indexing of video
data’, Journal of Visual Communication and Im-
age Representation 10(2), 78–112.
Cooper, M., Foote, J., Adcock, J. & Casi, S. (2003),
Shot boundary detection via similarity analysis,
in ‘Proceedings of the TRECVID 2003 Work-
shop’, Gaithersburg, Maryland, USA, pp. 79–84.
Daubechies, I. (1992), Ten Lectures on Wavelets, So-
ciety for Industrial and Applied Mathematics,
Philadelphia, Pennsylvania, USA.
Del Bimbo, A. (1999), Visual Information Retrieval,
Morgan Kaufmann Publishers Inc.
Hampapur, A., Jain, R. & Weymouth, T. (1994), Dig-
ital video segmentation, in ‘Proceedings of the
ACM International Conference on Multimedia’,
San Francisco, California, USA, pp. 357–364.
Heesch, D., Pickering, M. J., Rüger, S. & Yavlin-
sky, A. (2003), Video retrieval within a browsing
framework using key frames, in ‘Proceedings of
the TRECVID 2003 Workshop’, Gaithersburg,
Maryland, USA, pp. 85–95.
Idris, F. M. & Panchanathan, S. (1997), ‘Review of
image and video indexing techniques’, Journal
of Visual Communication and Image Represen-
tation 8(2), 146–166.
Koprinska, I. & Carrato, S. (2001), ‘Temporal video
segmentation: A survey’, Signal Processing: Im-
age Communication 16(5), 477–500.
Lienhart, R. W. (1998), ‘Comparison of automatic
shot boundary detection algorithms’, Proceed-
ings of the SPIE; Storage and Retrieval for Still
Image and Video Databases VII 3656, 290–301.
Miene, A., Hermes, T., Ioannidis, G. T. & Her-
zog, O. (2003), Automatic shot boundary detection using adaptive thresholds, in ‘Proceedings of
the TRECVID 2003 Workshop’, Gaithersburg,
Maryland, USA, pp. 159–165.
Pickering, M. & Rüger, S. M. (2001), Multi-timescale
video shot-change detection, in ‘NIST Spe-
cial Publication 500-250: Proceedings of the
Tenth Text REtrieval Conference (TREC 2001)’,
Gaithersburg, Maryland, USA, pp. 275–278.
Quénot, G. M., Moraru, D. & Besacier, L. (2003),
CLIPS at TRECVID: Shot boundary detec-
tion and feature detection, in ‘Proceedings of
the TRECVID 2003 Workshop’, Gaithersburg,
Maryland, USA, pp. 35–40.
Quénot, G. & Mulhem, P. (1999), Two systems for
temporal video segmentation, in ‘Proceedings of
the European Workshop on Content Based Mul-
timedia Indexing (CBMI’99)’, Toulouse, France,
pp. 187–194.
Smeaton, A. F., Kraaij, W. & Over, P. (2003),
TRECVID-2003 – An introduction, in ‘Pro-
ceedings of the TRECVID 2003 Workshop’,
Gaithersburg, Maryland, USA, pp. 1–10.
Smeaton, A. F. & Over, P. (2002), The TREC-2002
video track report, in ‘NIST Special Publication
500-251: Proceedings of the Eleventh Text RE-
trieval Conference (TREC 2002)’, Gaithersburg,
Maryland, USA, pp. 69–85.
Smeaton, A., Over, P. & Taban, R. (2001), The
TREC-2001 video track report, in ‘NIST Spe-
cial Publication 500-250: Proceedings of the
Tenth Text REtrieval Conference (TREC 2001)’,
Gaithersburg, Maryland, USA, pp. 52–60.
Smith, J. R., Srinivasan, S., Amir, A., Basu, S., Iyen-
gar, G., Lin, C. Y., Naphade, M. R., Ponceleon,
D. B. & Tseng, B. L. (2001), Integrating fea-
tures, models, and semantics for TREC video
retrieval, in ‘NIST Special Publication 500-250:
Proceedings of the Tenth Text REtrieval Con-
ference (TREC 2001)’, Gaithersburg, Maryland,
USA, pp. 240–249.
Sun, J., Cui, S., Xu, X. & Luo, Y. (2001), ‘Auto-
matic video shot detection and characterization
for content-based video retrieval’, Proceedings of
the SPIE; Visualization and Optimisation Tech-
niques 4553, 313–320.
Tahaghoghi, S. M. M. (2002), Processing Similarity
Queries in Content-Based Image Retrieval, PhD
thesis, RMIT University, School of Computer
Science and Information Technology, Melbourne,
Australia.
Tahaghoghi, S. M. M., Thom, J. A. & Williams, H. E.
(2002), Shot boundary detection using the mov-
ing query window, in ‘NIST Special Publication
500-251: Proceedings of the Eleventh Text RE-
trieval Conference (TREC 2002)’, Gaithersburg,
Maryland, USA, pp. 529–538.
Volkmer, T., Tahaghoghi, S. M. M., Thom, J. A. &
Williams, H. E. (2003), The moving query win-
dow for shot boundary detection at TREC-12, in
‘Proceedings of the TRECVID 2003 Workshop’,
Gaithersburg, Maryland, USA, pp. 147–156.
Volkmer, T., Tahaghoghi, S. M. M. & Williams, H. E.
(2004a), Gradual transition detection using av-
erage frame similarity, in ‘Proceedings of the
4th International Workshop on Multimedia Data
and Document Engineering (MDDE-04)’, IEEE
Computer Society, Washington, DC, USA.
Volkmer, T., Tahaghoghi, S. M. M. & Williams, H. E.
(2004b), RMIT University at TRECVID-2004,
in ‘Proceedings of the TRECVID 2004 Work-
shop’, Gaithersburg, Maryland, USA. To ap-
pear.
Watt, A. H. (1989), Fundamentals of Three-
Dimensional Computer Graphics, Addison Wes-
ley, Wokingham, UK.
Williams, J. R. & Amaratunga, K. (1994), ‘Introduc-
tion to wavelets in engineering’, International
Journal for Numerical Methods in Engineering
37(14), 2365–2388.
Witten, I. H., Moffat, A. & Bell, T. C. (1999), Manag-
ing Gigabytes: Compressing and Indexing Doc-
uments and Images, second edn, Morgan Kauf-
mann Publishers Inc.