JPEG2000 BASED SCALABLE SUMMARY FOR REMOTE VIDEO
CONTENT BROWSING AND EFFICIENT SEMANTIC STRUCTURE
7000 Mons, Belgium
E-mail: Jerome.meessen@multitel.be
Broadband Applications Research Centre
BT Research & Venturing
Adastral Park, Ipswich IP5 3RE, UK
Communication and Remote Sensing Lab.
This paper presents a new method for remote and interactive browsing of long
video sequences. The solution is based on interactive navigation in a scalable
mega image resulting from a JPEG2000 coded keyframe-based video summary.
The presented system is compliant with the new JPEG2000 part 9 “JPIP –
JPEG2000 Interactivity, API and Protocol,” which lends itself to working
under varying channel conditions such as wireless networks. The flexibility
offered by JPEG2000 allows the application first to highlight interactively the
keyframes corresponding to the desired content within a low-quality and
low-resolution version of the full video summary. It then offers fine-grain
scalability for the user to navigate and zoom in on particular scenes or events
represented by the keyframes. The ability to visualise keyframes of interest
and play back the corresponding video shots within the context of the whole
sequence enables the user to understand the temporal relations between
semantically similar events, providing a new way to analyse long video sequences.
This work was partially supported by the EU FP5 project SCHEMA, IST-2001-32795.
With the advent of the digital revolution, increasing network connectivity and
bandwidth, and the mass production and distribution of rich audio-visual
media, end users increasingly demand fast and easy access to video program
summaries in order to browse and visualize desirable content. A video content
summary often takes the form of a 2-D presentation on a visualisation
interface, made up of selected frames, or keyframes, representing semantically
related data chunks, i.e. shots or events. Depending on the media genre and
application, many different layouts for keyframe presentation are possible.
However, when the content of a long video sequence is summarized this way for
entertainment genres such as feature movies or drama, the number of selected
keyframes is still far too large. Two problems ensue. First, the semantic
story structure is largely buried in the numerous images displayed. Second,
when a user accesses a summary stored on a remote server, the transmission of
a large number of keyframes over the network is a major problem. Different
approaches exist today to address these issues.
Condensed and semantically relevant video summaries have been built in the
work by Yeung and Yeo and by Chiu et al. Although these summaries provide a
good overview of a sequence, they tend to present the user with only one
pre-defined semantic level of illustration of the sequence.
Hierarchical shot clustering and presentation is another common approach to
browsing either a single video sequence or a database of video sequences.
Some of these shot clustering and browsing methods are particularly
interesting since they are evaluated with respect to the amount of data
transmitted at each user retrieval request. They are efficient solutions for
getting a quick overview of a video sequence and for finding a particular scene
of interest. However, they make no provision for contextual visualization of the
links between semantically similar scenes, i.e. to answer user queries such as
“What happened before and after that particular event?” or “Are there any other
similar events taking place in the story, and if so, what are their temporal
relations?”
In this paper, we focus on helping the user to understand the underlying
semantic structure of a video sequence, i.e. to establish relations between
semantically similar scenes and highlight them within the context of the whole
sequence. Rather than proposing a complex shot clustering strategy or
storyboard layout, we exploit the user’s intelligence by providing him or her
with interactive tools for intuitive navigation in a remotely stored scalable
summary.
The idea is to exploit the powerful compression and scalable representation
features of JPEG2000, the new standard for still image compression, to produce
scalable keyframe-based summaries of a video sequence while at the same time
allowing semantics-based queries. JPEG2000 allows flexible access to each
spatial region of the compressed image at different resolution and PSNR
quality levels. This is particularly suited to browsing very large images. We
present here a layered platform compliant with the forthcoming JPEG2000 Part 9
“JPIP – JPEG2000 interactive protocol”. While only one detailed keyframe-based
video summary, or storyboard, is stored, this standard communication between
server and client provides the means to transmit many different versions of
the storyboard efficiently over a network. Moreover, JPIP offers means to
adapt the transmission to changing channel conditions, allowing efficient
transmission of the summary data across different channel conditions and user
processing resources. This particularly suits video browsing on mobile devices.
The annotation of shots and scenes of the video summary is based on
MPEG-7 visual content description schemas. After the temporal
decomposition of a video sequence into segments, i.e. shots or scenes, and the
necessary manual annotation of their contents, an MPEG-7 compliant XML
description file specifies a number of attributes for each of these segments,
including the text annotations (scene, object, action) and time information –
the start and duration of the segment, and the position of the keyframe
selected for the shot. The annotations of shots allow content-based queries to
be translated into image-oriented requests.
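The per-segment description outlined above can be illustrated with a minimal sketch. The XML layout below is a deliberately simplified, hypothetical stand-in and not the actual MPEG-7 schema: the element names (`Segment`, `Start`, `Duration`, `KeyframeIndex`, `Keyword`) and the helper functions are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical, simplified description file (NOT the real MPEG-7 schema).
SAMPLE_XML = """
<Summary>
  <Segment id="shot0">
    <Start>0</Start><Duration>120</Duration><KeyframeIndex>0</KeyframeIndex>
    <Keyword>street</Keyword><Keyword>walking</Keyword>
  </Segment>
  <Segment id="shot1">
    <Start>120</Start><Duration>75</Duration><KeyframeIndex>1</KeyframeIndex>
    <Keyword>office</Keyword>
  </Segment>
  <Segment id="shot2">
    <Start>195</Start><Duration>240</Duration><KeyframeIndex>2</KeyframeIndex>
    <Keyword>street</Keyword><Keyword>car</Keyword>
  </Segment>
</Summary>
"""


@dataclass(frozen=True)
class Segment:
    seg_id: str            # identifier of the shot or scene
    start: int             # first frame of the segment
    duration: int          # length of the segment in frames
    keyframe_index: int    # position of the keyframe in the summary
    keywords: frozenset    # text annotations, lower-cased


def parse_segments(xml_text):
    """Read every <Segment> element into a Segment record."""
    root = ET.fromstring(xml_text.strip())
    segments = []
    for node in root.findall("Segment"):
        segments.append(Segment(
            seg_id=node.get("id"),
            start=int(node.findtext("Start")),
            duration=int(node.findtext("Duration")),
            keyframe_index=int(node.findtext("KeyframeIndex")),
            keywords=frozenset(k.text.strip().lower()
                               for k in node.findall("Keyword")),
        ))
    return segments


def segments_matching(segments, keyword):
    """Return the ids of the segments annotated with the given keyword."""
    wanted = keyword.strip().lower()
    return [s.seg_id for s in segments if wanted in s.keywords]
```

Calling `segments_matching(parse_segments(SAMPLE_XML), "street")` selects the segments whose keyframes would later be requested from the image; a real deployment would read the namespaced MPEG-7 elements instead of this toy layout.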
2. System Framework
This section discusses the two core components underlying the proposed video
content browsing and retrieval system.
2.1. Scalable keyframe-based video summary with annotation
The workflow used to create the coded keyframe-based video summary is as
follows. The original video sequence is segmented into shots, and one keyframe
is selected for each shot to represent its visual content. The representation
scheme can be extended to a group of shots, a sub-scene or a scene to avoid
redundancy in the displayed visual content, as in the case of the “A Beautiful
Mind” video discussed in the experimental section. The keyframe images are then
arranged in raster scanning order to compose a large mosaic image, which is then
JPEG2000 compressed to output two files: the JPEG2000 codestream and its
associated index file as defined in JPEG2000 Part 9 ‘JPIP’. The compression is
done with at least two quality levels, to be able to highlight keyframes of interest,
and with different resolution levels. Moreover, the JPEG2000 coding parameters
are chosen such that each keyframe can be accessed separately.
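The raster-scan arrangement of keyframes can be sketched as follows, assuming that all keyframes share the same size and that the mosaic has a fixed number of columns. The function names and the numeric values in the usage note are hypothetical and are not taken from the experiments.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: int        # left edge in mosaic pixels
    y: int        # top edge in mosaic pixels
    width: int
    height: int


def mosaic_size(num_keyframes, columns, tile_w, tile_h):
    """Pixel dimensions of a mosaic filled row by row (raster scan)."""
    if num_keyframes <= 0 or columns <= 0:
        raise ValueError("num_keyframes and columns must be positive")
    rows = math.ceil(num_keyframes / columns)
    return columns * tile_w, rows * tile_h


def keyframe_rect(index, columns, tile_w, tile_h):
    """Region occupied by the keyframe with the given raster-scan index."""
    if index < 0:
        raise ValueError("index must be non-negative")
    row, col = divmod(index, columns)
    return Rect(col * tile_w, row * tile_h, tile_w, tile_h)
```

For instance, 25 keyframes of 320×240 pixels laid out in 10 columns give a 3200×720 mosaic, and `keyframe_rect(13, 10, 320, 240)` returns the region starting at (960, 240). The same index-to-region mapping is what allows a shot index to be turned into a spatial request on the compressed image.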
Automatic detection of shot boundaries and the selection of keyframe(s) for
each shot or scene are carried out using an existing segmentation and
annotation tool. In case of errors, the shot segmentation results can be
manually edited by splitting and merging shots. The meaning of each shot is
annotated using a set of keywords from a predefined hierarchical lexicon. The
annotations are saved in an MPEG-7 compliant XML file.
2.2. System architecture
Figure 1 presents the proposed client-server system architecture, which is based
on the IST PRIAM project.
Figure 1. Client-server system architecture
We consider two types of client requests: the navigation request and the
retrieval query.
The navigation requests (zooming, panning, etc.) are translated at the client
side into requests for Windows of Interest (WOI). A WOI specifies a
spatial region, a quality level and a resolution level. At the server side, the
JPEG2000 WOI parser converts this WOI request into a selection of the relevant
JPEG2000 packets using the codestream index file. Once transmitted and decoded
at the user’s side, these packets provide additional data that improve the
quality of the requested regions.
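As an illustration of how a WOI might be serialised into a request, the sketch below builds a query string using the view-window field names `target`, `fsiz`, `roff`, `rsiz` and `layers` as we understand them from JPIP; the exact syntax should be checked against the standard. The `WOI` class, the function name and the file name are hypothetical, and the rounding rules are a simplifying assumption.

```python
import math
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass(frozen=True)
class WOI:
    x: int                # region origin, in full-resolution mosaic pixels
    y: int
    width: int            # region size, in full-resolution mosaic pixels
    height: int
    discard_levels: int   # resolution levels dropped (0 = full resolution)
    layers: int           # number of quality layers requested


def woi_to_jpip_query(woi, full_w, full_h, target):
    """Express a WOI as a query string on the reduced-resolution grid."""
    if woi.discard_levels < 0 or woi.layers < 1:
        raise ValueError("invalid resolution or quality level")
    scale = 2 ** woi.discard_levels
    # Frame size of the whole image at the requested resolution.
    fx, fy = math.ceil(full_w / scale), math.ceil(full_h / scale)
    # Region offset and size, mapped onto the same reduced grid.
    ox, oy = woi.x // scale, woi.y // scale
    sx = math.ceil((woi.x + woi.width) / scale) - ox
    sy = math.ceil((woi.y + woi.height) / scale) - oy
    params = {
        "target": target,
        "fsiz": f"{fx},{fy}",
        "roff": f"{ox},{oy}",
        "rsiz": f"{sx},{sy}",
        "layers": str(woi.layers),
    }
    # Keep commas unescaped so that coordinate pairs stay readable.
    return urlencode(params, safe=",")
```

With a 3200×720 mosaic, `woi_to_jpip_query(WOI(960, 240, 320, 240, 2, 1), 3200, 720, "summary.jp2")` yields `target=summary.jp2&fsiz=800,180&roff=240,60&rsiz=80,60&layers=1`, i.e. a quarter-resolution, single-layer view of one keyframe. A zoom then amounts to lowering `discard_levels`, and a pan to shifting `x` and `y`.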
The retrieval queries, based on keywords from a predefined lexicon, are
linked at the server side with shot indices by searching through the MPEG-7
annotation file of the video summary. The selected shots are then associated
with WOIs. The spatial region of each WOI is defined by the size and position
of the shot’s keyframe in the mosaic. These WOIs specify the highest available
quality and resolution levels so as to highlight the keyframes of interest as
much as possible.
A video player module is also implemented to allow the user to play a
particular shot/scene of interest, after browsing through the highlighted images.
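The mapping from a retrieval query to a set of WOIs can be sketched as below. This is a minimal illustration under stated assumptions: annotations are given as a dictionary from shot index to keywords, a shot matches when it carries at least one queried keyword, keyframes are equally sized and laid out in raster-scan order, and `max_layers` stands for the highest available quality level. None of these names come from the actual system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WOI:
    x: int
    y: int
    width: int
    height: int
    discard_levels: int   # 0 means full resolution
    layers: int           # number of quality layers requested


def highlight_wois(annotations, query_keywords, columns,
                   tile_w, tile_h, max_layers):
    """Return one full-quality, full-resolution WOI per matching shot.

    annotations maps a shot index (its raster-scan position in the
    mosaic) to an iterable of keywords.
    """
    wanted = {k.strip().lower() for k in query_keywords}
    wois = []
    for shot_index in sorted(annotations):
        shot_keywords = {k.strip().lower() for k in annotations[shot_index]}
        if wanted & shot_keywords:          # assumed OR semantics
            row, col = divmod(shot_index, columns)
            wois.append(WOI(col * tile_w, row * tile_h,
                            tile_w, tile_h, 0, max_layers))
    return wois
```

Because the returned list follows the shot order, the highlighted keyframes keep their temporal position in the overview, which is what lets the user read the relations between similar events directly from the mosaic.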
In this section we discuss the application scenarios of the proposed system and
present the experimental results obtained.
3.1. Typical scenario
A typical application scenario is described as follows. First, the GUI displays an
overview of the JPEG2000 coded mosaic, which is obtained by decoding only a
greyscale version of the lowest resolution and quality levels. The user then
selects keywords from the annotation dictionary and requests the server to
present the desired content of the video in more detail. The relevant
keyframes retrieved by the server are then highlighted by the GUI.
Enhancing the visual quality and resolution of these keyframes within the
initial low-quality overview clearly shows the temporal semantic links among
the contents of the corresponding shots and scenes. The user can also pan and
zoom in the video summary and choose to play the video clip of a particular
shot of interest.
3.2. Preliminary trials
To evaluate the functionality of the prototype system, experiments were
performed on two movie excerpts, one from “Notting Hill” and the other from
“A Beautiful Mind”. Table 1 specifies the attributes of the two video
sequences, including their respective keyframe-based video summaries and the
compressed summary sizes. A snapshot of the system in action is shown in Figure 2.
Table 1. Characteristics of the chosen video excerpts and the associated keyframe-based
No. of keyframes
“A Beautiful Mind”
Summary dimensions 5632×4320 4576×3120
Compressed summary size