JPEG2000 Based Scalable Summary for Remote Video Content Browsing and Efficient Semantic Structure Understanding.
ABSTRACT This paper presents a new method for remote and interactive browsing of long video sequences. The solution is based on interactive navigation in a scalable mega image resulting from a JPEG2000-coded keyframe-based video summary. The presented system is compliant with the new JPEG2000 Part 9 "JPIP – JPEG2000 Interactivity, API and Protocol," which lends itself to working under varying channel conditions such as wireless networks. The flexibility offered by JPEG2000 allows the application to interactively highlight keyframes corresponding to the desired content, first within a low-quality, low-resolution version of the full video summary. It then offers fine-grain scalability for the user to navigate and zoom in to particular scenes or events represented by the keyframes. This ability to visualise keyframes of interest and play back the corresponding video shots within the context of the whole sequence enables the user to understand the temporal relations between semantically similar events, i.e. a new way to analyse long video sequences.
7000 Mons, Belgium
E-mail: Jerome.meessen@multitel.be
Broadband Applications Research Centre
BT Research & Venturing
Adastral Park, Ipswich IP5 3RE, UK
Communication and Remote Sensing Lab.
This work was partially supported by the EU FP5 project SCHEMA, IST-2001-32795.
1. Introduction
With the advent of the digital revolution, increasing network connectivity and bandwidth, and the mass production and distribution of rich audio-visual media, end users increasingly demand fast and easy access to video programme summaries in order to browse and visualise desirable content. A video content summary often takes the form of a 2-D presentation on a visualisation interface, made up of selected frames, or keyframes, representing semantically related data chunks, i.e. shots or events. Depending on the media genre and application, many different layouts for keyframe presentation are possible.
However, when the content of a long video sequence in an entertainment genre such as a feature movie or drama is summarized this way, the number of selected keyframes is still far too large. Two problems ensue. First, the semantic story structure is largely buried in the numerous images displayed. Secondly, when a user accesses a summary stored on a remote server, transmitting a large number of keyframes over the network is a major problem. Today, different approaches exist to address these issues.
Building condensed and semantically relevant video summaries has been addressed in the work by Yeung and Yeo and by Chiu et al. However, though these summaries provide a good overview of a sequence, they tend to present the user with only one pre-defined semantic-level illustration of the sequence. Hierarchical shot clustering and presentation is another common approach to browsing either a single video sequence or a database of video sequences. In particular, some shot clustering and browsing methods are interesting because they are evaluated with respect to the amount of data transmitted at each user retrieval request. These are efficient solutions for getting a quick overview of a video sequence and for finding a particular scene of interest. However, they make no provision for contextual visualisation of the links between semantically similar scenes, i.e., for answering user queries like "What happened before and after that particular event?" or "Are there any other similar events taking place in the story, and if so, what are their temporal relations?"
In this paper, we focus on helping the user understand the underlying semantic structure of a video sequence, i.e. on establishing relations between semantically similar scenes and highlighting them within the context of the whole sequence. Rather than proposing a complex shot-clustering strategy or storyboard layout, we exploit the user's intelligence by providing him/her with interactive tools for intuitive navigation in a remotely stored scalable summary. The idea is to exploit the powerful compression and scalable representation features of JPEG2000, the new standard for still image compression, to produce scalable keyframe-based summaries of a video sequence while at the same time allowing semantics-based queries. JPEG2000 allows flexible access to each spatial region of the compressed image, at a different resolution and PSNR quality level, which is particularly suited to browsing very large images. We present here a layered platform compliant with the forthcoming JPEG2000 Part 9 "JPIP – JPEG2000 interactive protocol". While storing only one detailed keyframe-based video summary, or storyboard, this standard communication between server and client provides the means to transmit many different versions of the storyboard efficiently over a network. Moreover, JPIP offers means to adapt the transmission to changing channel conditions and user processing resources, which particularly suits video browsing on mobile devices.
The annotation of shots and scenes of the video summary is based on MPEG-7 visual content description schemes. After the temporal decomposition of a video sequence into segments, i.e. shots or scenes, and the necessary manual annotation of their contents, an MPEG-7 compliant XML description file specifies, for each segment, a number of attributes, including the text annotations (scene, object, action) and timing information: the start and duration of the segment, and the position of the keyframe selected for the shot. The shot annotations allow content-based queries to be translated into image-oriented requests.
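The query-to-request translation can be sketched as follows. The XML used here is a deliberately simplified, hypothetical stand-in for a real MPEG-7 description (element and attribute names are assumptions, not the standard's schema); the logic only illustrates matching query keywords against segment annotations to obtain keyframe indices.

```python
# Minimal sketch: translate a keyword query into keyframe indices by
# searching an MPEG-7-style annotation file. The schema is simplified
# and hypothetical, not actual MPEG-7 syntax.
import xml.etree.ElementTree as ET

ANNOTATION = """<VideoSummary>
  <Segment id="0" start="0.0" duration="12.4" keyframe="0">
    <Annotation>park bench conversation</Annotation>
  </Segment>
  <Segment id="1" start="12.4" duration="8.1" keyframe="1">
    <Annotation>street market</Annotation>
  </Segment>
  <Segment id="2" start="20.5" duration="15.0" keyframe="2">
    <Annotation>park bench kiss</Annotation>
  </Segment>
</VideoSummary>"""

def find_shots(xml_text, keyword):
    """Return keyframe indices of all segments whose annotation
    contains the query keyword (case-insensitive)."""
    root = ET.fromstring(xml_text)
    hits = []
    for seg in root.iter("Segment"):
        if keyword.lower() in seg.findtext("Annotation", "").lower():
            hits.append(int(seg.get("keyframe")))
    return hits

print(find_shots(ANNOTATION, "park"))  # [0, 2]
```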
2. System Framework
This section discusses the two core components underlying the proposed video
content browsing and retrieval system.
2.1. Scalable keyframe-based video summary with annotation
The work flow used to create the coded keyframe-based video summary is as
follows. The original video sequence is segmented into shots, and one keyframe
is selected for each shot to represent its visual content. The representation scheme can be extended to a group of shots (a sub-scene or scene) to avoid redundancy in the displayed visual content, as in the case of the "A Beautiful Mind" video discussed in the experimental section. The keyframe images are then
arranged in raster scanning order to compose a large mosaic image, which is then
JPEG2000 compressed to output two files: the JPEG2000 codestream and its
associated index file as defined in JPEG2000 Part 9 ‘JPIP’. The compression is
done with at least two quality levels, to be able to highlight keyframes of interest,
and with different resolution levels. Moreover, the JPEG2000 coding parameters
are chosen such that each keyframe can be accessed separately.
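The mosaic composition step can be sketched as below. The tile size (352×240) and grid width (16 columns, which happens to yield the 5632-pixel mosaic width of Table 1) are illustrative assumptions, not parameters stated in the paper.

```python
# Sketch of the mosaic layout: keyframes placed in raster-scan order on
# a fixed grid. Tile size and column count are assumed example values.
def mosaic_layout(n_keyframes, tile_w=352, tile_h=240, cols=16):
    """Return (x, y) pixel offsets of each keyframe in the mosaic,
    plus the overall mosaic dimensions."""
    positions = []
    for i in range(n_keyframes):
        row, col = divmod(i, cols)
        positions.append((col * tile_w, row * tile_h))
    rows = -(-n_keyframes // cols)  # ceiling division
    return positions, (cols * tile_w, rows * tile_h)

pos, (w, h) = mosaic_layout(37)
print(pos[0], pos[16], (w, h))  # (0, 0) (0, 240) (5632, 720)
```

Aligning every keyframe to such a grid is also what makes it possible to choose JPEG2000 coding parameters (precinct/tile boundaries) so that each keyframe can be accessed separately.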
Automatic detection of shot boundaries and the selection of keyframe(s) for
each shot / scene are carried out using the segmentation and annotation tool
described in . In case of errors, the shot segmentation results can be
manually edited by splitting and merging shots. The meanings of each shot are
annotated using a set of keywords from a predefined hierarchical lexicon. The
annotations are saved in an MPEG-7 compliant XML file.
2.2. System architecture
Figure 1 presents the proposed client-server system architecture, which is based on the IST PRIAM project.
Figure 1. Client-server system architecture
We consider two types of client requests: the navigation request and the retrieval query.
The navigation requests (zooming, panning, etc.) are translated at the client side into requests for Windows of Interest (WOI). A WOI specifies a spatial region, a quality level and a resolution level. At the server side, the JPEG2000 WOI parser converts this WOI request into a selection of relevant JPEG2000 packets using the codestream index file. These packets contain additional data that improve the quality of the requested regions once transmitted and decoded at the user's side.
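A WOI request might be serialised as a JPIP query along the following lines. The field names fsiz/roff/rsiz/layers follow JPEG2000 Part 9, but the server URL, target name and concrete values are illustrative assumptions.

```python
# Sketch: serialise a client-side WOI as a JPIP-style request string.
from urllib.parse import urlencode

def jpip_request(full_w, full_h, x, y, w, h, layers):
    """Build the query string for a window-of-interest request: fsiz
    gives the desired frame size (i.e. resolution level), roff/rsiz the
    spatial region, layers the number of quality layers to include."""
    params = {
        "target": "summary.jp2",       # assumed codestream name
        "fsiz": f"{full_w},{full_h}",  # requested full-image size
        "roff": f"{x},{y}",            # region offset in the mosaic
        "rsiz": f"{w},{h}",            # region size
        "layers": layers,              # quality layers
    }
    return "http://server/jpip?" + urlencode(params)

# Request one 352x240 keyframe at full resolution, 2 quality layers:
print(jpip_request(5632, 4320, 352, 240, 352, 240, 2))
```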
The retrieval queries, based on keywords from a predefined lexicon, are linked at the server side with shot indices by searching the MPEG-7 annotation file of the video summary. The selected shots are then associated with WOIs. Each WOI's spatial region is defined by the shot's keyframe size and position in the mosaic. The corresponding WOIs specify the highest available quality and resolution levels, so as to highlight the keyframes of interest as much as possible.
A video player module is also implemented to allow the user to play a
particular shot/scene of interest, after browsing through the highlighted images.
3. Experiments
In this section we discuss the application scenarios of the proposed system and
present the experimental results obtained.
3.1. Typical scenario
A typical application scenario is as follows. First, the GUI displays an overview of the JPEG2000-coded mosaic, obtained by decoding only a greyscale version at the lowest resolution and quality levels. The user then selects keywords from the annotation dictionary and requests the server to present certain desired, more detailed contents of the video. The relevant keyframes retrieved by the server are then highlighted by the GUI. Enhancing the visual quality and resolution of keyframes within the initial low-quality overview clearly shows the temporal semantic links among the contents of these shots and scenes. The user can also pan and zoom within the video summary and choose to play the video clip of a particular shot of interest.
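A rough sense of why the initial overview is cheap to transmit: JPEG2000 resolution levels are dyadic, so an overview taken r levels down carries roughly 1/4^r of the pixels of the full mosaic. The figures below are illustrative, not measurements from the paper.

```python
# Sketch: pixel count of a low-resolution JPEG2000 overview. Each
# resolution level halves both dimensions (dyadic wavelet decomposition).
def overview_pixels(width, height, levels_down):
    w, h = width >> levels_down, height >> levels_down
    return w * h

full = overview_pixels(5632, 4320, 0)
small = overview_pixels(5632, 4320, 3)  # 3 resolution levels down
print(small, round(full / small))       # 380160 64
```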
3.2. Preliminary trials
To evaluate the functionality of the prototype system, experiments were performed on two movie excerpts: one from "Notting Hill" and the other from "A Beautiful Mind". Table 1 specifies the attributes of the two video sequences, including their respective keyframe-based video summaries and the compressed summary sizes. A snapshot of the system in action is shown in Figure 2.
Table 1. Characteristics of the chosen video excerpts and the associated keyframe-based video summaries.

                           "Notting Hill"   "A Beautiful Mind"
  No. of keyframes
  Summary dimensions       5632×4320        4576×3120
  Compressed summary size