Continuous Media Web: Hyperlinking, Search & Retrieval of
Time-Continuous Data on the Web
Silvia Pfeiffer, Conrad Parker, André Pang
CSIRO ICT Centre
Locked Bag 17
North Ryde NSW 1670
Australia
Silvia Pfeiffer: Ph: +61 2 9325 3141, Fax: +61 2 9325 3200, Silvia.Pfeiffer@csiro.au
Conrad Parker: Ph: +61 2 9325 3133, Fax: +61 2 9325 3200, Conrad.Parker@csiro.au
André Pang: Ph: +61 2 9325 3100, Fax: +61 2 9325 3200, Andre.Pang@csiro.au
Continuous Media Web: Hyperlinking, Search & Retrieval of
Time-Continuous Data on the Web
ABSTRACT
The Continuous Media Web project has developed a technology to extend the Web to time-
continuously sampled data enabling seamless searching and surfing with existing Web tools.
This chapter discusses requirements for such an extension of the Web, contrasts existing
technologies and presents the Annodex technology, which enables the creation of Webs of audio
and video documents. To encourage uptake, the specifications of the Annodex technology have
been submitted to the IETF for standardisation and open source software is made available
freely. The Annodex technology permits an integrated means of searching, surfing, and
managing a World Wide Web of textual and media resources.
KEYWORDS
Information and Communication Technologies, Internet Technologies, Multimedia, Server
Technology, Technology Infrastructure, Telecommunications Technology, Web Technology,
Data Retrieval, Document Navigation System, Hypermedia, Indexing, Metadata, Web
Architecture, Data Organization, XML.
INTRODUCTION
Nowadays, the main source of information is the World Wide Web. Its HTTP (Fielding et al.,
1999), HTML (World Wide Web Consortium, 1999B), and URI (Berners-Lee et al., 1998)
standards have enabled a scalable, networked repository of any sort of information that people
care to publish in textual form. Web search engines have enabled humanity to search for any
information on any public Web server around the world. URI hyperlinks in HTML documents
have enabled surfing to related information, giving the Web its full power. Repositories of
information within organisations are also building on these standards for much of their internal
and external information dissemination.
While Web searching and surfing have become a natural way of interacting with textual
information and accessing its semantic content, nothing comparable is possible with media.
Media on the Web is cumbersome to use: it is handled as dark matter that cannot be searched
through Web search engines, and once a media document is accessed, only linear viewing is
possible – no browsing or surfing to other semantically related documents.
Multimedia research in recent years has recognised this issue. One means to enable search on
media documents is to automate the extraction of content, store the content as index
information, and provide search facilities through that index information. This has led to
extensive research on the automated extraction of metadata from binary media data, aiming at
bridging the semantic gap between automatically extracted low level image, video, and audio
features, and the high level of semantics that humans perceive when viewing such material (see
e.g. Dimitrova et al, 2002).
It is now possible to create and store a large amount of metadata and semantic content from
media documents – be that automatically or manually. But how do we exploit such a massive
amount of information in a standard way? What framework can we build to satisfy the human
need to search for content in media, to quickly find and access it for reviewing, and to manage
and reuse it in an efficient way?
As the Web is the most commonly used means for information access, we decided to develop a
technology for time-continuous documents that enables their seamless integration into the Web's
searching and surfing. Our research is thus extending the World Wide Web with its familiar
information access infrastructure to time-continuous media such as audio and video, creating a
"Continuous Media Web".
Particular aims of our research are:
to enable the retrieval of relevant clips of time-continuous documents through familiar
textual queries in Web search engines,
to enable the direct addressing of relevant clips of time-continuous documents through
familiar URI hyperlinks,
to enable hyperlinking to other relevant and related Web resources while reviewing a time-
continuous document, and
to enable automated reuse of clips of time-continuous documents.
This chapter presents our developed Annodex (annotation and indexing) technology, the
specifications of which have been published at the IETF (Internet Engineering Task Force) as
Internet-Drafts for the purposes of international standardisation. Implementations of the
technology are available at http://www.annodex.net/. In the next section we present related
works and their shortcomings with respect to our aims. We then explain the main principles to
which our research and development work adheres. The subsequent section provides a technical
description of the Continuous Media Web (CMWeb) project and thus forms the heart of this
book chapter. We round it off with a view on research opportunities created by the CMWeb, and
conclude the paper with a summary.
BACKGROUND
The World Wide Web was created by three core technologies (Berners-Lee et al., 1999):
HTML, HTTP, and URIs. They respectively enable:
the markup of textual data integrated with the data itself giving it a structure, metadata, and
outgoing hyperlinks,
the distribution of Web documents over the Internet, and
the hyperlinking to and into Web documents.
In an analogous way, what is required to create a Web of time-continuous documents is:
a markup language to create addressable structure, searchable metadata, and outgoing
hyperlinks for a continuous media document,
an integrated document format that can be distributed via HTTP making use of existing
caching HTTP proxy infrastructure,
and a means to hyperlink into a continuous media document.
One would expect that the many existing standardisation efforts in multimedia cover these
requirements. However, while the required pieces may exist, they are not packaged and
optimised for addressing the issues and for solving them in such a way as to make use of the
existing Web infrastructure with the least necessary adaptation efforts. Here we look at the three
most promising standards: SMIL, MPEG-7, and MPEG-21.
SMIL
The W3C's SMIL (World Wide Web Consortium, 2001), short for "Synchronized Multimedia
Integration Language", is an XML markup language used for authoring interactive multimedia
presentations. A SMIL document describes the sequence of media documents to play back,
including conditional playback, loops, and automatically activated hyperlinks. SMIL has
outgoing hyperlinks and elements that can be addressed inside it using XPath (World Wide Web
Consortium, 1999A) and XPointer (World Wide Web Consortium, 2002).
Features of SMIL cover the following modules:
1. Animation: provides for incorporating animations onto a time line.
2. Content Control: provides for runtime content choices and prefetch delivery.
3. Layout: allows positioning of media elements on the visual rendering surface and control of
audio volume.
4. Linking: allows navigation through the SMIL presentation that can be triggered by user
interaction or other triggering events. SMIL 2.0 provides only for in-line link elements.
5. Media Objects: describes media objects that come in the form of hyperlinks to animations,
audio, video, images, streaming text, or text. Restrictions of continuous media objects to
temporal subparts (clippings) are possible, and short and long descriptions may be attached
to a media object.
6. Metainformation: allows description of SMIL documents and attachment of RDF metadata
to any part of the SMIL document.
7. Structure: structures a SMIL document into a head and a body part, where the head part
contains information that is not related to the temporal behaviour of the presentation and the
body tag acts as a root for the timing tree.
8. Timing and Synchronization: provides for different choreographing of multimedia content
through timing and synchronization commands.
9. Time Manipulation: allows manipulation of the time behaviour of a presentation, such as
control of the speed or rate of time for an element.
10. Transitions: provides for transitions such as fades and wipes.
11. Scalability: provides for the definition of profiles of SMIL modules (1-10) that meet the
needs for a specific class of client devices.
SMIL is designed for creating interactive multimedia presentations, not for setting up Webs of
media documents. A SMIL document may result in a different experience for every user and
therefore is not a single, temporally addressable time-continuous document. Thus, addressing
temporal offsets does not generally make sense on a SMIL document.
SMIL documents cannot generally be searched for clips of interest as they don’t typically
contain the information required by a Web search engine: SMIL does not focus on including
metadata, annotations and hyperlinks, thus it does not provide for the information necessary to
be crawled and indexed by a search engine.
In addition, SMIL does not integrate the media documents required for its presentation in one
single file, but instead references them from within the XML file. All media data is only
referenced and there is no transport format for a presentation that includes all the relevant
metadata, annotations, and hyperlinks interleaved with the media data to provide a streamable
format. This would not make sense anyway as some media data that is referenced in a SMIL file
may never be viewed by users as they may never activate the appropriate action. The media
streams of a SMIL interaction are transported on connections separate from the initial SMIL
file, requiring the client to perform all the media synchronization tasks, and proxy caching can
happen only on each file separately, not on the complete interaction.
Note, however, that a single SMIL interaction, if recorded during playback, can become a single
time-continuous media document, which can be treated with our Annodex technology to enable
it to be searched and surfed. This may be interesting for archiving and digital record-keeping.
MPEG-21
The ISO/MPEG's MPEG-21 (Burnett et al., 2003) standard is building an open framework for
multimedia delivery and consumption. It thus focuses on addressing how to generically describe
a set of content documents that belong together from a semantic point of view, including all the
information necessary to provide services on these digital items. This set of documents is called
a Digital Item, which is a structured XML representation of a work, including identification
and metadata information.
The representation of a Digital Item may be composed of the following descriptors:
1. Container: is a structure that groups items and/or containers.
2. Item: a group of subitems and/or components bound to relevant descriptors.
3. Component: binds a resource to a set of descriptors including control or structural
information of the resource.
4. Anchor: binds a descriptor to a fragment of a resource.
5. Descriptor: associates information (i.e. text or a component) with the enclosing element.
6. Condition: makes the enclosing element optional and links it to the selection(s) that affect its
inclusion.
7. Choice: is a set of related sections that can affect an item’s configuration.
8. Selection: is a specific decision that affects one or more conditions somewhere within an
item.
9. Annotation: is a set of information about an identified element of the model.
10. Assertion: is a fully or partially configured state of choice by asserting true/false/undecided
for predicates associated with the selections for that choice.
11. Resource: is an individually identifiable asset such as a video clip, audio clip, image or
textual asset, or even a physical object, locatable via an address.
12. Fragment: identifies a specific point or range within a resource.
13. Statement: is a literal text item that contains information, but is not an asset.
14. Predicate: is an identifiable declaration that can be true/false/undecided.
MPEG-21 further provides for the handling of rights associated with Digital Items, and for the
adaptation of Digital Items to usage environments.
As an example of a Digital Item, consider a music CD album. When it is turned into a digital
item, the album is described in an XML document that contains references to the cover image,
the text on the CD cover, the text on an accompanying brochure, references to a set of audio
files that contain the songs on the CD, ratings of the album, rights associated with the album,
information on the different encoding formats in which the music can be retrieved, different
bitrates that can be supported when downloading etc. This description supports the handling of a
digital CD album as an object: it allows you to manage it as an entity, describe it with metadata,
exchange it with others, and collect it as an entity.
An MPEG-21 document does not typically describe just one time-continuous document, but
rather several. These descriptions are temporally addressable and hyperlinks can go into and out
of them. Metadata can be attached to the descriptions of the documents making them searchable
and indexable for search engines.
As can be seen, MPEG-21 addresses the problem of how to handle groups of files rather than
focusing on the markup of a single media file, and therefore does not address how to directly
link into time-continuous Web resources themselves. There is an important difference between
linking into and out of descriptions of a time-continuous document and linking into and out of a
time-continuous document itself – integrated handling provides for cacheability and for direct
URI access.
The aims of MPEG-21 are orthogonal to the aims that we pursue. While MPEG-21 enables a
better handling of collections of Web resources that belong together in a semantic way,
Annodex enables a more detailed handling of time-continuous Web resources only. Annodex
provides a granularity of access into time-continuous resources that an MPEG-21 Digital Item
can exploit in its descriptions of collections of Annodex and other resources.
MPEG-7
ISO/MPEG's MPEG-7 (Martinez et al., 2002) standard is an open framework for describing
multimedia entities, such as image, video, audio, audio–visual, and multimedia content. It
provides a large set of description schemes to create markup in XML format.
MPEG-7 description schemes can provide the following features:
1. Specification of links and locators (such as time, media locators, and referencing description
tools).
2. Specification of basic information such as people, places, textual annotations, controlled
vocabularies, etc.
3. Specification of the spatio-temporal structure of multimedia content.
4. Specification of audio and visual features of multimedia content.
5. Specification of the semantic structure of multimedia content.
6. Specification of the multimedia content type and format for management.
7. Specification of media production information.
8. Specification of media usage (rights, audience, financial) information.
9. Specification of classifications for multimedia content.
10. Specification of user information (user description, user preferences, usage history).
11. Specification of content entities (still region, video/audio/audio-visual segments, multimedia
segment, ink content, structured collections).
12. Specification of content abstractions (semantic descriptions, media models, media
summaries, media views, media variations).
The main intended use of MPEG-7 is for describing multimedia assets such that they can be
queried or filtered. Just like SMIL and MPEG-21, the MPEG-7 descriptions are regarded as
completely independent of the content itself.
An MPEG-7 document is an XML file that contains any sort of meta information related to a
media document. While the temporal structure of a media document can be represented, this is
not the main aim of MPEG-7 and not typically the basis for attaching annotations and
hyperlinks. This is the exact opposite of our approach, where the basis is the media document
and its temporal structure. Much MPEG-7 markup is in fact not time-related and thus does not
describe media content at the granularity we focus on. Also, MPEG-7 does not attempt to create
a temporally interleaved document format that integrates the markup with the media data.
Again, the aims of MPEG-7 and Annodex are orthogonal. As MPEG-7 is a format that focuses
on describing collections of media assets, it is a primarily database-driven approach towards the
handling of information, while Annodex comes from a background of Web-based, and therefore
network-based, handling of media streams. A specialisation (or, in MPEG-7 terms, a profile) of
MPEG-7 description schemes may allow the creation of annotations similar to the ones
developed by us, but the transport-based interleaved document format that integrates the markup
with the media data in a streamable fashion is not generally possible with MPEG-7 annotations.
Annotations created in MPEG-7 may however be referenced from inside an Annodex format
bitstream, and some may even be included directly into the markup of an Annodex format
bitstream through the "meta" and "desc" tags.
THE CHALLENGE
The technical challenge for the development of Annodex (Annodex.net, 2004) was the creation
of a solution to the three issues presented in the Background section:
Create an HTML-like markup language for time-continuous data,
that can be interleaved with the media stream to create a searchable media document, and
create a means to hyperlink by temporal offset into the time-continuous document.
We have developed three specifications:
1. CMML (Pfeiffer, Parker, and Pang, 2003A), the "Continuous Media Markup Language"
which is based on XML and provides tags to mark up time-continuous data into sets of
annotated temporal clips. CMML draws upon many features of HTML.
2. Annodex (Pfeiffer, Parker, and Pang, 2003B), the binary stream format to store and transmit
interleaved CMML and media data.
3. temporal URIs (Pfeiffer, Parker, and Pang, 2003C), which enable hyperlinking to
temporally specified sections of an Annodex resource.
Aside from the above technical requirements, the development of these technologies has been
led by several principles and non-technical requirements. It is important to understand these
constraints as they have strongly influenced the final format of the solution.
Hook into existing Web Infrastructure:
The Annodex technologies have been designed to hook straight into the existing Web
infrastructure with as few adaptations as possible. Also, the scalability of the Web
must not be compromised by the solution. Thus, CMML is very similar to HTML, temporal
URI queries are CGI (Common Gateway Interface) style parameters (NCSA HTTPd
Development Team, 1995), temporal URI fragments are like HTML fragments, and Annodex
streams are designed to be cacheable by Web proxies.
Open Standards:
The aim of the CMWeb project is to extend the existing World Wide Web to time-continuous
data such as audio and video and create a more powerful networked world-wide infrastructure.
Such a goal can only be achieved if the different components that make up the infrastructure
interoperate even when created by different providers. Therefore, all core specifications are
being published as open international standards with no restraining patent issues.
The Annodex Trademark:
For an open standard, interoperability of different implementations is crucial to its success. Any
implementation that claims to implement the specification but is not conformant and thus not
interoperable will be counterproductive to the creation of a common infrastructure. Therefore,
registering a Trademark on the word "Annodex" enables us to stop non-conformant
implementations from claiming to be conformant by using the same name.
Free media codecs:
For the purposes of standardisation it is important to encourage Internet-wide use of media
codecs for which no usage restrictions exist. The codecs must be legal to use in all
Internet connected devices and compatible with existing Web infrastructure. This however does
not mean that the technology is restricted to specific codecs – on the contrary: Annodex works
for any time-continuously sampled digital data file. Please also note that we do not develop
codecs ourselves, but rather provide recommendations for which codecs to support.
Open Source:
Open standards require reference implementations that people can learn from and make use of
for building up the infrastructure. Therefore, the reference software should be published as open
source software. According to Tim Berners-Lee this was essential to the development and
uptake of the Web (Berners-Lee et al., 1999).
Device Independence:
As convergence of platforms continues, it is important to design new formats such that they can
easily be displayed and interacted with on any networked device, be that on a huge screen, or on
a small handheld device screen. Therefore, Annodex is being designed to work independently of
any specific features an end device may have.
Generic Metadata:
Metadata for time-continuous data can come in many different structured or unstructured
schemes. It can be automatically extracted or manually authored, follow the standard Dublin
Core metadata scheme (Dublin Core Metadata Initiative, 2003), or a company specific metadata
scheme. Therefore, it is important to specify the metadata types in a generic manner to allow
free text and any set of name-value pairs as metadata. It must be possible to develop more
industry-specific sets of metadata schemes later and make full use of them in Annodex.
Simplicity:
Above all, the goal of Annodex is to create a very simple set of tools and formats for enabling
time-continuous Web resources with the same powerful means of exploration as text content on
the Web.
These principles stem from a desire to make simple standards that can be picked up and
integrated quickly into existing Web infrastructure.
THE SOLUTION
Surfing and Searching
The twofold aim of Annodex is to enable Web users to:
view, access and hyperlink between clips of time-continuous documents in the same simple,
but powerful way as HTML pages, and
search for clips of time-continuous documents through the common Web search engines,
and to retrieve clips relevant to their query.
Figures 1 and 2 show screen shots of an Annodex Web browser and an Annodex search engine.
The Annodex browser's main window displays the media data, typical Web browser buttons and
fields (at the top), typical media transport buttons (at the bottom), and a representative image
(also called a keyframe) and hyperlink for the currently displayed media clip. The story board
next to the main window displays the list of clips that the current resource consists of, enabling
direct access to any clip in this table of contents. The separate window on the top right displays
the free-text annotation stored in the description for the current clip, while the one on the lower
right displays the structured metadata stored for the resource or for the current clip. When
crawling this particular resource, a Web search engine can index all this textual information.
Figure 1: Browsing a video about CSIRO astronomy research
The Annodex search engine displayed in Figure 2 is a standard Web search engine extended
with the ability to crawl and index the markup of Annodex resources. It retrieves clips that are
relevant to the user’s query and presents ranked search results based on the relevance of the
markup of the clips. The keyframe of the clip and its description are displayed.
Figure 2: Searching for "radio galaxies" in CSIRO's science CMWeb
Architecture Overview
As the Continuous Media Web technology has to be interoperable with existing Web
technology, its architecture must be the same as that of the World Wide Web (see Figure 3): a
Web client issues a URI request over HTTP to a Web server, which resolves it and serves
the requested resource back to the client. In the case where the client is a Continuous Media
Web browser, the request will be for an Annodex file, which contains all the relevant markup
and media data to display the content to the user. In the case where the client is a Web crawler
(e.g. part of a Web search engine), the client may add an HTTP "Accept" request header
with a preference for receiving only the CMML markup and not the media data. This is possible
because the CMML markup represents all the textual content of an Annodex file and is thus a
thin representation of the full media data. In addition it is a bandwidth-friendly means of
crawling and indexing media content, which is very important for scalability of the solution.
Figure 3: The Continuous Media Web Architecture
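To illustrate this content type negotiation, the following sketch shows what the HTTP exchange
between a crawler and an Annodex server might look like (the resource name is taken from the
examples later in this chapter; the exact headers depend on the server implementation):

GET /galaxies.anx HTTP/1.1
Host: www.annodex.net
Accept: text/x-cmml, application/x-annodex;q=0.5

HTTP/1.1 200 OK
Content-Type: text/x-cmml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cmml> ... </cmml>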
The Annodex file format
Annodex is the format in which media with interspersed CMML markup is transferred over the
wire. Analogous to a normal Web server offering a collection of HTML pages to clients, an
Annodex server offers a collection of Annodex files. After a Web client has issued a URI
request for an Annodex resource, the Web server delivers the Annodex resource, or an
appropriate subpart of it according to the URI query parameters.
Annodex files conceptually consist of multiple media streams and one CMML annotation
stream, interleaved in a temporally synchronised way. The annotation stream may contain
several sets of clips that provide alternative markup tracks for the Annodex file. The media
streams may be complementary, such as an audio track with a video track, or alternative, such as
two speech tracks in different languages. Figure 4 shows an example Annodex file with three
media tracks (light coloured bars) and an annotation track with a header describing the complete
file (dark bar at the start) and several interspersed clips.
Figure 4: An example Annodex file, time increasing from left to right.
One way to author Annodex files is by creating a CMML markup file and encoding the media
data together with the markup based on the authoring instructions found in the CMML file.
Figure 5 displays the principle of the Annodex file creation process: the header information of
the CMML file and the media streams are encoded at the start of the Annodex file, while the
clips and the actual encoded media data are appended thereafter in a temporally interleaved
fashion.
Figure 5: The Annodex file creation process
The choice of a binary encapsulation format for Annodex files was one of the challenges of the
CMWeb project. We examined several different encapsulation formats and came up with a list
of requirements:
1. the format had to provide framing for binary media data and XML markup,
2. temporal synchronisation between media data and XML markup was necessary,
3. the format had to provide a temporal track paradigm for interleaving,
4. the format had to have streaming capabilities,
5. for fault tolerance, resynchronisation after a parsing error should be simple,
6. seeking landmarks were necessary to allow random access,
7. the framing information should only yield a small overhead, and
8. the format needed to be simple to allow handling on devices with limited capabilities.
Hierarchical formats like MPEG-4 and QuickTime did not qualify due to requirement 2, making
it hard to also provide for requirements 3 and 4. An XML based format also did not qualify
because binary data cannot be included in XML tags without encoding it in base64,
which inflates the data size by about 30% and creates unnecessary additional encoding and
decoding steps, thus violating requirements 1 and 7. After some discussion, we adopted the
Ogg encapsulation format (Pfeiffer, 2003) developed by Xiphophorus (Xiphophorus, 2004).
That gave us the additional advantage of having Open Source libraries available on all major
platforms, much simplifying the task of rolling out format support.
The Continuous Media Markup Language CMML
CMML is simple to understand, as it is HTML-like, though oriented towards a segmentation of
continuous data along its time axis into clips. A sample CMML file is given below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE cmml SYSTEM "cmml.dtd">
<cmml>
<stream timebase="0" utc="20040114T153500.00Z">
<import src="galaxies.mpg" contenttype="video/mpeg" start="npt:0"/>
</stream>
<head>
<title>Hidden Galaxies</title>
<meta name="author" content="CSIRO"/>
</head>
<clip id="findingGalaxies" start="15">
<a href="http://www.aao.gov.au/galaxies.anx#radio">
Related video on Detection of Galaxies</a>
<img src="galaxy.jpg"/>
<desc>What's out there? ...</desc>
<meta name="KEYWORDS" content="Radio Telescope, Galaxies"/>
</clip>
</cmml>
As the sample file shows, CMML has XML syntax, consisting of three main types of tags: at
most one stream tag, exactly one head tag, and an arbitrary number of clip tags.
The stream tag is optional. It describes the input bitstreams necessary for the creation of an
Annodex file in the import tags, and gives some timing information necessary for the output
Annodex file. The imported bitstreams will be interleaved into multiple tracks of media, even if
they start at different time offsets and need to be temporally realigned through the start
attribute.
The markup of a head tag in the CMML document contains information about the complete
media document. Its essential information comprises
structured textual annotations in meta tags, and
unstructured textual annotations in the title tag.
Structured annotations are name-value pairs which can follow a new or existing metadata
annotation scheme such as the Dublin Core (Dublin Core Metadata Initiative, 2003).
The markup of a clip tag contains information on the various clips or fragments of the media:
Anchor points provide entry points into the media document that a URI can refer to. Anchor
points identify the start time and the name (id) of a clip. This enables URIs to refer to
Annodex clips by name.
URI hyperlinks can be attached to a clip, linking out to any other place a URI can point to,
such as clips in other annodexed media or HTML pages. These are given by the a (anchor)
tag with its href attribute. Furthermore, the a tag contains a textual annotation of the link,
the so-called anchor text (in the example above: "Related video on Detection of
Galaxies") specifying why the clip is linked to a given URI. Note that this is similar to the
a tag in HTML.
An optional keyframe in the img tag provides a representative image for the clip and
enables display of a story board for Annodex files.
Unstructured textual annotations in the desc tags provide for searchability of Annodex files.
Unstructured annotation is free text that describes the clip itself.
Each clip belongs to a specific set of temporally non-overlapping clips that make up one track
of annotations for a time-continuous data file. The track attribute of a clip provides this
attribution – if it is not specified, the clip belongs to the default track.
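Because CMML is plain XML, clip markup is straightforward to process with standard XML
tools. As a minimal sketch, assuming the sample file above has been saved as galaxies.cmml,
the clips with their anchor points, hyperlinks, and descriptions can be listed in Python as
follows:

import xml.etree.ElementTree as ET

# Walk all clip tags and print their anchor point (id, start time),
# outgoing hyperlink, and free-text description.
for clip in ET.parse("galaxies.cmml").getroot().iter("clip"):
    anchor = clip.find("a")
    desc = clip.find("desc")
    print(clip.get("id"), clip.get("start"),
          anchor.get("href") if anchor is not None else "-",
          desc.text if desc is not None else "-")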
Using the above sample CMML file for authoring Annodex, the result will be a "galaxies.anx"
file of the form given in Figure 6 below.
Figure 6: The Annodex file created from the sample CMML file
Specifying time segments and clips in URIs
Linking to time segments and clips in URIs
A URI points to a Web resource, and is the primary mechanism on the Web to reach
information. Time-continuous Web resources are typically large data files. Thus, when a Web
user wants to link to the exact segment of interest within the time-continuous resource, it is
desirable that only that segment is transferred. This reduces network load and user waiting time.
No standardised scheme is currently available to directly link to segments of interest in a time-
continuous Web resource. However, addressing of subparts of Web resources is generally
achieved through URI query specifications. Therefore, we defined a query scheme to allow
direct addressing of segments of interest in Annodex files.
Two fundamentally different ways of addressing information in Annodex resources are
necessary: addressing of clips, and addressing of time offsets or time segments.
Linking to clips
Clips in Annodex files are identified by their id attribute. Thus, accessing a named clip in an
Annodex (and, for that matter, a CMML) file is achieved with the following CGI conformant
query parameter specification:
id="clip_id"
Examples for accessing a clip in the above given sample CMML and Annodex files are:
http://www.annodex.net/galaxies.cmml?id="findingGalaxies"
http://www.annodex.net/galaxies.anx?id="findingGalaxies"
On the Annodex server, the CMML and Annodex resources will be pre-processed as a result of
this query before being served out: the file header parts will be retained, the time basis will be
adjusted and the queried clip data will be concatenated at the end to regain conformant file
formats.
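A minimal sketch of how a server-side handler might resolve such an id query against the
CMML annotation is given below; the helper is illustrative only, as a real Annodex server
operates directly on the binary Annodex bitstream:

import xml.etree.ElementTree as ET

def clip_start(cmml_path, query_value):
    # Map the clip id from the URI query (quotes stripped) to the
    # clip's start time, from which serving can begin.
    clip_id = query_value.strip('"')
    for clip in ET.parse(cmml_path).getroot().iter("clip"):
        if clip.get("id") == clip_id:
            return float(clip.get("start"))
    return None

clip_start("galaxies.cmml", '"findingGalaxies"')  # -> 15.0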
Linking to time segments
It is also desirable to be able to address any arbitrary time segment of an Annodex or CMML
file. This is again achieved with a CGI conformant query parameter specification:
t=[time-scheme:]time_interval
Available time schemes are npt for normal play time, different smpte specifications of the
Society of Motion Pictures and Television Engineers (SMPTE), and clock for a Universal Time
Code (UTC) time specification. For more details see the specification document (Pfeiffer,
Parker, and Pang, 2003C).
Examples for requesting one or several time intervals from the above given sample CMML and
Annodex files are:
http://www.annodex.net/galaxies.cmml?t=85.28
http://www.annodex.net/galaxies.anx?t=npt:15.6-85.28,100.2
http://www.annodex.net/galaxies.anx?t=smpte-25:00:01:25:07
http://www.annodex.net/galaxies.anx?t=clock:20040114T153045.25Z
Where only a single time point is given, this is interpreted to relate to the time interval covered
from that time point onwards until the end of the stream.
The same pre-processing as described above will be necessary on the Annodex server.
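As a sketch, parsing the t= query value for the npt scheme could be implemented as follows;
SMPTE and clock schemes, comma-separated interval lists, and npt values in hh:mm:ss
notation are omitted for brevity:

def parse_npt(value):
    # Split an optional time scheme prefix off the interval.
    if ":" in value:
        scheme, value = value.split(":", 1)
        if scheme != "npt":
            raise NotImplementedError("only npt handled in this sketch")
    # A single time point means: from that point to the end of stream.
    start, sep, end = value.partition("-")
    return float(start), (float(end) if sep else None)

parse_npt("npt:15.6-85.28")  # -> (15.6, 85.28)
parse_npt("85.28")           # -> (85.28, None)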
Restricting views to time segments and clips in URIs
Aside from the query mechanism, URIs also provide a mechanism to address subparts of Web
resources locally on a Web client: URI fragment specifications. We have found that fragments
are a great mechanism to restrict views on Annodex files to a specific subpart of the resource,
e.g. when viewing or editing a temporal subpart of an Annodex document. Again, two
fundamentally different ways of restricting a time-continuous resource are required: views on a
clip and views on time segments.
Views on clips
Restricting the view on an Annodex (or CMML) file to a named clip makes use of the value of
the clip's id attribute in a fragment specification:
#clip_id
Examples for local clip views for the above given sample CMML and Annodex files are:
http://www.annodex.net/galaxies.cmml#findingGalaxies
http://www.annodex.net/galaxies.anx#findingGalaxies
The Web client that is asked for such a resource will ask the Web server for the complete
resource and perform its application-specific operation on the clip only. This may for example
result in a sound editor downloading a complete sound file, then selecting the named clip for
further editing. An Annodex browser would naturally behave analogously to an existing Web
browser that receives an HTML page with a fragment offset: it will fast forward to the named
clip as soon as that clip has been received.
Views on time segments
Analogously to clip views, views can be restricted to time intervals with the following
specification:
#[time-scheme:]time_interval
Examples for restrictions to one or several time intervals from the above given sample CMML
and Annodex files are:
http://www.annodex.net/galaxies.cmml#85.28
http://www.annodex.net/galaxies.anx#npt:15.6-85.28,100.2
http://www.annodex.net/galaxies.anx#smpte-25:00:01:25:07
http://www.annodex.net/galaxies.anx#clock:20040114T153045.25Z
Where only a single time point is given, this is interpreted to relate to the time interval covered
from that time point onwards until the end of the stream. The same usage examples as described
above apply in this case, too. Specifying several time segments may make sense only in specific
applications, such as an editor, where an unconnected selection for editing may result.
FEATURES OF ANNODEX
While developing the Annodex technology, we discovered that the Annodex file format
addresses many challenges of media research that were not part of the original goals of its
development but emerged serendipitously. Some of these are discussed briefly in this section.
Multitrack media file format
The Annodex file format is based on the Xiph.org Ogg file format (Pfeiffer, 2003) which allows
multiple time-continuous data tracks to be encapsulated in one interleaved file format. We have
extended the file format such that it can be parsed and handled without having to decode any of
the data tracks themselves, making Annodex a generic multitrack media file format. To that end
we defined a generic data track header page which includes a “Content-type” field that
identifies the codec in use and provides some general attributes of the track such as its temporal
resolution. The multitrack file format now has three parts:
1. Data track identifying header pages (primary header pages)
2. Codec header pages (secondary header pages)
3. Data pages
For more details refer to the Annodex format specification document (Pfeiffer, Parker, and
Pang, 2003B).
No standardised multitrack media format currently exists – many applications, amongst
them multitrack audio editors, will be able to take advantage of this one, especially since the
Annodex format also allows the inclusion of arbitrary meta information.
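As a minimal sketch of this parse-without-decode property, the following Python fragment
walks the pages of an Annodex (Ogg) bitstream using only the page header layout defined in
RFC 3533 (Pfeiffer, 2003), leaving all codec data unread:

import struct

def ogg_pages(path):
    # Yield (serial number, page sequence number, granule position)
    # for every page, skipping the page bodies.
    with open(path, "rb") as f:
        while True:
            header = f.read(27)
            if len(header) < 27:
                break
            assert header[:4] == b"OggS", "lost page synchronisation"
            granule, serial, seqno = struct.unpack_from("<qII", header, 6)
            nsegs = header[26]
            body_len = sum(f.read(nsegs))  # lacing values sum to body size
            f.seek(body_len, 1)
            yield serial, seqno, granule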
Multitrack annotations
CMML and Annodex have been designed to provide a means of annotating and indexing time-
continuous data files by structuring their time-line into regions of interest called clips. Each clip
may have structured and unstructured annotations, a hyperlink and a keyframe. A simple
partitioning however does not allow for several different, potentially overlapping subdivisions
of the time-line into clips. After considering several different solutions for such different
subdivisions, we decided to adopt a multitrack paradigm for annotations as well:
every clip of an Annodex or CMML file belongs to one specific annotation track,
clips within one annotation track cannot overlap temporally,
clips on different tracks can overlap temporally as needed,
the attribution of a clip to a track is specified through its track attribute – if it's not given, it's
attributed to a default track.
This is a powerful concept and can easily be represented in browsers by providing a choice of
the track that's visible.
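As an illustration, the following CMML fragment (with invented track names and content)
places temporally overlapping clips on two annotation tracks:

<clip id="intro" track="topics" start="0">
<desc>Introduction to hidden galaxies</desc>
</clip>
<clip id="narrator" track="speakers" start="0">
<desc>Narrator introduces the programme</desc>
</clip>
<clip id="detection" track="topics" start="15">
<desc>Detecting galaxies with radio telescopes</desc>
</clip>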
Internationalisation support
CMML and Annodex have also been designed to be language-independent and provide full
internationalisation support. There are two issues to consider for text in CMML elements:
different character sets and different languages.
As CMML is an XML markup language, different character sets are supported through the xml
processing instruction's encoding attribute containing a file-specific character set (World Wide
Web Consortium, 2000). A potentially differing character set for an imported media file will be
specified in the contenttype attribute of the import tag as a parameter to the MIME type.
Any tag or attribute that could end up containing text in a different language to the other tags
may specify its own language. This is only necessary for tags that contain human-readable
text. The language and writing direction are specified in the lang and dir attributes.
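For example, a German description for one clip within a UTF-8 encoded CMML file might be
marked up as sketched below (the attribute placement follows the description above):

<?xml version="1.0" encoding="UTF-8"?>
...
<desc lang="de" dir="ltr">Was ist dort draußen? ...</desc>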
Search engine support
Web search engines are powerful tools to explore the textual information published on Web
servers. The principle they work from is that crawling the hyperlinks found within known
Web pages will lead them to more Web pages, and eventually to most of the Web's content. For
all crawled Web resources they can build a search index of the textual contents and use it to
retrieve hyperlinks in response to search queries.
With binary time-continuous data files, indexing was previously not possible. However,
Annodex allows the integration of time-continuous data files into the crawling and indexing
paradigm of search engines through providing CMML files. A CMML file represents the
complete annotation of an Annodex file with HTML-style anchor tags in its clip tags that
enable crawling of Annodex files. Indexing can then happen on the level of the complete file or
on the level of individual clips. For the complete file, the tags in the head element (title &
meta tags) will be indexed, whereas for clips, the tags in the clip elements (desc & meta tags)
are necessary. The search result page should then display the descriptive content of the title
and desc tags, and the representative keyframe given in the img tag, to provide a visual
overview of the retrieved clip (see Figure 2).
For retrieval of the CMML file encapsulated in an Annodex file from an Annodex server,
HTTP’s content type negotiation is used. The search engine only needs to include into its HTTP
request an “Accept” header with a higher priority on “text/x-cmml” than on "application/x-
annodex" and a conformant Annodex server will provide the extracted CMML content for the
given Annodex resource.
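A minimal crawler sketch of this negotiation, reusing the resource from the earlier examples
and indexing the free-text and structured annotations per clip, might look as follows (assuming
the server honours the Accept header as described):

import urllib.request
import xml.etree.ElementTree as ET

req = urllib.request.Request(
    "http://www.annodex.net/galaxies.anx",
    headers={"Accept": "text/x-cmml, application/x-annodex;q=0.1"})
with urllib.request.urlopen(req) as resp:
    cmml = ET.parse(resp).getroot()

# Index the whole file by its head markup and each clip by its
# desc and meta markup.
index = {"(file)": cmml.findtext("head/title") or ""}
for clip in cmml.iter("clip"):
    terms = [clip.findtext("desc") or ""]
    terms += [m.get("content") or "" for m in clip.findall("meta")]
    index[clip.get("id")] = " ".join(t for t in terms if t)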
Caching Web proxies
HTTP defines a mechanism to cache byte ranges of files in Web proxies. With Annodex files,
this mechanism can be used to also cache time intervals or clips of time-continuous data files,
which are commonly large. To that end, the Web server must provide a mapping of the
clip or the time intervals to byte ranges. Then, the Web proxy can build up a table of ranges that
it caches for a particular Annodex resource. If it receives an Annodex resource request for a
time interval or clip that it already stores, it can serve the data straight from its cache. Just
like the Web server, it may however need to process the resource before serving it: the file
header parts need to be prepended to the data, the timebase needs to be adjusted, and the
queried data needs to be concatenated at the end to regain a conformant Annodex file format. As
Annodex allows parsing of files without decoding, this is a fairly simple operation, enabling a
novel use of time-continuous data on the Web.
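A hedged sketch of the proxy-side bookkeeping described above: a table maps the cached time
intervals of one Annodex resource to the byte ranges reported by the server (all numbers
invented), and is consulted before a request is forwarded to the origin server:

# Cached time intervals (in seconds) -> byte ranges for one resource.
cached = {(15.6, 85.28): (204800, 1843200)}

def lookup(t_start, t_end):
    for (s, e), byte_range in cached.items():
        if s <= t_start and t_end <= e:
            # Cache hit: re-prepend the file headers, adjust the
            # timebase, and serve the cached data.
            return byte_range
    return None  # cache miss: forward the request to the origin server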
Dynamic Annodex creation
Current Web sites use scripting extensively to automatically create HTML content with up-to-
date information extracted from databases. As Annodex and CMML provide clip-structured
media data, it is possible to create Annodex content by scripting. The annotation and indexing
information of a clip may then be stored in a metadata database with a reference to the clip file.
A script can then select clips by querying the metadata database and create an Annodex
file on the fly. News bulletins and video blogs are examples of applications that can be built
with such functionality.
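A minimal scripting sketch along these lines, with an invented database schema (a table clips
holding an identifier, start time, keyframe, and description per clip), could generate the CMML
authoring input like this:

import sqlite3
from xml.sax.saxutils import escape

rows = sqlite3.connect("news.db").execute(
    "SELECT id, start, keyframe, description FROM clips ORDER BY start")

# Emit a CMML document that an Annodex encoder can interleave with
# the referenced media data.
print('<?xml version="1.0" encoding="UTF-8"?>')
print('<cmml>\n<head><title>News bulletin</title></head>')
for clip_id, start, keyframe, desc in rows:
    print('<clip id="%s" start="%s">' % (clip_id, start))
    print('  <img src="%s"/>' % keyframe)
    print('  <desc>%s</desc>\n</clip>' % escape(desc))
print('</cmml>')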
RESEARCH OPPORTUNITIES
There are a multitude of open research opportunities related to Annodex, some of which are
mentioned in this section.
Further research is necessary for exploring transcoding of metadata. A multitude of different
markup languages for different kinds of time-continuous data already exist. CMML is a generic
means to provide structured and unstructured annotations on clips and media files. Many of the
existing ways to mark up media may be transcoded into CMML to utilise the power of
Annodex. Transcoding is simple to implement for markup that is also based on XML, because
XSLT (World Wide Web Consortium, 1999C) provides a good tool for implementing such scripts.
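As a sketch, an XSLT stylesheet transcoding a hypothetical transcript format (with segment
elements carrying id and begin attributes) into CMML clips could look like this:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/transcript">
    <cmml>
      <head><title><xsl:value-of select="@title"/></title></head>
      <xsl:for-each select="segment">
        <clip id="{@id}" start="{@begin}">
          <desc><xsl:value-of select="."/></desc>
        </clip>
      </xsl:for-each>
    </cmml>
  </xsl:template>
</xsl:stylesheet>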
Transcoding of metadata directly leads to the question of interoperability with other standards.
MPEG-7 is one such metadata standard for which transcoding needs to be explored; however,
MPEG-7 is more than just textual metadata, so there may be more to investigate. Similarly,
interoperability of Annodex with standards like RTP/RTSP (Schulzrinne et al. 1996 and 1998),
DVD (dvdforum, 2000), MPEG-4 (MPEG Industry Forum, 2002), and MPEG-21 (Burnett et
al., 2003) will need to be explored.
Another question that frequently emerges for Annodex is the question of annotating and
indexing regions of interest within a video's imagery. We decided that structuring the spatial
domain is out of scope for the Annodex technologies and may be re-visited at a future time.
Annodex is very specifically designed to solve problems for time-continuous data, and that data
may not necessarily have a spatial domain (such as audio data). Also, on different devices the
possible interactions have to be very simple, so e.g. selecting a spatial region while viewing a
video on a mobile device is impractical. However, it may be possible for specific applications to
use image maps with CMML clips to also hyperlink and describe in the spatial domain. This is
an issue to explore in the future.
Last but not least there are many opportunities to apply and extend existing multimedia content
analysis research to automatically determine CMML markup.
CONCLUSION
This chapter presented the Annodex technology, which brings the familiar searching and surfing
capabilities of the World Wide Web to time-continuously sampled data (Pfeiffer, Parker, and
Schremmer, 2003). At the core of the technology are the Continuous Media Markup Language
CMML, the Annodex stream and file format, and clip- and time-referencing URI hyperlinks.
These enable the extension of the Web to a Continuous Media Web with Annodex browsers,
Annodex servers, and Annodex search engines. Annodex is however more powerful as it also
represents a standard multitrack media file format with multitrack annotations, which can be
cached on Web proxies and used in Web server scripts for dynamic content creation. Therefore,
Annodex and CMML present a Web-integrated means for managing multimedia semantics.
ACKNOWLEDGEMENT
The authors gratefully acknowledge the comments, contributions, and proofreading of Claudia
Schremmer, who is making use of the Continuous Media Web technology in her research on
metadata extraction of meeting recordings.
REFERENCES
Annodex.net (2004). Open standards for annotating and indexing networked media. Retrieved
January 2004 from http://www.annodex.net.
Berners-Lee, T., Fielding, R., & Masinter, L. (1998). Uniform Resource Identifiers (URI):
Generic Syntax. Internet Engineering Task Force, RFC 2396, August 1998. Retrieved
January 2003 from http://www.ietf.org/rfc/rfc2396.txt.
Berners-Lee, T. with Fischetti, M. & Dertouzos, M.L. (1999). Weaving the Web: The Original
Design and Ultimate Destiny of the World Wide Web by its Inventor. Harper, San Francisco.
Burnett, I., Van de Walle, R., Hill, K., Bormans, J., & Pereira, F. (2003). MPEG-21: Goals and
Achievements. IEEE Multimedia Magazine, October-December, 60-70.
Dimitrova, N., Zhang, H.-J., Shahraray, B., Sezan, I., Huang, T., & Zakhor, A. (2002).
Applications of Video-Content Analysis and Retrieval. IEEE Multimedia Magazine, July-
September, 42-55.
Dublin Core Metadata Initiative (2003). The Dublin Core Metadata Element Set, v1.1. February
2003. Retrieved January 2004 from http://dublincore.org/documents/2003/02/04/dces.
dvdforum (2000). DVD Primer. September 2000. Retrieved January 2004 from
http://www.dvdforum.org/tech-dvdprimer.htm.
Fielding, R., Gettys, J., Mogul, J., Nielsen, H., Masinter, L., Leach, P., & Berners-Lee, T.
(1999). Hypertext Transfer Protocol -- HTTP/1.1. Internet Engineering Task Force, RFC
2616, June 1999. Retrieved January 2004 from http://www.ietf.org/rfc/rfc2616.txt.
Martinez, J.M., Koenen, R., & Pereira, F. (2002). MPEG-7: the Generic Multimedia Content
Description Standard. IEEE Multimedia Magazine, April-June, 78-87.
MPEG Industry Forum (2002). MPEG-4 Users Frequently Asked Questions. February 2002.
Retrieved January 2004 from http://www.mpegif.org/resources/mpeg4userfaq.php.
NCSA HTTPd Development Team (1995). The Common Gateway Interface (CGI). June 1995.
Retrieved January 2004 from http://hoohoo.ncsa.uiuc.edu/cgi/.
Pfeiffer, S. (2003). The Ogg encapsulation format version 0. Internet Engineering Task Force,
RFC 3533, May 2003. Retrieved January 2004 from http://www.ietf.org/rfc/rfc3533.txt.
Pfeiffer, S., Parker, C., & Schremmer, C. (2003). Annodex: A Simple Architecture to Enable
Hyperlinking, Search & Retrieval of Time-Continuous Data on the Web. Proceedings 5th
ACM SIGMM International Workshop on Multimedia Information Retrieval (MIR),
Berkeley, CA, USA, November, 87-93.
Pfeiffer, S., Parker, C., & Pang, A. (2003A). The Continuous Media Markup Language
(CMML), Version 2.0 (work in progress). Internet Engineering Task Force, December 2003.
Retrieved January 2004 from http://www.ietf.org/internet-drafts/draft-pfeiffer-cmml-01.txt.
Pfeiffer, S., Parker, C., & Pang, A. (2003B). The Annodex annotation and indexing format for
time-continuous data files, Version 2.0 (work in progress). Internet Engineering Task Force,
December 2003. Retrieved January 2004 from
http://www.ietf.org/internet-drafts/draft-pfeiffer-annodex-01.txt.
Pfeiffer, S., Parker, C., & Pang, A. (2003C). Specifying time intervals in URI queries and
fragments of time-based Web resources (BCP) (work in progress). Internet Engineering Task
Force, December 2003. Retrieved January 2004 from
http://www.ietf.org/internet-drafts/draft-pfeiffer-temporal-fragments-02.txt.
Schulzrinne, H., Casner, S., Frederick, R., & Jacobson, V. (1996). RTP: A Transport Protocol
for Real-Time Applications. Internet Engineering Task Force, RFC 1889, January 1996.
Retrieved January 2004 from http://www.ietf.org/rfc/rfc1889.txt.
Schulzrinne, H., Rao, A., & Lanphier, R. (1998). Real Time Streaming Protocol (RTSP).
Internet Engineering Task Force, RFC 2326, April 1998. Retrieved January 2003 from
http://www.ietf.org/rfc/rfc2326.txt.
World Wide Web Consortium (1999A). XML Path Language (XPath). W3C XPath, November
1999. Retrieved January 2004 from http://www.w3.org/TR/xpath/.
World Wide Web Consortium (1999B). HTML 4.01 Specification. W3C HTML, December
1999. Retrieved January 2004 from http://www.w3.org/TR/html4/.
World Wide Web Consortium (1999C). XSL Transformations (XSLT) Version 1.0. W3C
XSLT, November 1999. Retrieved January 2004 from http://www.w3.org/TR/xslt/.
World Wide Web Consortium (2000). Extensible Markup Language (XML) 1.0. W3C XML,
October 2000. Retrieved January 2004 from http://www.w3.org/TR/2000/REC-xml-20001006.
World Wide Web Consortium (2001). Synchronized Multimedia Integration Language (SMIL
2.0). W3C SMIL, August 2001. Retrieved January 2004 from http://www.w3.org/TR/smil20/.
World Wide Web Consortium (2002). XML Pointer Language (XPointer). W3C XPointer,
August 2002. Retrieved January 2004 from http://www.w3.org/TR/xptr/.
Xiphophorus (2004). Building a new era of Open multimedia. Retrieved January 2004 from
http://www.xiph.org/.
BIOGRAPHIES
Silvia Pfeiffer
Silvia Pfeiffer received her Masters Degree in Computer Science and Business Management
from the University of Mannheim, Germany, in 1993. She returned to that university in 1994 to
pursue a Ph.D. within the MoCA (Movie Content Analysis) project, exploring novel extraction
methods for audio-visual content and novel applications using these. Her 1999 thesis was on
audio content analysis of digital video. After this, she moved to Australia to work as a
research scientist in digital media at the CSIRO in Sydney. She has explored several projects
involving automated content analysis in the compressed domain, focusing on segmentation
applications. She has also actively contributed to MPEG-7 standardisation. In January 2001 she
had the initial ideas for a Web of continuous media, the specifications of which were worked
out within the Continuous Media Web research group that she is heading.
Conrad Parker
Conrad Parker works as a Senior Software Engineer at CSIRO Australia. He is actively involved
in various open source multimedia projects, including development of the Linux and Unix sound
editor Sweep. With Dr. Pfeiffer, he developed the mechanisms for streamable metadata
encapsulation used in the Annodex format, and is responsible for development of the core
software libraries, content creation tools and server modules of the reference implementation.
His research focuses on interesting applications of dynamic media generation, and improved
TCP congestion control for efficient delivery of media resources.
André Pang
André Pang received his Bachelor of Science with honours at the University of New South
Wales, in Sydney, Australia, in 2003. He has been involved with the Continuous Media Web
project since 2001, helping to develop the first specifications and implementations of the
Annodex technology and implementing the first Annodex Browser under Mac OS X. André is
involved in integrating Annodex support into several media frameworks, such as the VideoLAN
media player, DirectShow, xine, and QuickTime. In his spare time, he enjoys researching
compilers and programming languages and contributes code to many different open-source
projects.