Roeland OrdelmanUniversiteit Twente / Netherlands Institute for Sound and Vision · Human Media Interaction / Research & Development
Universiteit Twente / Netherlands Institute for Sound and Vision
Human Media Interaction / Research & Development
Senior Researcher / Research Manager
Skills and Expertise
Cross-Media Interaction · Research & Development
Netherlands Institute for Sound and Vision, Hilversum, The Netherlands · Research & Development
Universiteit Twente · Department of Human Media Interaction (HMI)
Research Items (107)
Video-to-video linking systems allow users to explore and exploit the content of a large-scale multimedia collection interactively and without the need to formulate specific queries. We present a short introduction to video-to-video linking (also called ‘video hyperlinking’), and describe the latest edition of the Video Hyperlinking (LNK) task at TRECVid 2016. The emphasis of the LNK task in 2016 is on multimodality as used by videomakers to communicate their intended message. Crowdsourcing makes three critical contributions to the LNK task. First, it allows us to verify the multimodal nature of the anchors (queries) used in the task. Second, it enables us to evaluate the performance of video-to-video linking systems at large scale. Third, it gives us insights into how people understand the relevance relationship between two linked video segments. These insights are valuable since the relationship between video segments can manifest itself at different levels of abstraction.
The value and importance of Benchmark Evaluations is widely acknowledged. Benchmarks play a key role in many research projects. It takes time, a well-balanced team of domain specialists preferably with links to the user community and industry, and a strong involvement of the research community itself to establish a sound evaluation framework that includes (annotated) data sets, well-defined tasks that reflect the needs in the 'real world', a proper evaluation methodology, ground-truth, including a strategy for repetitive assessments, and last but not least, funding. Although the benefits of an evaluation framework are typically reviewed from a perspective of 'research output' --e.g., a scientific publication demonstrating an advance of a certain methodology-- it is important to be aware of the value of the process of creating a benchmark itself: it increases significantly the understanding of the problem we want to address and as a consequence also the impact of the evaluation outcomes. In this talk I will overview the history of a series of tasks focusing on audiovisual search emphasizing its 'multimodal' aspects, starting in 2006 with the workshop on 'Searching Spontaneous Conversational Speech' that led to tasks in CLEF and MediaEval ("Search and Hyperlinking"), and recently also TRECVid ("Video Hyperlinking"). The focus of my talk will be on the process rather than on the results of these evaluations themselves, and will address cross-benchmark connections, and new benchmark paradigms, specifically the integration of benchmarking in industrial 'living labs' that are becoming popular in some domains.
In this paper we report on a two-stage evaluation of unsupervised labeling of audiovisual content using collateral text data sources to investigate how such an approach can provide acceptable results for given requirements with respect to archival quality, authority and service levels to external users. We conclude that with parameter settings that are optimized using a rigorous evaluation of precision and accuracy, the quality of automatic term-suggestion is sufficiently high. We furthermore provide an analysis of the term extraction after being taken into production, where we focus on performance variation with respect to term types and television programs. Having implemented the procedure in our production work-flow allows us to gradually develop the system further and to also assess the effect of the transformation from manual to automatic annotation from an end-user perspective. Additional future work will be on deploying different information sources including annotations based on multimodal video analysis such as speaker recognition and computer vision.
This paper overviews ongoing work that aims to support end-users in conveniently exploring and exploiting large audiovisual archives by deploying multiple multimodal linking approaches. We present ongoing work on multimodal video hyperlinking, from a perspective of unconstrained link anchor identification and based on the identification of named entities, and recent attempts to implement and validate the concept of outside-in linking that relates current events to archive content. Although these concepts are not new, current work is revealing novel insights, more mature technology, development of benchmark evaluations and emergence of dedicated workshops which are opening many interesting research questions on various levels that require closer collaboration between research communities.
The Workshop on Speech, Language and Audio in Multimedia (SLAM) positions itself at at the crossroad of multiple scientific fields (music and audio processing, speech processing, natural language processing and multimedia) to discuss and stimulate research results, projects, datasets and benchmarks initiatives where audio, speech and language are applied to multimedia data. While the first two editions were collocated with major speech events, SLAM'15 is deeply rooted in the multimedia community, opening up to computer vision and multimodal fusion. To this end, the workshop emphasizes video hyperlinking as an showcase where computer vision meets speech and language. Such techniques provide a powerful illustration of how multimedia technologies incorporating speech, language and audio can make multimedia content collections better accessible, and thereby more useful, to users.
In this paper we report on an evaluation of unsupervised labeling of audiovisual content using collateral text data sources to investigate how such an approach can provide acceptable results given requirements with respect to archival quality, authority and service levels to external users. We conclude that with parameter settings that are optimized using a rigorous evaluation of precision and accuracy, the quality of automatic term-suggestion are sufficiently high. Having implemented the procedure in our production work-flow allows us to gradually develop the system further and also assess the effect of the transformation from manual to automatic from an end-user perspective. Additional future work will be on deploying different information sources including annotations based on multimodal video analysis such as speaker recognition and computer vision.
textabstractArchives of cultural heritage organisations typically consist of collections in various formats (e.g. photos, video, texts) that are inherently related. Often, such disconnected collections represent value in itself but effectuating links between 'core' and 'context' collection items in various levels of granularity could result in a 'one-plus-one-makes-three' scenario both from a contextualisation perspective (public presentations, research) and access perspective. A key issue is the identification of contextual objects that can be associated with objects in the core collections, or the other way around. Traditionally, such associations have been created manually. For most organizations however, this approach does not scale. In this paper, we describe a case in which a semi-automatic approach was employed to create contextual links between television broadcast schedules in program guides (context collection) and the programs in the archive (core collection) of a large audiovisual heritage organisation.
Multimedia hyperlinking is an emerging research topic in the context of digital libraries and (cultural heritage) archives. We have been studying the concept of video-to-video hyperlinking from a video search perspective in the context of the MediaEval evaluation benchmark for several years. Our task considers a use case of exploring large quantities of video content via an automatically created hyperlink structure at the media fragment level. In this paper we report on our findings, examine the features of the definition of video hyperlinking based on results, and discuss lessons learned with respect to evaluation of hyperlinking in real-life use scenarios.
The Workshop on Speech, Language and Audio in Multime-dia (SLAM) positions itself at at the crossroad of multiple scientific fields—music and audio processing, speech processing , natural language processing and multimedia—to discuss and stimulate research results, projects, datasets and benchmarks initiatives where audio, speech and language are applied to multimedia data. While the first two editions were collocated with major speech events, SLAM'15 is deeply rooted in the multimedia community, opening up to computer vision and multimodal fusion. To this end, the workshop emphasizes video hyperlinking as an showcase where computer vision meets speech and language. Such techniques provide a powerful illustration of how multime-dia technologies incorporating speech, language and audio can make multimedia content collections better accessible, and thereby more useful, to users.
Semantic linking has a potential to enrich the audiovisual experience for users of television or radio broadcast archives. Recently, automatic semantic linking, has received increased attention, especially as second screen applications for television broadcasts are emerging. Semantic linking for radio broadcasts can enrich radio listening experience in a similar manner in combination with second screen-like applications. While the development of such applications is gaining popularity, little is known about the information in a radio program that may be interesting for link creation from a user perspective. We conducted a user study on semantic linking for radio broadcasts in order to know what information users regard as suitable anchors and what kind of information they like as targets. We found that users often regard topic and person as the best link anchors in the program. Additionally, we found that frequency and timing of information elements in a radio program do not dominate the users' selection of anchors. Furthermore, we found that there is a low agreement among users on regarding certain information elements as anchors. For practical reasons the study is conducted with 10 minutes of radio broadcast material of a particular program type, and with a total of 22 participants.
The EU FP7 project AXES aims at better understanding the needs of archive users and supporting them with systems that reach beyond the state-of-the-art. Our system allows users to instantaneously retrieve content using metadata, spoken words, or a vocabulary of reliably detected visual concepts comprising places, objects and events. Additionally, users can query for new concepts, for which models are learned on-the-fly, using training images obtained from an internet search engine. Thanks to advanced analysis and indexation methods, relevant material can be retrieved within seconds. Our system supports different types of models for object categories (e.g. “bus” or “house”), specific objects (landmarks or logos), person categories (e.g. “people with moustaches”), or specific persons (e.g. “President Obama”). Next to text queries, we support query-by-example, which retrieves content containing the same location, objects, or faces shown in provided images. Finally, our system provides alternatives to query-based retrieval by allowing users to browse archives using generated links. Here we evaluate the precision of the retrieved results based on textual queries describing visual content, with the queries extracted from user testing query logs.
This report describes metrics for the evaluation of the effectiveness of segment-based retrieval based on existing binary information retrieval metrics. This metrics are described in the context of a task for the hyperlinking of video segments. This evaluation approach re-uses existing evaluation measures from the standard Cranfield evaluation paradigm. Our adaptation approach can in principle be used with any kind of effectiveness measure that uses binary relevance, and for other segment-baed retrieval tasks. In our video hyperlinking setting, we use precision at a cut-off rank n and mean average precision.
Scholars are yet to make optimal use of Oral History collections. For the uptake of digital research tools in the daily working practice of researchers, practices and conventions commonly adhered to in the subfields in the humanities should be taken into account during development. To this end, in the Oral History Today project a research tool for exploring Oral History collections is developed in close collaboration with scholarly researchers. This paper describes four stages of scholarly research and the first steps undertaken to incorporate requirements of these stages in a digital research environment.
This paper reports on the results of a quantitative analysis of user requirements for audiovisual search that allow the categorisation of requirements and to compare requirements across user groups. The categorisation provides clear directions with respect to the prioritisation of system features from the perspective of the development of systems for specific, single user groups and systems that have a more general target user group.
Video hyperlinking is regarded as a means to enrich interactive television experiences. Creating links manually however has limitations. In order to be able to automate video hyperlinking and increase its potential we need to have a better understanding of how both broadcasters that supply interactive television and the end-users approach and perceive hyperlinking. In this paper we report on the development of an editor tool for supervised automatic video hyperlinking that will allow us to investigate video hyperlinking in a real-life scenario.
Although linking video to additional information sources seems to be a sensible approach to satisfy information needs of user, the perspective of users is not yet analyzed on a fundamental level in real-life scenarios. However, a better understanding of the motivation of users to follow links in video, which anchors users prefer to link from within a video, and what type of link targets users are typically interested in, is important to be able to model automatic linking of audiovisual content appropriately. In this paper we report on our methodology towards eliciting user requirements with respect to video linking in the course of a broader study on user requirements in searching and a series of benchmark evaluations on searching and linking.
Searching for relevant webpages and following hyperlinks to related content is a widely accepted and effective approach to information seeking on the textual web. Existing work on multimedia information retrieval has focused on search for individual relevant items or on content linking without specific attention to search results. We describe our research exploring integrated multimodal search and hyperlinking for multimedia data. Our investigation is based on the MediaEval 2012 Search and Hyperlinking task. This includes a known-item search task using the Blip10000 internet video collection, where automatically created hyperlinks link each relevant item to related items within the collection. The search test queries and link assessment for this task was generated using the Amazon Mechanical Turk crowdsourcing platform. Our investigation examines a range of alternative methods which seek to address the challenges of search and hyperlinking using multimodal approaches. The results of our experiments are used to propose a research agenda for developing effective techniques for search and hyperlinking of multimedia content.
The negative consequences of cyberbullying are becoming more alarming every day and technical solutions that allow for taking appropriate action by means of automated detection are still very limited. Up until now, studies on cyberbullying detection have focused on individual comments only, disregarding context such as users’ characteristics and profile information. In this paper we show that taking user context into account improves the detection of cyberbullying.
In this paper we report our experiments and results for the brave new searching and hyperlinking tasks for the Medi-aEval Benchmark Initiative 2012. The searching task in-volves finding target video segments based on a short natural language sentence query and the hyperlinking task involves finding links from the target video segments to other related video segments in the collection using a set of anchor seg-ments in the videos that correspond to the textual search queries. To find the starting points in the video, we only used speech transcripts and metadata as evidence source, however, other visual features (for e.g., faces, shots and keyframes) might also affect results for a query. We indexed speech transcripts and metadata, furthermore, the speech transcripts were indexed at speech segment level and at sentence level to improve the likelihood of finding jump-in-points. For linking video segments, we computed k-nearest neighbours of video segments using euclidean distance.
The search and hyperlinking task at MediaEval 2014 is the third edition of this task. As in previous versions, it consisted of two sub-tasks: (i) answering search queries from a collection of roughly 2700 hours of BBC broadcast TV material, and (ii) linking anchor segments from within the videos to other target segments within the video collection. For MediaEval 2014, both sub-tasks were based on an ad-hoc retrieval scenario, and were evaluated using a pooling procedure across participants submissions with crowdsourcing relevance assessment using Amazon Mechanical Turk.
The MediaEval Multimedia Benchmark leveraged community cooperation and crowdsourcing to develop a large Internet video dataset for its Genre Tagging and Rich Speech Retrieval tasks.
Friendships, relationships and social communications have all gone to a new level with new definitions as a result of the invention of online social networks. Meanwhile, alongside this transition there is increasing evi-dence that online social applications have been used by children and adoles-cents for bullying. State-of-the-art studies in cyberbullying detection have mainly focused on the content of the conversations while largely ignoring the users involved in cyberbullying. We hypothesis that incorporation of the users' profile, their characteristics, and post-harassing behaviour, for instance, posting a new status in another social network as a reaction to their bullying experience, will improve the accuracy of cyberbullying detection. Cross-system analyses of the users' behaviour -monitoring users' reactions in different online environ-ments -can facilitate this process and could lead to more accurate detection of cyberbullying. This paper outlines the framework for this faceted approach.
We present an exploratory study of the retrieval of semi- professional user-generated Internet video. The study is based on the MediaEval 2011 Rich Speech Retrieval (RSR) task for which the dataset was taken from the Internet shar- ing platform blip.tv, and search queries associated with spe- cific speech acts occurring in the video. We compare re- sults from three participant groups using: automatic speech recognition system transcript (ASR), metadata manually as- signed to each video by the user who uploaded it, and their combination. RSR 2011 was a known-item search for a sin- gle manually identified ideal jump-in point in the video for each query where playback should begin. Retrieval effec- tiveness is measured using the MRR and mGAP metrics. Using different transcript segmentation methods the partic- ipants tried to maximize the rank of the relevant item and to locate the nearest match to the ideal jump-in point. Results indicate that best overall results are obtained for topically homogeneous segments which have a strong overlap with the relevant region associated with the jump-in point, and that use of metadata can be beneficial when segments are unfocused or cover more than one topic.
As a result of the invention of social networks, friendships, relationships and social communication are all undergoing changes and new definitions seem to be applicable. One may have hundreds of "friends" without even seeing their faces. Meanwhile, alongside this transition there is increasing evidence that online social applications are used by children and adolescents for bullying. State-of-the-art studies in cyberbullying detection have mainly focused on the content of the conversations while largely ignoring the characteristics of the actors involved in cyberbullying. Social studies on cyberbullying reveal that the written language used by a harasser varies with the author"s features including gender. In this study we used a support vector machine model to train a gender-specific text classifier. We demonstrated that taking gender-specific language features into account improves the discrimination capacity of a classifier to detect cyberbullying.
While automatic linking in text collections is well understood, little is known about links in images. In this work, we investigate two aspects of anchors, the origin of a link, in images: 1) the requirements of users for such anchors, e.g. the things users would like more information on, and 2) possible evaluation methods assessing anchor selection algorithms. To investigate these aspects, we perform a study with 102 users. We find that 59% of the required anchors are image segments, as opposed to the whole image, and most users require information on displayed persons. The agreement of users on the required anchors is too low (often below 30%) for a ground truth-based evaluation, which is the standard IR evaluation method. As an alternative, we propose a novel evaluation method based on improved search performance and user experience.
The AXES project participated in the interactive known-item search task (KIS) and the interactive instance search task (INS) for TRECVid 2011. We used the same system architecture and a nearly identical user interface for both the KIS and INS tasks. Both systems made use of text search on ASR, visual concept detectors, and visual similarity search. The user experiments were carried out with media professionals and media students at the Netherlands Institute for Sound and Vision, with media professionals performing the KIS task and media students participating in the INS task. This paper describes the results and findings of our experiments.
Safeguarding the massive body of audiovisual content, including rich music collections, in audiovisual archives and enabling access for various types of user groups is a prerequisite for unlocking the social-economic value of these collections. Data quantities and the need for specific content descriptors however, force archives to re-evaluate their annotation strategies and access models, and incorporate technology in the archival workflow. It is argued that this can only be successfully done provided that user requirements are studied well and that new approaches are introduced in a well-balanced manner, fitting in with traditional archival perspectives, and by bringing the archivist in the technology loop by means of education and by deploying hybrid work-flows for technology aided annotation.
This paper presents the audiovisual content access on the Internet. The use of multimedia on the Internet is large and growing at an extraordinary rate. By 2015, one-million minutes of video content will cross the Internet every second. Audiovisual archives are investing in large scale digitization efforts of their analog holdings and, in parallel, ingesting an ever-increasing amount of born-digital files in their digital deposits. Digitization opens up new access paradigms and boosts reuse of audiovisual content. Query-log analyses show the shortcomings of manual annotations.
Archival practice is shifting from the analogue to the digital world. A specific subset of heritage collections that impose interesting challenges for the field of language and speech technology are spoken word archives. Given the enormous backlog at audiovisual archives of unannotated materials and the generally global level of item description, collection disclosure and item access are both at risk, and (semi-)automated methods for analysis and annotation may help to increase the use and reuse of these rich content collections. In several HMI projects the interplay has been investigated between evolving user scenarios and user requirements for spoken audio collections on the one hand, and the potential of automatic annotation and search technology for the improved accessibility and search paradigms on the other hand. In this paper we will present an overview of the state-of-the-art in metadata generation for audio content and explain the crucial importance of involving user groups in the design of research agendas and road maps for novel applications in this domain.
This paper describes the participation of the University of Twente team at the Rich Text Retrieval Task of the Media Eval Benchmark Initiative 2011. The goal of the task is to find entry points of relevant parts of videos to reduce the browsing effort of searchers. This is our first participation, therefore our main focus is to create a baseline system which can be improved in the future. We experiment with different evidence sources (ASR and meta data) together with a basic score combination function. We also experiment with different entry points relative to the segments found by the contained evidence.
Automatically generated tags and geotags hold great promise to improve access to video collections and online communities. We overview three tasks offered in the MediaEval 2010 benchmarking initiative, for each, describing its use scenario, definition and the data set released. For each task, a reference algorithm is presented that was used within MediaEval 2010 and comments are included on lessons learned. The Tagging Task, Professional involves automatically matching episodes in a collection of Dutch television with subject labels drawn from the keyword thesaurus used by the archive staff. The Tagging Task, Wild Wild Web involves automatically predicting the tags that are assigned by users to their online videos. Finally, the Placing Task requires automatically assigning geo-coordinates to videos. The specification of each task admits the use of the full range of available in-formation including user-generated metadata, speech recognition transcripts, audio, and visual features.
The aim of this paper is to reflect on the factors that impede a clear communication and a more fruitful collaboration between humanities scholars and ICT developers. One of the observations is that ICT-researchers who design tools for humanities researchers, are less inclined to take into account that each stage of the scholarly research process requires ICT-support in a different manner or through different tools. Likewise scholars in the humanities often have prejudices concerning ICT-tools, based on lack of knowledge and fears of technology-driven agendas. If the potential for methodological innovation of the humanities is to be realized, the gap between the mindset of ICT-researchers and that of archivists and scholars in the humanities needs to be bridged. Our assumption is that a better insight into the variety of uses of digital collections and a user-inspired classification of ICT-tools, can help to achieve a greater conceptual clarity among both users and developers. This paper presents such an overview in the form of a typology for the audio-visual realm: examples of what role digital audio-visual archives can play at various research stages, and an inventory of the challenges for the parties involved.
In contrast with the large amounts of potential interesting research material in digital multimedia repositories, the opportunities to unveil the gems therein are still very limited. The Oral History project ‘Verteld Verleden’ (Dutch literal translation of Oral History) that is currently running in The Netherlands, focuses on improving access to spoken testimonies in collections, spread over many Dutch cultural heritage institutions, by deploying modern technology both concerning infrastructure and access. Key objective in the project is mapping the various specific requirements of collection owners and researchers regarding both publishing and access by means of current state-of-the-technology. In order to demonstrate the potential, Verteld Verleden develops an Oral History portal that provides access to distributed collections. At the same time, practical step-by-step plans are provided to get to work with modern access technologies. In this way, a solid starting point for sustained access to Oral History collections can be established.
Social science is often concerned with the emergence of collective behavior out of the interactions of large numbers of individuals, but in this regard it has long suffered from a severe measurement problem - namely that individual-level behavior and ...
In this technical demonstration, we showcase a multimedia search engine that facilitates semantic access to archival rock n' roll concert video. The key novelty is the crowdsourcing mechanism, which relies on online users to improve, extend, and share, automatically detected results in video fragments using an advanced timeline-based video player. The user-feedback serves as valuable input to further improve automated multimedia retrieval results, such as automatically detected concepts and automatically transcribed interviews. The search engine has been operational online to harvest valuable feedback from rock n' roll enthusiasts.
Narrative peaks are points at which the viewer perceives a spike in the level of dramatic tension within the narrative flow of a video. This paper reports on four approaches to narrative peak detection in television documentaries that were developed by a joint team consisting of members from Delft University of Technology and the University of Twente within the framework of the VideoCLEF 2009 Affect Detection task. The approaches make use of speech recognition transcripts and seek to exploit various sources of evidence in order to automatically identify narrative peaks. These sources include speaker style (word choice), stylistic devices (use of repetitions), strategies strengthening viewers’ feelings of involvement (direct audience address) and emotional speech. These approaches are compared to a challenging baseline that predicts the presence of narrative peaks at fixed points in the video, presumed to be dictated by natural narrative rhythm or production convention. Two approaches deliver top narrative peak detection results. One uses counts of personal pronouns to identify points in the video where viewers feel most directly involved. The other uses affective word ratings to calculate scores reflecting emotional language.
After two successful years at SIGIR in 2007 and 2008, the third workshop on Searching Spontaneous Conversational Speech (SSCS 2009) was held conjunction with the ACM Multimedia 2009. The goal of the SSCS series is to serve as a forum that brings together the disciplines that collaborate on spoken content retrieval, including information retrieval, speech recognition and multimedia analysis. Multimedia collections often contain a speech track, but in many cases it is ignored or not fully exploited for information retrieval. Currently, spoken content retrieval research is expanding beyond highly-conventionalized domains such as broadcast news in to domains involving speech that is produced spontaneously and in conversational settings. Such speech is characterized by wide variability of speaking styles, subject matter and recording conditions. The work presented at SSCS 2009 included techniques for searching meetings, interviews, telephone conversations, podcasts and spoken annotations. The work encompassed a large range of approaches including using subword units, exploiting dialogue structure, fusing retrieval models, modeling topics and integrating visual features. Taken in sum, the workshop demonstrated the high potential of new ideas emerging in the area of speech search and also reinforced the need for concentrated research devoted to the classic challenges of spoken content retrieval, many of which remain yet unsolved.
We carry out two studies on affective state modeling for communication settings that involve unilateral intent on the part of one participant (the evoker) to shift the affective state of another participant (the experiencer). The first investigates viewer response in a narrative setting using a corpus of documentaries annotated with viewer-reported narrative peaks. The second investigates affective triggers in a conversational setting using a corpus of recorded interactions, annotated with continuous affective ratings, between a human interlocutor and an emotionally colored agent. In each case, we build a "one-sided" model using indicators derived from the speech of one participant. Our classification experiments confirm the viability of our models and provide insight into useful features.
News-related content is nowadays among the most popular types of content for users in everyday applications. Although the generation and distribution of news content has become commonplace, due to the availability of inexpensive media capturing devices and the development of media sharing services targeting both professional and user-generated news content, the automatic analysis and annotation that is required for supporting intelligent search and delivery of this content remains an open issue. In this paper, a complete architecture for knowledge-assisted multimodal analysis of news-related multimedia content is presented, along with its constituent components. The proposed analysis architecture employs state-of-the-art methods for the analysis of each individual modality (visual, audio, text) separately and proposes a novel fusion technique based on the particular characteristics of news-related content for the combination of the individual modality analysis results. Experimental results on news broadcast video illustrate the usefulness of the proposed techniques in the automatic generation of semantic annotations.
The spoken word is a valuable source of semantic information. Techniques that exploit the spoken word by making use of speech recognition or spoken audio analysis hold clear potential for improving multimedia search. Nonetheless, speech technology remains underexploited by systems that provide access to spoken audio or video with a speech track. Indexing the spoken audio produced by speakers engaging in conversation or otherwise speaking spontaneously is particularly challenging. The challenges arise due to the wide variability and highly unstructured nature of unplanned, informal speech. Development of approaches that can effectively exploit the semantic content of spontaneous, conversational speech requires integration of speech recognition, audio processing, multimedia analysis and information retrieval. The SSCS workshop series is devoted to providing a forum where scientists engaged in spoken content retrieval research at the intersection of these disciplines can meet, present and discuss recent research results and also formulate a common vision on the future of spoken content retrieval. The research papers presented at SSCS 2010 cluster around topics that are central for spoken content retrieval. Two papers focus on specific indexing techniques applied to spontaneous speech: speaker role recognition and concept detection. Two papers treat Spoken Term Detection, addressing the challenge of terms that cannot be indexed using conventional approaches since they are not contained in the speech recognizer vocabulary (i.e., the so-called Out-Of-Vocabulary problem). Finally, three papers are devoted to topics related to the automatic segmentation of spontaneous conversational content and deal with issues involving the combination of automatic segmentation and information retrieval. SSCS 2010 continues the tradition of past years by including a demonstration session that allows hands-on interaction with systems implementing state-of-the-art approaches to spoken content retrieval. Five demonstration papers give the details of the systems presented. SSCS 2010 includes a number of presentations by invited speakers who address topics related to the user perspective on spoken content retrieval and to domains that are anticipated to give rise to future issues faced by scientists working in the field.
Techniques for automatic annotation of spoken content making use of speech recognition technology have long been characterized as holding unrealized promise to provide access to archives inundated with undisclosed multimedia material. This paper provides an overview of techniques and trends in semantic speech retrieval, which is taken to encompass all approaches offering meaning-based access to spoken word collections. We present descriptions, examples and insights for current techniques, including facing real-world heterogenity, aligning parallel resources and exploiting collateral collections. We also discuss ways in which speech recognition technology can be used to create multimedia connections that make new modes of access available to users. We conclude with an overview of the challenges for semantic speech retrieval in the workflow of a real-world archive and perspectives on future tasks in which speech retrieval integrates information related to affect and appeal, dimensions that transcend topic.
Given the enormous backlog at audiovisual archives and the generally global level of item description, collection disclosure and item access are both at risk. At the same time, archival practice is seeking to evolve from the analogue to the digital world. CHoral investigates the role automatic annotation and search technology can play in improving disclosure and access of digitized spoken word collections during and after this transfer. The core business of the CHoral project is to design and build technology for spoken document retrieval for heritage collections. In this paper, we will argue that in addition to solving technological issues, closer attention is needed for the work-flow and daily practice at audiovisual archives on the one hand, and the state-of-the-art in technology on the other. Analysis of the interplay is needed to ensure that new developments are mutually beneficial and that continuing cooperation can indeed bring envisioned advancements.
StreetTiVo is a project that aims at bringing research results into the living room; in particular, a mix of current results in the areas of Peer-to-Peer XML Database Management System (P2P XDBMS), advanced multimedia analysis techniques, and advanced information retrieval techniques. The project develops a plug-in application for the so-called Home Theatre PCs, such as set-top boxes with MythTV or Windows Media Center Edition installed, that can be considered as programmable digital video recorders. StreetTiVo distributes compute-intensive multimedia analysis tasks over multiple peers (i.e., StreetTiVo users) that have recorded the same TV program, such that a user can search in the content of a recorded TV program shortly after its broadcasting; i.e., it enables near real-time availability of the meta-data (e.g., speech recognition) required for searching the recorded content. StreetTiVo relies on our P2P XDBMS technology, which in turn is based on a DHT overlay network, for distributed collaborator discovery, work coordination and meta-data exchange in a volatile WAN environment. The technologies of video analysis and information retrieval are seamlessly integrated into the system as XQuery functions.
Spoken document retrieval research effort invested into developing broadcast news retrieval systems has yielded impressive results. This paper is the introduction the proceedings of the 3rd workshop aiming at the advancement of the field in less explored domains (SSCS2009) which was organized in conjunction to the ACM Multimedia Conference in Beijing.
This paper presents and discusses ongoing work aiming at affordable disclosure of real-world spoken heritage archives in general, and in particular of a collection of recorded interviews with Dutch survivors of World War II concentration camp Buchenwald. Given such collections, we at least want to provide search at different levels and a flexible way of presenting results. Strategies for automatic annotation based on speech recognition - supporting e.g., within-document search - are outlined and discussed with respect to the Buchenwald interview collection. In addition, usability aspects of the spoken word search are discussed on the basis of our experiences with the online Buchenwald web portal. It is concluded that, although user feedback is generally fairly positive, automatic annotation performance is not yet satisfactory, and requires additional research.
The second workshop on Searching Spontaneous Conversational Speech (SSCS 2008) was held in Singapore on July 24, 2008 in conjunction with the 31st Annual International ACM SIGIR Conference. The goal of the workshop was to bring the speech community and the information retrieval community together. The forum was designed to be conducive to the close interaction and the intense discussion necessary to promote fusion of these fields into a single discipline with a concerted vision of spoken content retrieval. At the workshop, talks and posters were presented covering a wide range of topics including vocabulary in- dependent search, spoken term detection, combination of models/indexes, use of speech recognition lattices for search, segmentation, temporal analysis, benchmarking, exploitation of prosody, speech surrogates for user interfaces and multi-language collections. Demon- strations of speech-based retrieval systems from a variety of application domains introduced a strong practical emphasis into the workshop program. The workshop concluded with a panel discussion, whose goal it was to identify future research directions for speech retrieval. Among the important challenges identified during the panel discussions were: dealing with large scale multimedia collections, representing audio/video content eectively in the user interface, focusing on perfecting the component technologies on which speech retrieval sys- tems are based, and developing systems and approaches that will enable users (both content seekers and content providers) to actively create their own speech search applications or contribute to the indexability of their content.
In addition to multimedia collections and their metadata, there often is a variety of collateral data sources available on (parts of) a collection. Collateral data - secondary information objects that relate to the primary multimedia documents - can be very useful in the process of automated generation of annotations for multimedia archives in that they reduce both costs and effort in annotation and access. Furthermore, they can be used to enhance result presentation in retrieval engines. To optimally exploit collateral data, methods for automatic indexing as well as changes in the current archiving workflow are proposed.
In this paper, a complete architecture for knowledge-assisted cross-media analysis of News-related multimedia content is presented, along with its constituent components. The proposed analysis architecture employs state-of-the-art methods for the analysis of each individual modality (visual, audio, text) separately, and proposes a fusion technique based on the particular characteristics of News-related content for the combination of the individual modality analysis results. Experimental results on news broadcast video illustrate the usefulness of the proposed techniques in the automatic generation of semantic video annotations.
Recorded interviews form a rich basis for scholarly inquiry. Examples include oral histories, community memory projects, and interviews conducted for broadcast media. Emerging technologies offer the potential to radically transform the way in which recorded interviews are made accessible, but this vision will demand substantial investments from a broad range of research communities. This article reviews the present state of practice for making recorded interviews available and the state-of-the-art for key component technologies. A large number of important research issues are identified, and from that set of issues, a coherent research agenda is proposed.
The re-use of spoken word audio collections maintained by audiovisual archives is severely hindered by their generally limited access. The CHoral project, which is part of the CATCH program funded by the Dutch Research Council, aims to provide users of speech archives with online, instead of on-location, access to relevant fragments, instead of full documents. To meet this goal, a spoken document retrieval framework is being developed. In this paper the evaluation efforts undertaken so far to assess and improve various aspects of the framework are presented. These efforts include (i) evaluation of the automatically generated textual representations of the spoken word documents that enable word-based search, (ii) the development of measures to estimate the quality of the textual representations for use in information retrieval, and (iii) studies to establish the potential user groups of the to-be-developed technology, and the first versions of the user interface supporting online access to spoken word collections.
The computational linguistics community in The Netherlands and Belgium has long recognized the dire need for a major reference corpus of written Dutch. In part to answer this need, the STEVIN programme was established. To pave the way for the effective building of a 500-million-word reference corpus of written Dutch, a pilot project was established. The Dutch Corpus Initiative project or D-Coi was highly successful in that it not only realized about 10% of the projected large reference corpus, but also established the best practices and developed all the protocols and the necessary tools for building the larger corpus within the confines of a necessarily limited budget. We outline the steps involved in an endeavour of this kind, including the major highlights and possible pitfalls. Once converted to a suitable XML format, further linguistic annotation based on the state-of-the-art tools developed either before or during the pilot by the consortium partners proved easily and fruitfully applicable. Linguistic enrichment of the corpus includes PoS tagging, syntactic parsing and semantic annotation, involving both semantic role labeling and spatiotemporal annotation. D-Coi is expected to be followed by SoNaR, during which the 500-million-word reference corpus of Dutch should be built. With funding of the Dutch and Flemish governments and research foundations the present joint Dutch-Flemish STEVIN programme was put in place in 2004. One of the aims of the STEVIN programme is to realize an appropriate digital language infrastructure for Dutch. The programme also intends to stimulate strategic research in the domains of language and speech technology. The compilation of a reference corpus for Dutch has been identified as one of the priorities in the programme. Such a corpus is considered one of the prerequisites for the development of other re- sources, various tools, and applications. It is expected that once the corpus is available it will give a significant boost to natural language processing involving the Dutch language. The reference corpus should be a well-structured, balanced collection of text samples tailored to the uses to which the corpus is going to be put. The contents of the corpus as well as the nature of the annotations to be provided are to be largely determined by the needs of ongoing and projected research and development in the fields of corpus-based nat- ural language processing. Applications such as informa- tion extraction, question-answering, document classifica- tion, and automatic abstracting that are based on underlying corpus-based techniques will benefit from the large-scale analysis of particular features in the corpus. Apart from supporting corpus-based modeling, the corpus will consti- tute a test bed for evaluating applications, whether or not these applications are corpus-based. The construction of a reference corpus requires that moti- vated decisions be taken for all aspects of its design, en- coding, markup, and annotation schemes, while also vari- ous protocols and procedures must be in place. Therefore, from June 2005 until December 2006, the STEVIN pro- gramme funded the Dutch language Corpus Initiative (D-
Decoders that make use of token-passing restrict their search space by various types of token pruning. With use of the Language Model Look-Ahead (LMLA) technique it is pos- sible to increase the number of tokens that can be pruned with- out loss of decoding precision. Unfortunately, for token passing decoders that use single static pronunciation prefix trees, full n-gram LMLA increases the needed number of language model probability calculations considerably. In this paper a method for applying full n-gram LMLA in a decoder with a single static pronunciation tree is introduced. The experiments show that this method improves the speed of the decoder without an in- crease of search errors. Index Terms: Automatic speech recognition, decoding, token pruning, language model look-ahead
The second workshop on Searching Spontaneous Conversational Speech (SSCS 2008) was held in Singapore on July 24, 2008 in conjunction with the 31st Annual International ACM SIGIR Conference. The goal of the workshop was to bring the speech community and the information retrieval community together. The forum was designed to be conducive to the close interaction and the intense discussion necessary to promote fusion of these fields into a single discipline with a concerted vision of spoken content retrieval. The proceedings contain papers on a wide range of topics including vocabulary independent search, spoken term detection, combination of models/indexes, use of speech recognition lattices for search, segmentation, temporal analysis, benchmarking, exploitation of prosody, speech surrogates for user interfaces and multi-language collections. A workshop reprot has been published in ACM SIGIR Forum, issue December 2008.
This proceedings volume contains papers on topics that are currently gaining momentum in the speech search community, strengthened by their position in the intersection of information retrieval and speech recognition research. Together these papers cover a wide spectrum of research areas, including vocabulary independent search, spoken term detection, combination of models/indexes, use of speech recognition lattices for search, segmentation, temporal analysis, benchmarking, exploitation of prosody, speech surrogates for user interfaces and multi-language collections.
This paper reports on the setup and evaluation of robust speech recognition sys- tem parts, geared towards transcript generation for heterogeneous, real-life media collections. The system is deployed for generating speech transcripts for the NIST/TRECVID-2007 test collection, part of a Dutch real-life archive of news-related genres. Performance figures for this type of content are compared to figures for broadcast news test data.
The SIGIR Workshop on Searching Spontaneous Conversational Speech was held as part of the 2007 ACM SIGIR Conference in Amsterdam. The workshop program was a mix of elements, including a keynote speech, paper presentations and panel discussions. This brief report describes the organization of this workshop and summarizes the discussions.
In this paper the XML Information Retrieval System PF/Tijah is applied to retrieval tasks on large spoken document collections. The used example setting is the English CLEF-2006 CL-SR collection together with given English topics and self produced Dutch topics. The main findings presented in this paper are the easy way of adapting queries to use different kinds and combinations of metadata. Furthermore simple ways of combining different metadata kinds are shown to be beneficial in terms of mean average precision.
Bridging the semantic gap is one of the big challenges in multimedia information retrieval. It exists between the extraction of low-level features of a video and its conceptual contents. In order to understand the conceptual content of a video a common approach is building concept detectors. A problem of this approach is that the number of detectors is impossible to determine. This paper presents a set of 8 methods on how to combine two existing concepts into a new one, which occurs when both concepts appear at the same time. The scores for each shot of a video for the combined concept are computed from the output of the underlying detectors. The findings are evaluated on basis of the output of the 101 detectors including a comparison to the theoretical possibility to train a classifier on each combined concept. The precision gains are significant, specially for methods which also consider the chronological surrounding of a shot promising.
In this report we summarize our methods and results for the search tasks in TRECVID 2007. We employ two different kinds of search: purely ASR based and purely concept based search. However, there is not significant difference of the performance of the two systems. Using neighboring shots for the combination of two concepts seems to be beneficial. General preprocessing of queries increased the performance and choosing detector sources helped. However, for all automatic search components we need to perform further investigations.
The 'Radio Oranje' demonstrator shows an attractive multimedia user experience in the cultural heritage domain based on a collection of mono-media audio documents. It supports online search and browsing of the collection using indexing techniques, specialized content visualizations and a related photo database.
In this paper we discuss the speech activity detection sys- tem that we used for detecting speech regions in the Dutch TRECVID video collection. The system is designed to filter non-speech like music or sound effects out of the signal with- out the use of predefined non-speech models. Because the sys- tem trains its models on-line, it is robust for handling out-of- domain data. The speech activity error rate on an out-of-domain test set, recordings of English conference meetings, was 4.4%. The overall error rate on twelve randomly selected five minute TRECVID fragments was 11.5%. Index Terms: speech activity detection
This contribution describes the Twente News Corpus (TwNC), a multifaceted corpus for Dutch that is being deployed in a number of NLP research projects among which tracks within the Dutch national research programme MultimediaN, the NWO programme CATCH, and the Dutch-Flemish programme STEVIN. The development of the corpus started in 1998 within a predecessor project DRUID and has currently a size of 530M words. The text part has been built from texts of four different sources: Dutch national newspapers, television subtitles, teleprompter (auto-cues) files, and both manually and automatically generated broadcast news transcripts along with the broadcast news audio. TwNC plays a crucial role in the development and evaluation of a wide range of tools and applications for the domain of multimedia indexing, such as large vocabulary speech recognition, cross-media indexing, cross-language information retrieval etc. Part of the corpus was fed into the Dutch written text corpus in the context of the Dutch-Belgian STEVIN project D-COI that was completed in 2007. The sections below will describe the rationale that was the starting point for the corpus development; it will outline the cross-media linking approach adopted within MultimediaN, and finally provide some facts and figures about the corpus.
Within the context of international benchmarks and collection specific projects, much work on spoken document retrieval has been done in recent years. In 2000 the issue of automatic speech recognition for spoken document retrieval was declared 'solved' for the broadcast news domain. Many collections, however, are not in this domain and automatic speech recognition for these collections may contain specific new challenges. This requires a method to evaluate automatic speech recognition optimization schemes for these application areas. Traditional measures such as word error rate and story word error rate are not ideal for this. In this paper, three new metrics are proposed. Their behaviour is investigated on a cultural heritage collection and performance is compared to traditional measurements on TREC broadcast news data.
Access to historical audio collections is typically very restricted: content is often only available on physical (analog) media and the metadata is usually limited to keywords, giving access at the level of relatively large fragments, e.g., an entire tape. Many spoken word heritage collections are now being digitized, which allows the introduction of more advanced search technology. This paper presents an approach that supports online access and search for recordings of historical speeches. A demonstrator has been built, based on the so-called Radio Oranje collection, which contains radio speeches by the Dutch Queen Wilhelmina that were broadcast during World War II. The audio has been aligned with its original 1940s manual transcriptions to create a time-stamped index that enables the speeches to be searched at the word level. Results are presented together with related photos from an external database.
The Proceedings contain the contributions to the workshop on Searching Spontaneous Conversational Speech organized in conjunction with the 30th ACM SIGIR, Amsterdam 2007. The papers reflect some of the emerging focus areas and cross-cutting research topics, together addressing evaluation metrics, segmentation methods, workflow aspects, rich transcription, and robustness.
The deployment and integration of audio processing tools can enhance the semantic annotation of multimedia content, and as a consequence, improve the eectiveness of conceptual access tools. This paper overviews the various ways in which automatic speech and au- dio analysis can contribute to increased granularity of automatically ex- tracted metadata. A number of techniques will be presented, includ- ing the alignment of speech and text resources, large vocabulary speech recognition, key word spotting and speaker classification. The applica- bility of techniques will be discussed from a media crossing perspective. The added value of the techniques and their potential contribution to the content value chain will be illustrated by the description of two (comple- mentary) demonstrators for browsing broadcast news archives.
Work on expressive speech synthesis has long focused on the expression of basic emotions. In recent years, however, interest in other expressive styles has been increasing. The research presented in this paper aims at the generation of a storytelling speaking style, which is suitable for storytelling applications and more in general, for applications aimed at children. Based on an analysis of human storytellers' speech, we designed and implemented a set of prosodic rules for converting "neutral" speech, as produced by a text-to-speech system, into storytelling speech. An evaluation of our storytelling speech generation system showed encouraging results
We discuss the annotation procedure for mental state and emotion that is under development for the AMI (Augmented Multiparty Interaction) corpus. The categories that were found to be most appropriate relate not only to emotions but also to (meta-)cognitive states and interpersonal variables. The history of the development of the annotation scheme is briefly described. The discussion centers around the presentation of the procedure.
This paper discusses audio indexing tools that have been implemented for the disclosure of Dutch audiovisual cultural heritage collections. It explains the role of language models and their adaptation to historical settings and the adaptation of acoustic models for homogeneous audio collections. In addition to the benefits of cross-media linking, the requirements for successful tuning and improvement of available tools for indexing the heterogeneous A/V collections from the cultural heritage domain are reviewed. And finally the paper argues that research is needed to cope with the varying information needs for different types of users.
This report presents the University of Twente's first cross-language speech retrieval exper- iments in Cross-Language Evaluation Forum (CLEF). It describes the issues our contribution was focusing on, it describes the PF/Tijah XML Information Retrieval system that was used and it discusses the results for both the monolingual English and the Dutch-English cross- language spoken document retrieval (CL-SR) task. The paper concludes with an overview of future research plans.
With the 10th anniversary of the death of the Dutch novelist Willem Frederik Hermans (1921-1994), the Willem Frederik Hermans Institute initiated the set-up of a Willem Frederik Hermans portal. Here, all available information related to the Dutch novelist and his work can be consulted. A part of this portal was planned to be dedicated to a collection of spoken audio material. This report describes the search functionality that was attached to this collection by the Human Media Interaction Group of the University of Twente. This project (further refered to as the WFH project) can be regarded a case-study of the disclosure of an oral-history spoken word archive using audio mining technology.
We present the results of two trials testing procedures for the annotation of emotion and mental state of the AMI corpus. The first procedure is an adaptation of the FeelTrace method, focusing on a continuous labelling of emotion dimensions. The second method is centered around more discrete labeling of segments using categorical labels. The results reported are promising for this hard task.
The automatic processing of speech collected in conference style meetings has attracted considerable interest with several large scale projects devoted to this area. This paper describes the development of a baseline automatic speech transcription system for meetings in the context of the AMI (Augmented Multiparty Interaction) project. We present several techniques important to processing of this data and show the performance in terms of word error rates (WERs). An important aspect of transcription of this data is the necessary flexibility in terms of audio pre-processing. Real world systems have to deal with flexible input, for example by using microphone arrays or randomly placed microphones in a room. Automatic segmentation and microphone array processing techniques are described and the eect on WERs is discussed. The system and its components presented in this paper yield compettive performance and form a baseline for future research in this domain.
In this paper we describe the 2005 AMI system for the tran- scription of speech in meetings used for participation in the 2005 NIST RT evaluations. The system was designed for participation in the speech to text part of the evaluations, in particular for transcription of speech recorded with multiple distant microphones and independent headset microphones. System performance was tested on both conference room and lecture style meetings. Although input sources are processed using dierent front-ends, the recognition process is based on a unified sys- tem architecture. The system operates in multiple passes and makes use of state of the art technologies such as discriminative training, vocal tract length normalisation, heteroscedastic linear discriminant analysis, speaker adaptation with maximum likelihood linear regression and mini- mum word error rate decoding. In this paper we describe the system per- formance on the ocial development and test sets for the NIST RT05s evaluations. The system was jointly developed in less than 10 months by a multi-site team and was shown to achieve very competitive perfor- mance.
In this paper, a cross-media browsing demonstrator named InfoLink is described. InfoLink automatically links the content of Dutch broadcast news videos to related information sources in parallel collections containing text and/or video. Automatic segmentation, speech recognition and available meta-data are used to index and link items. The concept is visualized using SMIL-scripts for presenting the streaming broadcast news video and the information links
The application of automatic speech recognition in the broadcast news domain is well studied. Recognition perfor-mance is generally high and accordingly, spoken document re-trieval can successfully be applied in this domain, as demon-strated by a number of commercial systems. In other domains, a similar recognition performance is hard to obtain, or even far out of reach, for example due to lack of suitable training ma-terial. This is a serious impediment for the successful applica-tion of spoken document retrieval techniques for other data then news. This paper outlines our first steps towards a retrieval sys-tem that can automatically be adapted to new domains. We dis-cuss our experience with a recently implemented spoken docu-ment retrieval application attached to a web-portal that aims at the disclosure of a multimedia data collection in the oral history domain. The paper illustrates that simply deploying an off-the-shelf broadcast news system in this task domain will produce error rates that are too high to be useful for retrieval tasks. By applying adaptation techniques on the acoustic level and lan-guage model level, system performance can be improved con-siderably, but additional research on unsupervised adaptation and search interfaces is required to create an adequate search environment based on speech transcripts.
Whereas the growth of storage capacity is in accordance with widely acknowledged predic- tions, the possibilities to index and access the archives created is lagging behind. This is especially the case in the oral history domain and much of the rich content in these collections runs the risk to remain inaccessible for lack of robust search technologies. This paper addresses the history and development of robust audio indexing technology for searching Dutch spoken-word collections and compares Dutch audio indexing in the well-studied broadcast news domain with an oral-history case-study. It is concluded that despite significant advances in Dutch audio index- ing technology and demonstrated applicability in several domains, further research is indispensable for successful automatic disclosure of spoken-word collections. I. I NTRODUCTION The number of digital spoken-word collections is growing rapidly. Due to the ever declining costs of recording audio and video, and due to improved preservation technology huge data sets are created, both by professionals at various types of organisations and non-professionals at home and underway. Partly because of initiatives for retro- spective digitisation, data-growth is also a trend in historical archives. These archives deserve special attention because they represent cultural heritage: a type of content which is rich in terms of cultural value, but has a less obvious economical value. Spoken-word archives belong to the domain of what is often called oral history: recordings of spoken interviews and testimonies on diverging topics such as retrospective narratives, eye witness reports, historical site descriptions, and modern variants such as 'Podcasts' and so-called amateur (audio/video) news1.
On line available http://www.amiproject.org/
In this paper, ongoing work on the development of the speech recognition modules of MMIR environment for Dutch is described. The work on the generation of acoustic models and language models along with their current performance is presented. Some characteristics of the Dutch language and of the target video archives that require special treatment are discussed.
In this paper, ongoing work concerning the language modelling and lexicon optimization of a Dutch speech recognition system for Spoken Document Retrieval is described: the collection and normalization of a training data set and the optimization of our recognition lexicon. Effects on lexical coverage of the amount of training data, of decompounding compound words and of different selection methods for proper names and acronyms are discussed.
De expertise van de Parlevink Groep van de Universiteit Twente op het gebied van Spoken Document Retrieval wordt ingezet voor het ontsluiten van de Nederlandstalige filmarchieven bij het Nederlands Audiovisueel Archief. Dit in het kader van ECHO, een Europees project ter ontwikkeling van een digitale bibliotheekservice voor historische films van grote nationale audiovisuele archieven.
This paper describes a first approach to improve re cognition performance of our hybrid large vocabulary continuo us speech recogniser for Dutch by using co-articulation rules on the phrase level. By applying these rules on the reference tra nscripts used for training the recogniser and by adding a set of spec ial temporary phones that later on will be mapped on the original phones, more robust models of phones that are typically confused a lot in speech recognition like /v/-/f/ and /s/-/z/, could be trained.
The amount of metadata attached to multimedia collections that can be used for searching is very much dependent on the available resources within the organizations that create or own the collections. Large national audiovisual institutions, such as Sound&Vision in The Netherlands,4 put a lot of effort in archiving their assets and they label collection items with at least titles, dates and short content descriptions (descriptive metadata, see Chapter 2). However, many organizations that create or own multimedia collections lack the resources to apply even the most basic form of archiving. Certain collections may become the stepchild of an archive — minimally managed, poorly preserved, and hardly accessible.