Table 3: Questionnaire Results - Likert scale.

Source publication
Conference Paper
We present a novel method for the automatic generation of video summaries of academic presentations using linguistic and paralinguistic features. Our investigation is based on a corpus of academic conference presentations. Summaries are first generated based on keywords taken from transcripts created using automatic speech recognition (ASR). We aug...

Contexts in source publication

Context 1
... Table 3 we show further evaluations between summaries built using all available features, and summaries built using just a subset of features. For audio-only summaries, classification of the paralinguistic features of Speaker Ratings, Audience Engagement, Emphasis, and Comprehension was performed as described in the earlier chapters but by using only audio features, with visual features not being considered. ...
Context 2
... Table 3, we can see that audio-only and visual-only classifications result in summaries which are rated as less easy to understand and less informative than summaries built using full information and summaries built with no keywords. Summaries built using no keywords also lack coherence, while summaries built using all available features score highly on helping users decide if they want to see full presentations. ...

Similar publications

Conference Paper
This paper proposes a new method for weighting two dimensional (2D) time-frequency (T-F) representation of speech using auditory saliency for noise-robust automatic speech recognition (ASR). Auditory saliency is estimated via 2D auditory saliency maps which model the mechanism for allocating human auditory attention. These maps are used to weight T...
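The excerpt above does not include the authors' saliency model, so the following is only a rough sketch of the general idea of saliency-weighted time-frequency features; the centre-surround approximation and the multiplicative weighting rule are illustrative assumptions, not the method proposed in the paper.

# Rough sketch only: the saliency model here is a generic centre-surround
# approximation, assumed for illustration; it is not the auditory saliency
# map described in the paper.
import numpy as np
from scipy.ndimage import gaussian_filter

def saliency_weight(spectrogram: np.ndarray) -> np.ndarray:
    """Weight a 2D time-frequency representation by a crude saliency map.

    spectrogram: 2D array (frequency bins x time frames), e.g. a log-mel
    spectrogram. Saliency is approximated as fine-scale structure minus
    coarse-scale background (a centre-surround difference).
    """
    centre = gaussian_filter(spectrogram, sigma=1.0)
    surround = gaussian_filter(spectrogram, sigma=8.0)
    saliency = np.maximum(centre - surround, 0.0)
    if saliency.max() > 0:
        saliency /= saliency.max()            # normalise to [0, 1]
    return spectrogram * (1.0 + saliency)     # emphasise salient T-F regions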

Citations

... This paper revises and extends our earlier conference paper [7]. It is structured as follows: Section 2 introduces related work in video summarisation and describes the high-level feature classification process. ...
... The temporal order of segments within summaries was preserved. Algorithm 1 below is taken from the conference paper [7]. ...
... The other half began by watching a presentation summary and finished by watching a full, separate presentation. Table 1 is taken from the conference paper [7], and shows the core values for eye-tracking results per video, version and scene. The videos are listed 1 to 4, with plen2 as video 1, prp1 as video 2, prp5 as video 3, and speechRT6 as video 4. Version is listed 1 to 2, where version 1 corresponds to the video summary, and version 2 to the full video. ...
Chapter
We present a method for automatically summarising audio-visual recordings of academic presentations. For generation of presentation summaries, keywords are taken from automatically created transcripts of the spoken content. These are then augmented by incorporating classification output scores for speaker ratings, audience engagement, emphasised speech, and audience comprehension. Summaries are evaluated by performing eye-tracking of participants as they watch full presentations and automatically generated summaries of presentations. Additional questionnaire evaluation of eye-tracking participants is also reported. As part of these evaluations, we automatically generate heat maps and gaze plots from eye-tracking participants which provide further information of user interaction with the content. Automatically generated presentation summaries were found to hold the user’s attention and focus for longer than full presentations. Half of the evaluated summaries were found to be significantly more engaging than full presentations, while the other half were found to be somewhat more engaging.
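This summary does not give the exact scoring formula, but the described combination of transcript keywords with classifier output scores for speaker ratings, audience engagement, emphasised speech, and audience comprehension can be sketched roughly as below; the segment fields, linear weights, and time budget are assumptions made for illustration, not the chapter's actual parameters.

# Illustrative sketch only: field names, weights, and the greedy selection
# under a time budget are assumptions, not the method as published.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Segment:
    text: str
    start: float                 # segment start time (seconds)
    end: float                   # segment end time (seconds)
    keyword_score: float         # e.g. summed weights of keywords in the segment
    speaker_rating: float        # classifier output, assumed normalised to [0, 1]
    engagement: float
    emphasis: float
    comprehension: float

# Hypothetical linear weights for combining linguistic and paralinguistic scores.
WEIGHTS: Dict[str, float] = {
    "keyword": 1.0, "speaker_rating": 0.5, "engagement": 0.5,
    "emphasis": 0.5, "comprehension": 0.5,
}

def score(seg: Segment) -> float:
    """Combine the keyword score with the four classifier output scores."""
    return (WEIGHTS["keyword"] * seg.keyword_score
            + WEIGHTS["speaker_rating"] * seg.speaker_rating
            + WEIGHTS["engagement"] * seg.engagement
            + WEIGHTS["emphasis"] * seg.emphasis
            + WEIGHTS["comprehension"] * seg.comprehension)

def summarise(segments: List[Segment], budget_seconds: float) -> List[Segment]:
    """Pick the highest-scoring segments within a time budget, then restore
    their original temporal order (which the cited work states is preserved)."""
    chosen: List[Segment] = []
    used = 0.0
    for seg in sorted(segments, key=score, reverse=True):
        duration = seg.end - seg.start
        if used + duration <= budget_seconds:
            chosen.append(seg)
            used += duration
    return sorted(chosen, key=lambda s: s.start)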
... The work on audience comprehension, presented in chapter 5, was published in . The summarisation aspect of this work, presented in chapter 6, is published in the papers (Curtis et al., 2017b) and (Curtis et al., 2018b). Finally, details of the collection and annotation of the multimodal dataset used in this research and presented in chapter 3 of this thesis are published in (Curtis et al., 2018a). ...
Thesis
Multimedia archives are expanding rapidly. For these, there exists a shortage of retrieval and summarisation techniques for accessing and browsing content where the main information exists in the audio stream. This thesis describes an investigation into the development of novel feature extraction and summarisation techniques for audio-visual recordings of academic presentations. We report on the development of a multimodal dataset of academic presentations. This dataset is labelled by human annotators to the concepts of presentation ratings, audience engagement levels, speaker emphasis, and audience comprehension. We investigate the automatic classification of speaker ratings and audience engagement by extracting audio-visual features from video of the presenter and audience and training classifiers to predict speaker ratings and engagement levels. Following this, we investigate automatic identification of areas of emphasised speech. By analysing all human annotated areas of emphasised speech, minimum speech pitch and gesticulation are identified as indicating emphasised speech when occurring together. Investigations are conducted into the speaker’s potential to be comprehended by the audience. Following crowdsourced annotation of comprehension levels during academic presentations, a set of audio-visual features considered most likely to affect comprehension levels are extracted. Classifiers are trained on these features and comprehension levels could be predicted over a 7-class scale to an accuracy of 49%, and over a binary distribution to an accuracy of 85%. Presentation summaries are built by segmenting speech transcripts into phrases, and using keywords extracted from the transcripts in conjunction with extracted paralinguistic features. Highest ranking segments are then extracted to build presentation summaries. Summaries are evaluated by performing eye-tracking experiments as participants watch presentation videos. Participants were found to be consistently more engaged for presentation summaries than for full presentations. Summaries were also found to contain a higher concentration of new information than full presentations.
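The thesis excerpt does not specify how keywords are extracted from the transcripts, so the following companion sketch simply scores phrase-level segments of an ASR transcript with TF-IDF as a stand-in for the keyword step; the use of scikit-learn and the summed-weight scoring are assumptions for illustration.

# Minimal sketch, assuming TF-IDF as a stand-in for the (unspecified)
# keyword extraction step over phrase-level transcript segments.
from sklearn.feature_extraction.text import TfidfVectorizer

def keyword_scores(phrases):
    """Score each transcript phrase by the summed TF-IDF weight of its terms.

    phrases: list of strings, one per phrase-level segment of the ASR
    transcript. Returns one score per phrase.
    """
    vectoriser = TfidfVectorizer(stop_words="english")
    tfidf = vectoriser.fit_transform(phrases)   # shape: (n_phrases, n_terms)
    return tfidf.sum(axis=1).A1                 # summed term weights per phrase

if __name__ == "__main__":
    demo = [
        "welcome to this talk on video summarisation",
        "we extract keywords from the speech recognition transcript",
        "thank you all for listening",
    ]
    for phrase, s in zip(demo, keyword_scores(demo)):
        print(f"{s:.3f}  {phrase}")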