Figure 1: Representative gaze plots of participants' attention during full presentations and generated summaries.
Source publication
We present a novel method for the automatic generation of video summaries of academic presentations using linguistic and paralinguistic features. Our investigation is based on a corpus of academic conference presentations. Summaries are first generated based on keywords taken from transcripts created using automatic speech recognition (ASR). We aug...
Contexts in source publication
Context 1
... the representative gaze plots in Figure 1 we can see that participants hold much higher levels of attention during summaries than for full presentations, with far fewer instances of them losing focus or looking around the scene, instead focussing entirely on the slides and speaker. The many small circles over the slides area represent a large number of smaller fixations, indicating high engagement. ...
Context 2
... results of eye-tracking experiments performed in this study indicate that generated summaries tend to contain a higher concentration of relevant information than full presentations, as indicated by the higher proportion of time participants spend carefully reading slides during summaries than during full presentations, and also by the lower proportion of time spent fixating on areas outside of the attention zone during summaries than during full presentations. This can be seen from Table 2 and Figures 1, 2, 3 and 4. ...
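As a rough illustration of the fixation-based measure referred to above (not the study's actual analysis code), the sketch below computes the proportion of total fixation time spent inside a rectangular "attention zone" covering the slides and speaker. The fixation record format and the zone coordinates are placeholder assumptions.

```python
# Illustrative only: the attention-zone rectangle and fixation record format
# are assumptions, not taken from the paper.
from dataclasses import dataclass

@dataclass
class Fixation:
    x: float             # gaze x-coordinate in screen pixels
    y: float             # gaze y-coordinate in screen pixels
    duration_ms: float   # fixation duration in milliseconds

def attention_zone_proportion(fixations, zone=(0, 0, 1280, 720)):
    """Fraction of total fixation time falling inside the attention zone.

    `zone` is (x_min, y_min, x_max, y_max); here it is a placeholder
    covering the slides-and-speaker region of the recording.
    """
    x0, y0, x1, y1 = zone
    inside = sum(f.duration_ms for f in fixations
                 if x0 <= f.x <= x1 and y0 <= f.y <= y1)
    total = sum(f.duration_ms for f in fixations)
    return inside / total if total else 0.0

# Example: three fixations, one of which lies outside the zone.
fixes = [Fixation(300, 200, 400), Fixation(900, 500, 250), Fixation(1500, 100, 150)]
print(f"inside-zone proportion: {attention_zone_proportion(fixes):.2f}")
```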
Similar publications
This paper proposes a new method for weighting the two-dimensional (2D) time-frequency (T-F) representation of speech using auditory saliency for noise-robust automatic speech recognition (ASR). Auditory saliency is estimated via 2D auditory saliency maps which model the mechanism for allocating human auditory attention. These maps are used to weight T...
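To make the idea behind this related approach concrete, here is a minimal sketch, assuming a standard spectrogram and a crude difference-of-Gaussians stand-in for the auditory saliency map; the paper's actual saliency model of human auditory attention is more elaborate, and all parameters below are illustrative.

```python
# Rough illustration only: a difference-of-Gaussians contrast map stands in
# for the paper's auditory saliency maps.
import numpy as np
from scipy.signal import spectrogram
from scipy.ndimage import gaussian_filter

def saliency_weighted_spectrogram(audio, fs):
    # 2D time-frequency representation of the input speech signal
    f, t, S = spectrogram(audio, fs=fs, nperseg=400, noverlap=240)
    log_S = np.log(S + 1e-10)

    # Crude "saliency": centre-surround contrast via difference of Gaussians
    center = gaussian_filter(log_S, sigma=1)
    surround = gaussian_filter(log_S, sigma=8)
    saliency = np.abs(center - surround)
    saliency /= saliency.max() + 1e-10   # normalise to [0, 1]

    # Emphasise salient T-F cells before feature extraction for ASR
    return log_S * saliency

# Example with one second of synthetic audio at 16 kHz
fs = 16000
audio = np.random.randn(fs).astype(np.float32)
print(saliency_weighted_spectrogram(audio, fs).shape)
```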
Citations
... This paper revises and extends our earlier conference paper [7]. It is structured as follows: Section 2 introduces related work in video summarisation and describes the high level feature classification process. ...
... The temporal order of segments within summaries was preserved. Algorithm 1 below is taken from the conference paper [7]. ...
... The other half began by watching a presentation summary and finished by watching a full, separate presentation. Table 1 is taken from the conference paper [7], and shows the core values for eye-tracking results per video, version and scene. The videos are listed 1 to 4, with plen2 as video 1, prp1 as video 2, prp5 as video 3, and speechRT6 as video 4. Version is listed 1 to 2, where version 1 corresponds to the video summary, and version 2 to the full video. ...
We present a method for automatically summarising audio-visual recordings of academic presentations. For generation of presentation summaries, keywords are taken from automatically created transcripts of the spoken content. These are then augmented by incorporating classification output scores for speaker ratings, audience engagement, emphasised speech, and audience comprehension. Summaries are evaluated by performing eye-tracking of participants as they watch full presentations and automatically generated summaries of presentations. Additional questionnaire evaluation of eye-tracking participants is also reported. As part of these evaluations, we automatically generate heat maps and gaze plots from eye-tracking participants which provide further information of user interaction with the content. Automatically generated presentation summaries were found to hold the user’s attention and focus for longer than full presentations. Half of the evaluated summaries were found to be significantly more engaging than full presentations, while the other half were found to be somewhat more engaging.
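The heat maps mentioned in this abstract can be produced in the usual way from raw gaze samples: accumulate gaze coordinates into a 2D histogram over the video frame and smooth it with a Gaussian kernel. The sketch below assumes placeholder frame dimensions and smoothing width rather than the study's settings.

```python
# Placeholder sketch of gaze heat-map generation; resolution and smoothing
# width are illustrative, not the study's settings.
import numpy as np
from scipy.ndimage import gaussian_filter

def gaze_heatmap(gaze_xy, frame_w=1280, frame_h=720, sigma=30):
    """Accumulate (x, y) gaze samples into a smoothed 2D density map."""
    heat = np.zeros((frame_h, frame_w), dtype=np.float64)
    for x, y in gaze_xy:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < frame_w and 0 <= yi < frame_h:
            heat[yi, xi] += 1.0
    heat = gaussian_filter(heat, sigma=sigma)
    return heat / (heat.max() + 1e-12)   # normalise to [0, 1] for display

# Example: a cluster of gaze samples around the slide area
samples = [(400 + dx, 300 + dy) for dx in range(0, 50, 5) for dy in range(0, 50, 5)]
print(gaze_heatmap(samples).max())
```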
... The work on audience comprehension, presented in chapter 5, was published in . The summarisation aspect of this work, presented in chapter 6, is published in the papers (Curtis et al., 2017b) and (Curtis et al., 2018b). Finally, details of the collection and annotation of the multimodal dataset used in this research and presented in chapter 3 of this thesis are published in (Curtis et al., 2018a). ...
Multimedia archives are expanding rapidly, yet there is a shortage of retrieval and summarisation techniques for accessing and browsing content whose main information lies in the audio stream. This thesis describes an investigation into the development of novel feature extraction and summarisation techniques for audio-visual recordings of academic presentations.
We report on the development of a multimodal dataset of academic presentations. This dataset is labelled by human annotators for presentation ratings, audience engagement levels, speaker emphasis, and audience comprehension. We investigate the automatic classification of speaker ratings and audience engagement by extracting audio-visual features from video of the presenter and audience and training classifiers to predict speaker ratings and engagement levels. Following this, we investigate the automatic identification of areas of emphasised speech. Analysis of all human-annotated areas of emphasised speech identifies the co-occurrence of minimum speech pitch and gesticulation as an indicator of emphasised speech.
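A minimal sketch of that pitch-plus-gesture co-occurrence rule, assuming a per-frame pitch track (0 Hz where unvoiced) and an aligned boolean gesticulation track extracted from the video; the lowest-quartile pitch threshold is an illustrative assumption, not the thesis's exact criterion.

```python
# Illustrative co-occurrence rule; the pitch threshold and frame alignment
# are assumptions, not the thesis's exact procedure.
import numpy as np

def emphasised_frames(pitch_hz, gesticulating, low_pitch_quantile=0.25):
    """Flag frames where near-minimum speech pitch and gesticulation co-occur.

    pitch_hz      : per-frame fundamental frequency, 0 where unvoiced
    gesticulating : per-frame boolean gesture indicator from the video
    """
    pitch = np.asarray(pitch_hz, dtype=float)
    gest = np.asarray(gesticulating, dtype=bool)
    voiced = pitch > 0
    if not voiced.any():
        return np.zeros_like(gest, dtype=bool)

    # "Minimum speech pitch": frames in the lowest quartile of voiced pitch
    threshold = np.quantile(pitch[voiced], low_pitch_quantile)
    low_pitch = voiced & (pitch <= threshold)

    # Emphasis is flagged only where both cues occur together
    return low_pitch & gest

pitch = np.array([0, 110, 95, 90, 180, 175, 92, 0])
gest = np.array([0, 1, 1, 0, 1, 0, 1, 0], dtype=bool)
print(emphasised_frames(pitch, gest))
```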
Investigations are then conducted into the speaker's potential to be comprehended by the audience. Following crowdsourced annotation of comprehension levels during academic presentations, a set of audio-visual features considered most likely to affect comprehension levels is extracted. Classifiers trained on these features predict comprehension levels over a 7-class scale to an accuracy of 49%, and over a binary distribution to an accuracy of 85%.
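The classification setup described above might look roughly like the following, here sketched with scikit-learn on random placeholder features; the real feature set, classifier choice, and evaluation protocol are those of the thesis, and the 49% and 85% accuracies come from its experiments, not from this toy example.

```python
# Toy illustration with random placeholder features; the real feature set,
# classifier, and evaluation protocol are the thesis's, not this sketch's.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))        # audio-visual feature vectors (placeholder)
y7 = rng.integers(1, 8, size=200)     # 7-class comprehension labels (1..7)
y2 = (y7 >= 4).astype(int)            # binary split of the same labels

clf7 = RandomForestClassifier(n_estimators=200, random_state=0)
clf2 = RandomForestClassifier(n_estimators=200, random_state=0)

print("7-class accuracy:", cross_val_score(clf7, X, y7, cv=5).mean())
print("binary accuracy:", cross_val_score(clf2, X, y2, cv=5).mean())
```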
Presentation summaries are built by segmenting speech transcripts into phrases and scoring them using keywords extracted from the transcripts in conjunction with extracted paralinguistic features. The highest-ranking segments are then extracted to build presentation summaries. Summaries are evaluated through eye-tracking experiments as participants watch presentation videos. Participants were found to be consistently more engaged with presentation summaries than with full presentations, and summaries were found to contain a higher concentration of new information than full presentations.
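A minimal sketch of the segment-ranking step described in this abstract: phrases are scored by keyword occurrences plus weighted paralinguistic scores, the highest-ranking segments are kept up to a duration budget, and their temporal order is preserved (as noted in the citation contexts above). The field names, weights, and budget are illustrative assumptions, not the thesis's exact algorithm.

```python
# Illustrative ranking only: scoring weights, field names, and the duration
# budget are assumptions made for this sketch.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float        # start time in seconds
    end: float          # end time in seconds
    text: str           # ASR transcript of the phrase
    emphasis: float     # paralinguistic scores in [0, 1]
    engagement: float

def summarise(segments, keywords, budget_s=120.0, w_emph=1.0, w_eng=1.0):
    def score(seg):
        kw_hits = sum(seg.text.lower().count(k) for k in keywords)
        return kw_hits + w_emph * seg.emphasis + w_eng * seg.engagement

    ranked = sorted(segments, key=score, reverse=True)

    chosen, used = [], 0.0
    for seg in ranked:                       # take highest-ranking segments
        length = seg.end - seg.start
        if used + length <= budget_s:
            chosen.append(seg)
            used += length

    return sorted(chosen, key=lambda s: s.start)   # preserve temporal order

segs = [
    Segment(0, 30, "introduction to the topic", 0.2, 0.3),
    Segment(30, 70, "key results on speech recognition", 0.8, 0.7),
    Segment(70, 150, "detailed derivation", 0.4, 0.5),
]
summary = summarise(segs, keywords=["speech", "results"], budget_s=90)
print([(s.start, s.end) for s in summary])
```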