Summarising Academic Presentations using Linguistic and
Paralinguistic Features
Keith Curtis1, Gareth J. F. Jones1 and Nick Campbell2
1ADAPT Centre, School of Computing, Dublin City University, Dublin, Ireland
2ADAPT Centre, School of Computer Science & Statistics, Trinity College Dublin, Dublin, Ireland
{Keith.Curtis, Gareth.Jones}@dcu.ie, nick@tcd.ie
Keywords: Video Summarisation, Classification, Evaluation, Eye-Tracking.
Abstract: We present a novel method for the automatic generation of video summaries of academic presentations using linguistic and paralinguistic features. Our investigation is based on a corpus of academic conference presentations. Summaries are first generated based on keywords taken from transcripts created using automatic speech recognition (ASR). We augment spoken phrases by incorporating scores for audience engagement, comprehension and speaker emphasis. We evaluate the effectiveness of the summaries generated for individual presentations by performing eye-tracking evaluation of participants as they watch summaries and full presentations, and by questionnaire of participants upon completion of the eye-tracking studies. We find that automatically generated summaries tend to maintain the user's focus and attention for longer, with users losing focus much less often than for full presentations.
1 INTRODUCTION
Online archives of multimedia content are growing rapidly. Every minute in 2016, 300 hours of new video material was uploaded to YouTube, while 2.78 million videos were viewed. Browsing content of interest in such multimedia archives, whether in response to a user query or in informal exploration, is time consuming and a growing challenge for users. The goal of this work is to provide an effective and efficient way to summarise audio-visual recordings where the significant information is primarily in the audio stream.
Existing multimedia information retrieval research has focused on matching text queries against written meta-data or transcribed audio (Chechik et al., 2008), in addition to seeking to match visual queries to low-level features such as colours, textures, shape and object recognition, plus scene-type classification (urban, countryside, places, etc.) (Huiskes et al., 2010). In the case of multimedia content where the information is primarily visual, content can be represented by multiple keyframes extracted using methods such as object recognition and facial detection. These are matched against visual queries, and retrieved videos are shown to the user using keyframe surrogates. Matching of the visual component of these queries is generally complemented by text search against a transcript of any available spoken audio and any meta-data provided (Lew et al., 2006).
The above methods are limited to content with a significant visual dimension, or where the spoken content makes the subject clear or relevance is to some extent unambiguous. However, significant amounts of multimedia content do not have these features, e.g. public presentations such as lectures, which largely focus on a single speaker talking at length on a single topic, or meetings where multiple speakers discuss a range of previously selected issues. In this research, we explore novel methods of summarising academic presentations, where most of the information exists in the audio stream, using linguistic and paralinguistic features, and evaluate the effectiveness of the automatically generated summaries.
In this study we hypothesise that the classification of areas of audience engagement, speaker emphasis, and the speaker's potential to be comprehended by the audience can be used to improve summarisation methods for academic presentations. We address the following research question: "Can areas of special emphasis provided by the speaker, combined with detected areas of high audience engagement and high levels of audience comprehension, be used for effective summarisation of audio-visual recordings of presentations?" We evaluate this summarisation approach using eye-tracking and by questionnaire.
This paper is structured as follows: Section 2 introduces related work in video summarisation in addition to describing the classification of high-level features. Section 3 introduces the multimodal corpus used for our experiments, while Section 4 describes the procedure for creating automatic summaries. This is followed by Section 5, which describes the evaluation tasks performed and their results. Finally, conclusions are offered in Section 6.
2 PREVIOUS WORK
This section looks at related work on the summarisation and skimming of audio-visual recordings of academic presentations.
(Ju et al., 1998) present a system for analysing and annotating video sequences of technical talks. They use a robust motion estimation technique to detect key frames and segment the video into subsequences containing a single slide. Potential gestures are tracked using active contours. This first automatic video analysis system helps users to access presentation videos intelligently.
(He et al., 1999) use prosodic information from the audio stream, in addition to pause information, to identify speaker emphasis during presentations. They develop three summarisation algorithms focusing on: slide-transition-based summarisation, pitch-activity-based summarisation, and summarisation based on slide, pitch and user-access information. They found that speaker emphasis did not provide sufficient information to generate effective summaries.
(Joho et al., 2009) capture and analyse the user's facial expressions for the generation of perception-based summaries which exploit the viewer's affective state, perceived excitement and attention. Perception-based approaches are designed to overcome the semantic gap problem in summarisation. Their results suggest that there are at least two or three distinct parts of videos that are seen as highlights by various viewers.
(Pavel et al., 2014) created a set of tools for creating video digests of informational videos. Informal evaluation suggests that these tools make it easier for authors of informational talks to create video digests, and that video digests afford browsing and skimming better than alternative video presentation techniques.
We aim to extend this work by classifying the most engaging and comprehensible parts of presentations and identifying emphasised regions within them, and then summarising the presentations to include highly rated regions of these high-level concepts, in addition to keywords taken from the transcript of the presentation.
2.1 High-Level Concept Classification
The novel video summarisation method reported in this study incorporates the high-level concepts of audience engagement, emphasised speech and the speaker's potential to be comprehended. In this section we overview previous work on the development of these high-level concept detectors.
2.1.1 Classification of Audience Engagement
Prediction of audience engagement levels and the application of ratings for 'good' speaking techniques were performed by (Curtis et al., 2015). This was achieved by first employing human annotators to watch video segments of presentations and to provide a rating of how good the speaker was at presenting the material. Audience engagement levels were measured in a similar manner by having annotators watch video segments of the audience at academic presentations and provide estimates of how engaged the audience appeared to be as a whole. Classifiers were trained on extracted audio-visual features using an Ordinal Class Classifier.
It was demonstrated that the qualities of a 'good' speaker can be predicted to an accuracy of 73% over a 4-class scale. Using speaker-based features alone, audience engagement levels can be predicted to an accuracy of 68% over the same scale. By combining these with basic visual features from the audience as a whole, this can be improved to 70% accuracy.
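To make this classification setup concrete, the short sketch below trains a multi-class model on a hypothetical per-segment feature matrix. The features, labels and the use of scikit-learn are assumptions for illustration only; the original work used an Ordinal Class Classifier, which is approximated here by a plain multi-class logistic regression.

    # Illustrative sketch only: hypothetical features and labels, not the original pipeline.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)

    # Hypothetical per-segment speaker features (e.g. mean pitch, pitch variance,
    # speech rate, amount of visual motion).
    X = rng.normal(size=(200, 4))

    # Hypothetical 4-class engagement labels (0 = least engaged .. 3 = most engaged).
    y = rng.integers(0, 4, size=200)

    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"cross-validated accuracy: {scores.mean():.2f}")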
2.1.2 Identification of Emphasised Speech
Identification of intentional or unintentional emphasised speech was performed by (Curtis et al., 2017a). This was achieved by having human annotators label areas of emphasised speech. Annotators were asked to watch short video clips and to mark areas where they considered the speech to be emphasised, either intentionally or unintentionally. Basic audio-visual features of audio pitch and visual motion were extracted from the data. From the analysis performed, it was clear that speaker emphasis occurred during areas of high pitch, but also during areas of high visual motion coinciding with areas of high pitch.
Candidate emphasised regions were marked from extracted areas of pitch within the top 1, 5 and 20 percentiles of pitch values, in addition to gesticulation from the top 20 down to the top 40 percentile of values. All annotated areas of emphasis contained significant gesturing in addition to pitch within the top 20 percentile. Gesturing was also found to take place in non-emphasised parts of speech; however, this was much more casual and was not accompanied by pitch in the top 20 percentile.
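As an illustration of this percentile-based marking of candidate emphasised regions, the sketch below flags frames where pitch falls in the top 20 percentile and gesticulation in the top 40 percentile. The frame-level arrays and thresholds are assumptions for illustration, not the original feature extraction.

    # Illustrative sketch: mark candidate emphasised frames from pitch and motion percentiles.
    import numpy as np

    rng = np.random.default_rng(0)
    pitch = rng.gamma(shape=2.0, scale=60.0, size=1000)   # hypothetical per-frame F0 values (Hz)
    motion = rng.gamma(shape=2.0, scale=1.0, size=1000)   # hypothetical per-frame gesticulation scores

    # Top 20 percentile of pitch and top 40 percentile of motion,
    # following the percentile bands described in the text.
    pitch_thresh = np.percentile(pitch, 80)
    motion_thresh = np.percentile(motion, 60)

    candidate_emphasis = (pitch >= pitch_thresh) & (motion >= motion_thresh)
    print(f"{candidate_emphasis.sum()} of {len(pitch)} frames marked as candidate emphasis")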
2.1.3 Predicting the Speaker's Potential to be Comprehended
Prediction of audience comprehension was performed by (Curtis et al., 2016). In this work, human annotators were recruited through crowd-sourcing. Annotators were asked to watch each section of a presentation, first to provide a textual summary of the contents of that section, and then to provide an estimate of how much of the material they comprehended during that section. Audio-visual features were extracted from video of the presenter, in addition to visual features extracted from video of the audience and OCR over the slides for each presentation. Additional fluency features were extracted from the speaker audio. Using these extracted features, a classifier was trained to predict the speaker's potential to be comprehended.
It was demonstrated that it is possible to build a classifier to predict potential audience comprehension levels, obtaining an accuracy of 52.9% over a 7-class range, and of 85.4% over a binary classification problem.
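The sketch below illustrates the kind of feature-level fusion used to combine presenter, audience, slide-OCR and fluency features before training a comprehension classifier, and the collapse of 7-point labels to a binary problem. The arrays, dimensions and classifier choice are illustrative assumptions, not the original implementation.

    # Illustrative sketch: feature-level fusion of several modalities for comprehension prediction.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n_segments = 150

    speaker_av = rng.normal(size=(n_segments, 6))    # hypothetical presenter audio-visual features
    audience_vis = rng.normal(size=(n_segments, 3))  # hypothetical audience visual features
    slide_ocr = rng.normal(size=(n_segments, 2))     # hypothetical slide-text (OCR) features
    fluency = rng.normal(size=(n_segments, 2))       # hypothetical speech-fluency features

    X = np.hstack([speaker_av, audience_vis, slide_ocr, fluency])  # fusion by concatenation
    y7 = rng.integers(1, 8, size=n_segments)         # hypothetical 7-point comprehension labels
    y_bin = (y7 >= 4).astype(int)                    # illustrative collapse to a binary problem

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    print("7-class accuracy:", cross_val_score(clf, X, y7, cv=5).mean())
    print("binary accuracy: ", cross_val_score(clf, X, y_bin, cv=5).mean())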
3 MULTIMODAL CORPUS
No standard publicly available dataset exists for work of this nature. Since we require recordings of both the audience and the speaker for academic presentations, we used the International Speech Conference Multi-modal corpus (ISCM) from (Curtis et al., 2015). This contains 31 academic presentations totalling 520 minutes of video, with high-quality 1080p parallel video recordings of both the speaker and the audience for each presentation. We chose four videos from this dataset for the evaluation of our video summaries, to ensure good coverage while keeping the amount of evaluation manageable. Analysis of these four videos showed them to be: the most engaging, the least engaging, the most comprehensible, and the video with the highest presentation ratings, and they were selected for use in our summarisation study for this reason.
4 CREATION OF PRESENTATION SUMMARIES
This section describes the steps involved in the generation of presentation summaries. Summaries were generated using ASR transcripts, significant keywords extracted from the transcripts, and annotated values for 'good' public speaking techniques, audience engagement, intentional or unintentional speaker emphasis, and the speaker's potential to be comprehended. Human-annotated values of these paralinguistic features were used for the summaries evaluated in the eye-tracking experiments, to ensure that these summaries used the best possible classifications of these features. The numbered steps below correspond to the numbers appearing in Algorithm 1; an illustrative Python sketch of this procedure is given after the algorithm. Presentation summaries were generated to between 20% and 30% of the original presentation lengths for this evaluation.
1. Using the ASR transcripts, we use pause information, which gives start and end times for each spoken phrase during the presentation. This provides a basis for separating presentations at the phrase level.
2. We first apply a ranking to each phrase based on the number of significant keywords (words of significance) extracted from the transcripts that it contains. For the first set of baseline summaries, we generate summaries using the highest-ranking sentences. Following this, an additional ranking is applied to each sentence based on human annotations of 'good' speaking techniques.
3. Speaker Ratings are halved before applying this ranking to each phrase. We halve the values for speaker ratings so as not to overvalue this feature, as these values are already encompassed in the classification of audience engagement levels.
4. Following this, audience engagement annotations are also applied directly. We take the true annotated engagement level and apply this ranking to each sentence contained within each segment throughout the presentation.
5. As emphasis was not annotated for all videos, we use automatic classifications for intentional or unintentional speaker emphasis. For each classification of emphasis, we apply an additional ranking of 1 to the phrase containing that emphasised part of speech.
6. Finally, we use the human-annotated values for audience comprehension throughout the dataset. Once again, the final comprehension value for each segment is also applied to each sentence within that segment. For the weightings, we choose to halve the Speaker Rating annotation while keeping the original values for the other annotations; this reduces the impact of Speaker Ratings on the final output. Points of emphasis receive a ranking of 1, while keywords receive a ranking of 2, in order to prioritise the role of keywords in the summary generation process. To generate the final set of video summaries, the highest-ranking phrases in the set are selected. To achieve this, the final ranking for each phrase is normalised to between 0 and 1.
7. By then assigning an initial threshold value of 0.9, and reducing this by 0.03 on each iteration, we select each sentence with a ranking above the threshold value. By calculating the length of each selected sentence, we can apply a minimum size to our generated video summaries. The final selected segments are then joined together to generate small, medium and large summaries for each presentation.
Algorithm 1: Generate Summaries.
for all Sentences S do                      (1)
    if S contains Keyword then              (2)
        S ← S + 2
    end if
    E ← Engagement
    SR ← SpeakerRating
    Es ← Emphasis
    C ← Comprehension
    S ← S + E                               (3)
    S ← S + SR / 2                          (4)
    S ← S + Es                              (5)
    S ← S + C                               (6)
end for
while Summary < length do
    if S ≥ Threshold then
        Summary ← Summary + S               (7)
    end if
end while
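The following Python sketch illustrates the phrase-ranking and threshold-selection procedure of Algorithm 1. The phrase data structure and helper names are assumptions for illustration; in the paper, the keyword counts, engagement, speaker ratings, emphasis and comprehension values come from the ASR transcript and from the annotations or classifier outputs described above.

    # Illustrative sketch of the phrase ranking and threshold selection in Algorithm 1.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Phrase:
        start: float            # start time (s), from ASR pause information
        end: float              # end time (s)
        keyword_count: int      # number of significant keywords in the phrase
        engagement: float       # annotated audience engagement for the enclosing segment
        speaker_rating: float   # annotated speaker rating for the enclosing segment
        emphasis: int           # 1 if the phrase contains classified emphasised speech, else 0
        comprehension: float    # annotated comprehension for the enclosing segment
        score: float = 0.0

    def rank_phrases(phrases: List[Phrase]) -> None:
        """Apply the weighted rankings: keywords x2, speaker rating halved, others as-is."""
        for p in phrases:
            p.score = (2 * p.keyword_count        # step 2: keywords weighted by 2
                       + p.engagement             # step 3
                       + p.speaker_rating / 2.0   # step 4: speaker rating halved
                       + p.emphasis               # step 5: +1 per emphasised phrase
                       + p.comprehension)         # step 6
        top = max(p.score for p in phrases) or 1.0
        for p in phrases:                         # normalise scores to [0, 1]
            p.score /= top

    def select_summary(phrases: List[Phrase], min_length: float) -> List[Phrase]:
        """Step 7: lower the threshold from 0.9 in steps of 0.03 until the summary is long enough."""
        threshold, selected = 0.9, []
        while sum(p.end - p.start for p in selected) < min_length and threshold > 0:
            selected = [p for p in phrases if p.score >= threshold]
            threshold -= 0.03
        return sorted(selected, key=lambda p: p.start)  # preserve temporal order

Under these assumptions, passing a min_length of roughly 20-30% of the presentation duration would yield summaries in the size range used for this evaluation.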
5 EVALUATION OF VIDEO SUMMARIES
Presentation summaries are only as effective as their target audience finds them to be. In this paper we provide a comprehensive evaluation of the generated presentation summaries in order to determine the effectiveness of this summarisation strategy. We carry out our study using an eye-tracking system in which participants watch full presentations and separate presentation summaries. From studying the eye movements and focus of participants, we can make inferences as to how engaged participants were as they watched presentations and summaries.
Eye-tracking is used for this evaluation because, as shown in previous work, an increased number of shorter fixations is consistent with higher cognitive activity (attention), while a reduced number of longer fixations is consistent with lower attention (Rayner and Sereno, 1994). This allows us to understand clearly whether the generated summaries have any effect on the levels of attention and engagement of participants as they watch presentation summaries.
Questionnaires were also given to participants in order to discover how useful and effective they considered the presentation summaries to be. In addition, by summarising using only a subset of all available features, we aimed to discover how effective the individual features are, by crowdsourcing a separate questionnaire on presentation summaries generated using subsets of the available features. The feature sets used for this further evaluation were: full-feature classifications, visual-only classifications, audio-only classifications, and full-feature classifications with no keywords.
5.1 Gaze-Detection Evaluation
For the eye-tracking, participants watched one full presentation whilst having their eye movements tracked. Participants also watched a summary of a separate presentation, again whilst having their eye movements tracked. The question addressed here was whether or not participants retained attention for longer periods for summaries than for full presentations, to test the hypothesis that summaries were engaging and comprehensible.
The eye-tracking study was completed by a total of 24 participants. As there were 4 videos to be evaluated in total, eight different test conditions were developed, with 4 participants per test. This allowed for full variation of the order in which participants watched the videos. Half of all participants began by watching a full presentation and finished by watching a summary of a separate presentation. The other half began by watching a presentation summary and finished by watching a full, separate presentation. This was to prevent any issues of bias or fatigue from influencing the results.
Table 1 shows the core values for the eye-tracking measurements per video, version and scene. The videos are numbered 1 to 4, with plen2 as video 1, prp1 as video 2, prp5 as video 3, and speechRT6 as video 4. Version is listed as 1 or 2, where version 1 corresponds to the video summary and version 2 to the full video. The overall scene is scene 1, and the attention scene, the area around the slides and the speaker, is scene 2. The measurements obtained are: number of fixations, mean fixation length, total sum of fixation lengths, percentage of time fixated per scene, fixation count, and number of fixations per second.
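For clarity, the sketch below shows how measures of this kind can be derived from a list of raw fixation records. The record format, scene labels and values are assumptions for illustration, not the output of the eye-tracking system used in the study.

    # Illustrative sketch: derive Table 1 style measures from raw fixation records.
    # Each fixation is (duration_seconds, scene_label); "atten" denotes the
    # slides-and-speaker area, "other" anything else.
    fixations = [(0.45, "atten"), (0.30, "atten"), (0.80, "other"), (0.52, "atten")]
    viewing_time = 10.0  # total viewing time in seconds (hypothetical)

    n_fixations = len(fixations)
    total_fixated = sum(d for d, _ in fixations)
    mean_fixation = total_fixated / n_fixations
    pct_fixated_atten = 100.0 * sum(d for d, s in fixations if s == "atten") / viewing_time
    fixations_per_second = n_fixations / viewing_time

    print(f"fixations: {n_fixations}, mean fixation: {mean_fixation:.3f} s")
    print(f"% time fixated on attention scene: {pct_fixated_atten:.1f}")
    print(f"fixations per second: {fixations_per_second:.3f}")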
From Table 1, we can see that participants consistently spend a higher proportion of the time fixating on the scene for summaries than for the full presentation video. This is repeated to an even larger extent for fixation counts, where the figure is consistently higher for summaries than for full presentations. This is evidence of increased levels of participant engagement for video summaries compared with full presentation videos.
We can also see that the number of fixations per second is consistently higher for video summaries, while the mean fixation length is consistently shorter for summaries. As previous work has shown, an increased number of shorter fixations is consistent with higher cognitive activity (attention), while a reduced number of longer fixations is consistent with lower attention (Rayner and Sereno, 1994). This shows that all video summaries attract higher levels of participant attention than the corresponding full presentations.
Table 2 shows a statistically significant (p < 0.05) difference between the summary and full versions of video 1 for the number of fixations per 100 seconds. This indicates that the video 1 summary is more engaging than the full presentation. For video 2, a statistically significant difference (p < 0.05) is observed in the percentage of time fixated per scene, and a smaller, not statistically significant, difference in the fixation count per 100 seconds. Participants still spend a higher proportion of their time fixating on the attention scene for summaries than for full presentations.
The video 3 results show a large difference between the two scenes of the video: the percentage of time spent fixating on the attention scene during the summary differs from that during the full presentation at the p < 0.1 level. Video 4 shows a statistically significant difference between the summary and full versions for the number of fixations per 100 seconds (p < 0.05). This video also shows a difference at the p < 0.1 level for the mean fixation duration between the full presentation and the presentation summary. This indicates that users found a much higher concentration of new information in the summary than in the full version. These differences can be inspected further by looking back at Table 1.
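The comparisons in Table 2 use Scheffé post-hoc tests. As a simplified illustration of testing a summary-versus-full difference, the sketch below runs a plain one-way ANOVA on hypothetical per-participant fixation counts; this is not the exact procedure used in the paper, and the data are invented for the example.

    # Illustrative sketch: test whether fixation counts per 100 s differ between
    # summary and full viewings, using a one-way ANOVA on hypothetical data.
    from scipy.stats import f_oneway

    # Hypothetical per-participant fixation counts per 100 seconds.
    summary_fc100 = [198, 205, 189, 211, 202, 195, 208, 200]
    full_fc100 = [162, 170, 158, 166, 175, 161, 169, 164]

    stat, p_value = f_oneway(summary_fc100, full_fc100)
    print(f"F = {stat:.2f}, p = {p_value:.4f}")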
5.2 Gaze Plots
In this section, we show gaze plots from our eye-tracking study. Gaze plots are data visualisations which communicate important aspects of visual behaviour clearly. By looking carefully at the plots for full and summary videos, the difference in attention and focus between the two video types becomes more clearly defined. For each video, 4 representative gaze plots were chosen: 2 (top) from full presentations, and 2 (below) from summaries.
From the representative gaze plots in Figure 1 we can see that participants hold much higher levels of attention during summaries than during full presentations, with far fewer instances of them losing focus or looking around the scene; instead they focus almost entirely on the slides and speaker. The many small circles over the slides area represent a large number of short fixations, indicating high engagement.
From the representative plots in Figure 2 we see large improvements for summaries over the full presentations. While participants still lose focus on occasion, and the improvement over the full presentations is not as pronounced as for the previous video, large gains are still made, with the vast majority of fixations taking place over the presentation slides and the speaker. For comparison, the gaze plots for the full video show that fixations tended to be quite dispersed.
The representative plots in Figure 3 show how the number of occasions on which participants lose focus is reduced, a big improvement over the full presentations. The gaze plots show the difference for this video much better than the statistical tests in the previous section do. For the full presentations, fixations are very dispersed, with large numbers of fixations away from the slides and speakers. The summaries show a large improvement, with a much reduced number of instances of participants losing focus.
From the representative plots in Figure 4 we can see that while the summaries are imperfect, with some instances of participants losing attention, large improvements in attention and focus are made, although this may depend on how engaging the videos were in the first place. While the summaries for Video 4 (speechRT6) still show some instances of participants losing focus, the original full presentation was found to be the least engaging video in the dataset. This is also noticeable from the gaze plots, which show a high number of fixations away from the slides and presenters. Gaze plots of the summaries also show shorter fixations than the full presentation gaze plots, which indicates higher levels of engagement for the presentation summaries, in addition to these fixations being positioned predominantly over the slides and speakers.
Table 1: Totals per video, version and scene.
Vid Version Scene Fixations Mean-fixation(s) Time-fixated(s) %-fixated Fixation-count Fixations/sec
1 Summ Whole 432.5 0.492 203.986 94.438 432.625 2.00289
1 Summ Atten 417.37 0.495 199.17 92.208 417.375 1.93229
1 Full Whole 1580 0.582 889.486 92.079 1580 1.63561
1 Full Atten 1523 0.587 865.952 89.643 1523 1.5766
2 Summ Whole 311.87 0.695 209.284 95.129 311.875 1.41761
2 Summ Atten 293.25 0.71 201.996 91.816 293.25 1.33295
2 Full Whole 1153.12 0.709 780.864 89.446 1153.125 1.32087
2 Full Atten 1091.12 0.724 761.826 87.265 1091.125 1.24987
3 Summ Whole 224.37 0.62 135.407 91.491 224.375 1.51605
3 Summ Atten 193.87 0.694 130.091 87.899 193.875 1.30997
3 Full Whole 1076.75 0.641 643.995 83.963 1076.75 1.40385
3 Full Atten 891.37 0.8 656.5 85.593 891.375 1.16216
4 Summ Whole 406.37 0.431 169.388 89.624 406.375 2.15013
4 Summ Atten 385.5 0.466 174.105 92.119 385.5 2.03968
4 Full Whole 1591.25 0.536 832.177 88.908 1591.25 1.70005
4 Full Atten 1414.37 0.561 775.37 82.839 1414.375 1.51108
Figure 1: Plen 2 - Representative Gaze Plots.
Overall, the gaze plots show many fewer instances of participants losing focus on the presentation. Gaze during summaries is primarily focused on the presentation slides, as users gain more new information, with deviations from this usually reverting back to the speaker. Also visible from the gaze plots of Video 1 (plen2), and particularly Video 4 (speechRT6), are the shorter fixations (smaller circles) for summaries than for full presentations. This can be seen more clearly by looking back at the figures reported in Table 1.
5.3 Questionnaire Evaluation of Summary Types
In the next part of our summary evaluation, we elicit questionnaire responses from participants who watched summaries generated using all available features or just a subset of the available features. Different summaries were generated using all available features, audio features only, visual features only, or audio-visual features with no keywords. From this we aimed to discover the importance of the different features for presentation summaries. A total of 48 participants watched the summaries and answered a questionnaire on each summary; each video was evaluated by 12 participants in total.
Table 2: Eye-tracking comparisons of summary (I) versus full (J) versions per video (Scheffé tests). FD.M = mean fixation duration; percent = percentage of time fixated per scene; FCp100 = fixation count per 100 seconds.
I J Variable Measure Diff Error Sig (p-value)
Video 1
summ full FD.M Scheffe -0.09 0.06 0.163
summ full percent Scheffe 2.36 1.68 0.181
summ full FCp100 Scheffe 36.73 17.12 0.050
Video 2
summ full FD.M Scheffe -0.01 0.08 0.865
summ full percent Scheffe 5.68 2.41 0.033
summ full FCp100 Scheffe 7.04 14.51 0.516
Video 3
summ full FD.M Scheffe -0.02 0.09 0.813
summ full percent Scheffe 7.53 3.63 0.057
summ full FCp100 Scheffe 11.22 15.24 0.474
Video 4
summ full FD.M Scheffe -0.11 0.06 0.080
summ full percent Scheffe 0.72 3.62 0.846
summ full FCp100 Scheffe 45.01 15.27 0.011
Figure 2: prp 1 - Representative Gaze Plots.
The order in which participants watched the summaries was alternated to avoid issues of bias in the results.
In Table 3 we show further evaluations comparing summaries built using all available features with summaries built using just a subset of features. For audio-only summaries, classification of the paralinguistic features of Speaker Ratings, Audience Engagement, Emphasis and Comprehension was performed as described in the earlier sections, but using only audio features, with visual features not being considered. Similarly, for visual-only summaries, classification of these features was performed using only visual features, with audio features not being considered. For no-keyword summaries, classification of these features was performed and the only information excluded was the keywords. For Classify summaries, all available information is used for generating summaries; for these, actual classification outputs are used rather than ground-truth human annotations of the most engaging and comprehensible parts of presentations. Results in this table reflect Likert-scale ratings of participants' level of agreement with each of the following statements (the scale is given in Table 4):
1. This summary is easy to understand.
2. This summary is informative.
3. This summary is enjoyable.
4. This summary is coherent.
5. This summary would aid me in deciding whether to watch the full video.
Figure 3: prp 5 - Representative Gaze Plots.
Figure 4: speechRT6 - Representative Gaze Plots.
From Table 3, we can see that audio-only classifications and visual-only classifications result in summaries which are rated as less easy to understand and less informative than summaries built using full information or with no keywords. Summaries built using no keywords also lack coherence, while summaries built using all available features score highly on helping users decide whether they want to see the full presentation. The purpose of the summaries built using a subset of the available features was to evaluate the effectiveness of the individual features.
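As an illustration of how per-condition scores of the kind reported in Table 3 can be computed from individual questionnaire responses, the sketch below averages hypothetical Likert ratings by video, summary type and question. The data layout and values are assumptions for illustration only.

    # Illustrative sketch: aggregate hypothetical Likert responses by video, condition and question.
    import pandas as pd

    responses = pd.DataFrame([
        # video, condition, question, rating (1-7 Likert scale of Table 4)
        ("plen2", "Classify", "Q1", 3), ("plen2", "Classify", "Q1", 2),
        ("plen2", "audio only", "Q1", 2), ("plen2", "audio only", "Q1", 3),
        ("prp5", "no keywords", "Q4", 5), ("prp5", "no keywords", "Q4", 4),
    ], columns=["video", "condition", "question", "rating"])

    table = (responses
             .groupby(["video", "condition", "question"])["rating"]
             .mean()
             .unstack("question"))
    print(table)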
The results of the eye-tracking experiments performed in this study indicate that the generated summaries tend to contain a higher concentration of relevant information than full presentations, as indicated by the higher proportion of time participants spend carefully reading slides during summaries than during full presentations, and by the lower proportion of time spent fixating on areas outside of the attention zone during summaries. This can be seen from Table 2 and Figures 1, 2, 3 and 4.
Table 3: Questionnaire results - Likert-scale ratings.
Video Condition Q1 Q2 Q3 Q4 Q5
plen2 Classify 2.625 3.75 3.125 3.4375 4.625
plen2 audio only 2.3125 3.5 2.4375 3.8125 4.8125
plen2 video only 3 4.625 2.4375 4.0625 4.75
plen2 no keywords 3.5625 4.8125 3.75 4.25 5.0625
prp1 Classify 2.875 4.3125 2.3125 3.8125 4.5
prp1 audio only 2.25 3.875 2.4375 2.9375 4.9375
prp1 video only 2.375 3.25 2 2.875 4.0625
prp1 no keywords 2.5625 3.8125 2.8125 3.875 5.0625
prp5 Classify 4.25 4.875 4.5 4.4375 4.8125
prp5 audio only 3.625 4.25 3.375 3.625 5.125
prp5 video only 4.25 4.5 4.4375 4.125 5.3125
prp5 no keywords 5.0625 5.0625 3.8125 4.5625 4.875
spRT6 Classify 2.875 4.125 2.625 3.875 5.4375
spRT6 audio only 2.8125 4.5625 2.5 4 5.0625
spRT6 video only 2 3.5 2.5 3.0625 5.125
spRT6 no keywords 2.6875 3.9375 2.4375 3.3125 4.5
Table 4: Levels of Agreement.
# Level of Agreement
1 Very Much Disagree.
2 Disagree.
3 Disagree Somewhat.
4 Neutral.
5 Agree Somewhat.
6 Agree.
7 Very Much Agree.
6 CONCLUSIONS
This paper describes our investigation into the summarisation of academic presentations using linguistic and paralinguistic features. Comprehensive evaluations of the summaries are reported, including eye-tracking, the development of summaries using subsets of the available features, and a questionnaire evaluation of these to discover the effects of the individual classification features on the final summaries.
The results of this study indicate that the classification of areas of engagement, emphasis and comprehension is useful for summarisation, although the extent of its usefulness may depend on how engaging and comprehensible the presentations were to begin with. Presentations rated as not engaging tend to see bigger improvements in the engagement levels of their summaries than presentations already rated as highly engaging.
The gaze plots show large improvements for summaries. The results show increased fixation counts with reduced fixation durations for summaries, confirming that users are more attentive for presentation summaries. This difference is more pronounced for videos not already classified as highly engaging, backing up other results showing that the summarisation process is more effective for videos which have not already been classified as most engaging. Our earlier studies (Curtis et al., 2017b) also support the new results reported in this paper.
Questionnaire results on summaries built using a subset of features show that audio-only classifications and visual-only classifications result in summaries which are rated as less easy to understand and less informative than summaries built using full information or with no keywords. Summaries built using no keywords also lack coherence. Overall, these results are very promising and demonstrate the effectiveness of this automatic summarisation strategy for academic presentations.
In future work we aim to develop a conference portal where users can view presentation summaries generated on the fly using the features described in this paper. This portal will then allow for further evaluation of the effectiveness of these features over a greater pool of participants. Future work will also evaluate the effectiveness of using more linguistic features for summarisation.
ACKNOWLEDGEMENTS
This research is supported by Science Foundation Ireland through the CNGL Programme (Grant 12/CE/I2267) in the ADAPT Centre (www.adaptcentre.ie) at Dublin City University.
HUCAPP 2018 - International Conference on Human Computer Interaction Theory and Applications
72
The authors would like to thank all participants who took part in these evaluations. We would further like to express our gratitude to all participants who took part in our previous experiments for the classification of the high-level paralinguistic features discussed in this paper.
REFERENCES
Chechik, G., Ie, E., Rehn, M., Bengio, S., and Lyon, D. (2008). Large-scale content-based audio retrieval from text queries. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, pages 105–112. ACM.

Curtis, K., Jones, G. J., and Campbell, N. (2015). Effects of good speaking techniques on audience engagement. In Proceedings of the 2015 ACM International Conference on Multimodal Interaction, pages 35–42. ACM.

Curtis, K., Jones, G. J., and Campbell, N. (2016). Speaker impact on audience comprehension for academic presentations. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 129–136. ACM.

Curtis, K., Jones, G. J., and Campbell, N. (2017a). Identification of emphasised regions in audio-visual presentations. In Proceedings of the 4th European and 7th Nordic Symposium on Multimodal Communication (MMSYM 2016), Copenhagen, 29-30 September 2016, number 141, pages 37–42. Linköping University Electronic Press.

Curtis, K., Jones, G. J., and Campbell, N. (2017b). Utilising high-level features in summarisation of academic presentations. In Proceedings of the 2017 ACM International Conference on Multimedia Retrieval, pages 315–321. ACM.

He, L., Sanocki, E., Gupta, A., and Grudin, J. (1999). Auto-summarization of audio-video presentations. In Proceedings of the Seventh ACM International Conference on Multimedia (Part 1), pages 489–498. ACM.

Huiskes, M. J., Thomee, B., and Lew, M. S. (2010). New trends and ideas in visual concept detection: the MIR Flickr retrieval evaluation initiative. In Proceedings of the International Conference on Multimedia Information Retrieval, pages 527–536. ACM.

Joho, H., Jose, J. M., Valenti, R., and Sebe, N. (2009). Exploiting facial expressions for affective video summarisation. In Proceedings of the ACM International Conference on Image and Video Retrieval, page 31. ACM.

Ju, S. X., Black, M. J., Minneman, S., and Kimber, D. (1998). Summarization of videotaped presentations: automatic analysis of motion and gesture. IEEE Transactions on Circuits and Systems for Video Technology, 8(5):686–696.

Lew, M. S., Sebe, N., Djeraba, C., and Jain, R. (2006). Content-based multimedia information retrieval: State of the art and challenges. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2(1):1–19.

Pavel, A., Reed, C., Hartmann, B., and Agrawala, M. (2014). Video digests: a browsable, skimmable format for informational lecture videos. In UIST, pages 573–582.

Rayner, K. and Sereno, S. C. (1994). Eye movements in reading: Psycholinguistic studies. Handbook of Psycholinguistics, pages 57–81.