Extrinsic Summarization Evaluation: A Decision Audit Task
Gabriel Murray1, Thomas Kleinbauer2, Peter Poller2, Steve Renals3,
Jonathan Kilgour3, and Tilman Becker2
1University of British Columbia, Vancouver, Canada
2German Research Center for Artificial Intelligence, Saarbrücken, Germany
3University of Edinburgh, Edinburgh, Scotland
Abstract. In this work we describe a large-scale extrinsic evaluation of
automatic speech summarization technologies for meeting speech. The
particular task is a decision audit, wherein a user must satisfy a complex
information need, navigating several meetings in order to gain an under-
standing of how and why a given decision was made. We compare the
usefulness of extractive and abstractive technologies in satisfying this
information need, and assess the impact of automatic speech recogni-
tion (ASR) errors on user performance. We employ several evaluation
methods for participant performance, including post-questionnaire data,
human subjective and objective judgments, and an analysis of partici-
pant browsing behaviour.
1 Introduction
In the field of automatic summarization, machine summaries are often evaluated
intrinsically, i.e., according to how well their information content matches the
information content of multiple reference summaries. A more comprehensive and
reliable evaluation of the quality of a given summary, however, is the degree to
which it aids a real-world extrinsic task: an indication not just of how informative
the summary is, but how useful it is in addressing a real information need. While
intrinsic evaluation metrics are indispensable for development purposes and can
be easily replicated, they should ideally be chosen based on whether they are
good predictors of extrinsic usefulness, i.e. whether they correlate with a
measure of real-world usefulness.
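As a hypothetical illustration of such a correlation check, suppose we have one intrinsic score and one extrinsic usefulness score per summarization system; the following sketch (all values are invented placeholders, not results from this study) computes their correlation:

# Hypothetical sketch: does an intrinsic metric predict extrinsic usefulness?
# All scores below are invented placeholders.
from scipy.stats import pearsonr

intrinsic = [0.31, 0.44, 0.52, 0.38, 0.61]  # e.g. ROUGE-style scores, one per system
extrinsic = [2.9, 3.4, 4.1, 3.0, 4.5]       # e.g. task-usefulness ratings, one per system

r, p = pearsonr(intrinsic, extrinsic)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")

A metric that correlates strongly with the extrinsic measure can then stand in for the costly user study during day-to-day development.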
We therefore design an extrinsic task that models a real-world information
need, create multiple experimental conditions and enlist subjects to participate
in the task. The chosen task is a decision audit, wherein a user must review
previously held meetings in order to determine how a given decision was reached.
This involves the user determining what the final decision was, which alternatives
had previously been proposed, and what the arguments for and against the
various proposals were. This task was chosen because it represents one of the
key applications for analyzing multimodal interactions: aiding corporate
memory, the storage and management of an organization's knowledge,
transactions, decisions, and plans. An organization may find itself in the position
of needing to review or explain how it came to a particular position or why it
took a certain course of action. We hypothesize that this task will be made much
more efficient when meetings are archived and summarized.
The decision audit represents a complex information need that cannot be sat-
isfied with a simple one-sentence answer. Relevant information will be spread
throughout several meetings and may appear at multiple points in a single dis-
cussion thread. Because the decision audit does not only involve knowing what
decision was made but also determining why the decision was made, the per-
son conducting the audit will need to understand the evolution of the meeting
participants’ thinking and the range of factors that led to the ultimate decision.
Because the person conducting the decision audit does not know which meetings
are relevant to the given topic, there is an inherent relevance assessment task
built into this overall task. As time is limited, they cannot hope to scan the
meetings in their entirety and so must focus on which meetings and meeting
sections seem most promising.
2 Related Extrinsic Evaluation Work
This section describes previous extrinsic evaluations relating either to summa-
rization or to the browsing of multi-party interactions. We then describe how
our decision audit browsers fit into a typology of multi-media interfaces.
2.1 Previous Work
In the field of text summarization, a commonly used extrinsic evaluation has
been the relevance assessment task. In such a task, a user is presented with a
description of a topic or event and then must decide whether a given document
(e.g. a summary or a full-text) is relevant to that topic or event. Such schemes
have been used for a number of years and on a variety of projects [2, 3, 4]. Due
to problems of low inter-annotator agreement on such ratings, Dorr et al.
proposed a new evaluation scheme that compares the relevance judgment of an
annotator given a full text with that same annotator given a condensed text.
Another type of extrinsic evaluation for summarization is the reading compre-
hension task [1, 6, 7]. In such an evaluation, a user is given either a full source
or a summary text and is then given a multiple-choice test relating to the full
source information. One can then measure how well users perform on the test
under each condition. This evaluation framework relies on the idea that truly
informative summaries should be able to act as substitutes for the full source.
In the speech domain, there have been several large extrinsic IR evaluations in
the past few years, though not necessarily designed with summarization in mind.
Wellner et al. introduced the Browser Evaluation Test (BET), in which ob-
servations of interest are collected for each meeting, e.g. the observation “Susan
says the footstool is expensive.” Each observation is presented as both a positive
and negative statement and the user must decide which statement is correct by
browsing the meetings and finding the correct answer. It is clear that such a set-
up could be used to evaluate summaries and to compare summaries with other
information sources. We choose not to use this evaluation paradigm, however,
because the observations of interest tend to be skewed towards a keyword search
approach, where it would always be simpler just to search for a word such as
“footstool” rather than read a summary.
The Task-Based Evaluation (TBE) evaluates multiple browser conditions
containing various information sources relating to a series of meetings. Partici-
pants are brought in four at a time and are told that they are replacing a previous
group and must finish that group’s work. In essence, the evaluation involves re-
running the final meetings of the series with new participants. The participants
are given information related to the previous group’s initial meetings and must
finalize the previous group’s decisions as best as possible given what they know.
There are several reasons we have chosen not to use the TBE for this summa-
rization evaluation. One is that the TBE relies primarily on post-questionnaire
answers for evaluation. While we do incorporate post-questionnaires in our eval-
uation, we are also very interested in objective participant performance and
browsing behaviour during the task. Another is that the TBE is more costly
to run than our decision audit task, as it requires having groups of four people
spend an afternoon reviewing previous meetings and conducting their own meet-
ings, which are also recorded, whereas the decision audit is an individual task.
The SCANMail browser [10, 11] is an interface for managing and browsing
voicemail messages, with multi-media components such as audio, ASR tran-
scripts, audio-based paragraphs, and extracted names and phone numbers. To
evaluate the browser and its components, the authors compared the SCANMail
browser to a state-of-the-art voicemail system on four key tasks: scanning and
searching messages, extracting information from messages, tracking the status
of messages (e.g. whether or not a message has been dealt with), and archiving
messages. Both in a think-aloud laboratory study and a larger field study, users
found the SCANMail system outperformed the comparison system for these ex-
trinsic tasks. The field study in particular yielded several interesting findings. In
24% of the cases in which users viewed a voicemail transcript with the SCANMail
system, they did not play the audio at all. This suggests that the transcript
and extracted information can, to some degree, act as substitutes for the audio
signal, a conclusion also supported by user comments. On occasions when users did play
the audio, 57% of the time they did not play the entire audio. Most interestingly,
57% of the audio play operations resulted from clicking within the transcript.
The study also found that users were able to understand the transcripts even
with recognition errors, partly by having prior context for many of the messages.
Whittaker et al. described a task-oriented evaluation of a browser for
navigating meeting interactions. The browser contains a manual transcript, a
visualization of speaker activity, audio and video streams with play, pause and
stop commands, and artefacts such as slides and whiteboard events (the slides,
but not the whiteboard events, are indices into the meeting record). Users
were given two sets of questions to answer, the first set consisting of general
“gist” questions about the meeting and the second set consisting of questions
about specific facts within the meeting. There were 10 questions in total to be
answered. User responses were subsequently scored on correctness compared with
model answers. While general performance was not high, users found it much
easier to answer specific questions than “gist” questions using this browser setup.
This has special relevance for our work, as certain types of information needs
might be easily satisfied without recourse to derived data such as summaries or
topic segments, but getting the general gist of the meeting seems to be much
more difficult. Very interestingly, users often felt that they had performed much
better than they actually had. Specifically, users seemed to be unaware that
they had missed relevant or vital information and felt that they had provided
comprehensive answers. Across the board, participants focused on reading the
transcript rather than beginning with the audio and video records directly.
2.2 Multimodal Browser Types
Tucker and Whittaker provided an overview of the mechanisms available
for browsing multimodal meetings. They established a four-way browser clas-
sification: audio-based browsers, video-based browsers, artefact-based browsers,
and derived data browsers. In light of this classification scheme, our decision
audit browsers are video browsers incorporating derived data forms. Although
other incarnations of our browsers contain meeting artefacts such as slides, we
simplify the browsers as much as possible for this task by putting the focus
on derived data forms and their usefulness for browsing the meeting records.
Each version of the experimental browser is built using JFerret, an easily
modifiable multi-media browser framework.
3 Task Overview
The experiment consists of five different conditions, described below. We re-
cruited 10 subjects per condition for a total of 50 subjects, all native speakers of
English. For each condition, 6 participants were run in Edinburgh and 4 were run
at Saarbrücken, the experimental setups for the two locations being as identical
as possible.
As our underlying data we chose four meetings from the AMI Meeting Corpus.
The meeting series ES2008 was selected because the participant group in
that series worked well together on the task of designing a new remote control.
The group took the task seriously and exhibited deliberate and careful decision-
making processes in each meeting and across the meeting series as a whole.
The basic task for the participants was to write a summary of the group's
decision-making process in the meetings for separating frequently and rarely
used functions of the remote control. This particular information need was
chosen because the relevant discussion manifested itself throughout all four
meetings, and the group went through several possibilities before designing an
eventual solution to this
portion of the design problem. A participant in the decision audit task therefore
would have to consult each meeting to be able to retrieve the full answer to the
task’s information need.
Each participant in our task was first given general instructions explaining
the meeting browser used in the experiment, the specific information need they
were meant to satisfy in the task, and a notice of the allotted time, 45 minutes,
which included both searching for the information and writing up the answer.
This amount of time was based on the result of an individual pilot run of
Condition EAM (see Section 3.1). After reading the task instructions, each
participant was briefly shown how to use the browser's various functions for
navigating and writing in the given experimental condition. They were then
given several minutes to familiarize themselves with the browser using
unrelated meeting data, until they stated that they were comfortable and ready
to proceed.
3.1 Experimental Conditions
There are five conditions run in total: one baseline condition, two extractive
conditions and two abstractive conditions, all of which come with audio/video
recordings and either a manual or automatic meeting transcript. Table 1 lists the
experimental conditions. The three-letter ID for each condition encodes the
summary type (keywords/extracts/abstracts), the summarization method
(automatic/semi-automatic/manual), and the transcript type (automatic/manual).
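As a small illustrative sketch (not part of the study materials), the naming scheme can be decoded mechanically:

# Illustrative decoder for the three-letter condition IDs described above.
SUMMARY_TYPE = {"K": "keywords", "E": "extractive", "A": "abstractive"}
METHOD = {"A": "automatic", "S": "semi-automatic", "M": "manual"}
TRANSCRIPT = {"A": "ASR (automatic)", "M": "manual"}

def decode(condition_id: str) -> str:
    s, m, t = condition_id
    return (f"{SUMMARY_TYPE[s]} condition, {METHOD[m]} method, "
            f"{TRANSCRIPT[t]} transcripts")

print(decode("EAA"))  # extractive condition, automatic method, ASR (automatic) transcripts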
Table 1. Experimental Conditions

ID   Description
KAM  Top 20 keywords (manual transcripts)
EAM  Extractive summary of manual transcripts
EAA  Extractive summary of ASR transcripts
AMM  Human abstractive summary of manual transcripts

The baseline condition, Condition KAM, consists of a browser with manual
transcripts and a list of the top 20 keywords in the meeting. The keywords are
determined automatically using su.idf. Though this is a baseline condition, the
fact that it utilizes manual transcripts gives users in this condition
a possible advantage over users in conditions with ASR. In this respect, it is a
challenging baseline. There are other possibilities for the baseline, but we choose
the top 20 keywords because we are interested in comparing different forms of
derived content from meetings, and because a facility such as keyword search
would likely be problematic for a participant who is uncertain of what to search
for because they are unfamiliar with the meetings.
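The exact su.idf formula, which additionally exploits speaker usage statistics, is given in the cited work; as a rough, hypothetical stand-in, a plain tf.idf-style keyword ranker of the same flavour might look as follows:

# Rough tf.idf keyword ranker, a simplified stand-in for su.idf (which also
# weights terms by how unevenly speakers use them). Hypothetical sketch only.
import math
from collections import Counter

def top_keywords(meeting_tokens, background_docs, n=20):
    """Rank terms in one meeting against a background document collection."""
    tf = Counter(meeting_tokens)
    doc_sets = [set(doc) for doc in background_docs]
    scores = {}
    for term, freq in tf.items():
        df = sum(1 for doc in doc_sets if term in doc)  # document frequency
        idf = math.log((1 + len(doc_sets)) / (1 + df))
        scores[term] = freq * idf
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:n]]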
Condition AMM is the gold-standard condition, a human-authored abstrac-
tive summary. Each summary is divided into subsections: abstract, actions,
decisions and problems. Because of the distinct “decisions” subsection, this is
considered a challenging gold-standard to match for a decision audit task.
Conditions EAM and EAA present the user with an extractive summary of
each meeting, with the difference between the conditions being that the latter
is based on ASR and the former on manual transcripts. Condition EAA is the
only experimental condition using ASR output. These summaries were gener-
ated by training a support vector machine (SVM) with an RBF kernel on the
AMI training data, using 17 features from five broad feature classes: prosodic,
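As a rough sketch of such a training setup (the feature matrix, labels, and data handling below are hypothetical placeholders rather than the actual AMI pipeline):

# Hypothetical sketch of training an RBF-kernel SVM to select extract-worthy
# dialogue acts, in the spirit of the setup described above. The feature
# matrix and labels here are random placeholders, not AMI data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 17))    # 17 features per dialogue act
y_train = rng.integers(0, 2, size=1000)  # 1 = in the gold extractive summary

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
clf.fit(X_train, y_train)

# Rank unseen dialogue acts by extractive confidence and keep the top ones.
X_new = rng.normal(size=(200, 17))
confidence = clf.predict_proba(X_new)[:, 1]
selected = np.argsort(confidence)[::-1][:20]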