Methodologies for the evaluation of Speaker Diarization and Automatic
Speech Recognition in the presence of overlapping speech
Olivier Galibert
Laboratoire national de métrologie et d'essais, Trappes, France
Olivier.Galibert@lne.fr
Abstract
Speaker Diarization and Automatic Speech Recognition have been topics of research for decades, and evaluating the developed systems has been required for almost as long. Following the NIST initiatives, a number of metrics have become standard for these evaluations, namely the Diarization Error Rate and the Word Error Rate.
The initial definitions of these metrics and, more impor-
tantly, their implementations, were designed for single-speaker
speech. One of the aims of the OSEO Quaero and the ANR
ETAPE projects was to investigate the capabilities of Diariza-
tion and ASR systems in the presence of overlapping speech.
Evaluating these systems required extending the metric definitions and adapting the algorithmic approaches underlying their implementation. This paper presents these extensions and adaptations and the open tools that provide them.
Index Terms: evaluation metrics, speaker diarization, auto-
matic speech recognition, overlapping speech, robustness
1. Introduction
Two useful technologies for the handling of speech are Diarization systems, which group together speech segments from the same speaker, and Transcription systems, which transcribe what is said. Evaluating the output of such systems is helpful not only to compare the quality of systems but also to assess whether a particular approach yields better results. Evaluation has therefore always been one of the tools required to develop better systems.
The NIST identified that need for evaluation long ago and has organized numerous evaluations of these technologies, among others. In doing so it established what have become the standard metrics of these domains, the Word Error Rate in 1987 [1] and the Diarization Error Rate in 2000 [2]. These metrics have their limits, and alternatives are proposed from time to time, but system developers always tend to come back to these tried and true ones.
Within the Quaero [3] and Etape [4] projects we decided to evaluate diarization and transcription systems in somewhat more difficult conditions. In particular, two specific aspects impacted the evaluation process:
•Cross-show conditions, where speakers have to be de-
tected when recurring in different shows
•Overlapping speech, where multiple speakers speak si-
multaneously and must be both detected and transcribed
The existing NIST tools were reaching their limits under these conditions, and we had to both generalize the metric definitions and rethink the basis of their application to make them usable in the new context while staying comparable with previous results.
This paper then describes both the generalization of the high-level metrics used in these two areas and the algorithmic underpinnings of their application in real tools.
2. Speaker Diarization
2.1. The task
Speaker Diarization, also called "who spoke when", is the pro-
cess of identifying, for each speaker of an input audio recording,
all the regions where he/she is talking. Each temporal region
containing speech should be labeled with at least one speaker-
tag and segments from the same speaker shall be labeled with
the same tag. Speaker tags are not identities but abstract labels.
As such, diarization technology is not a final applicative system,
but is instead a first step towards full speaker identification or,
alternatively, the segmentation and clustering step preceding
an automatic speech recognizer.
2.2. Diarization Error Rate
The main metric for diarization performance measurement is
the Diarization Error Rate. It was introduced by the NIST in 2000 within the Speaker Recognition Evaluation [2] for their
then-new speaker segmentation task. The metric is computed in
two steps: the first step is to establish a mapping between the
speaker tags provided by the system and the speaker identities
found in the reference. The second step then computes the error
rate using that mapping.
Computing an error rate requires defining what the errors
can be. Three error types are defined in the diarization context:
•The confusion error, when the system-provided speaker
tag and the reference do not match through the mapping.
•The miss error, when speech is present in the reference
but no speaker is present in the hypothesis.
•The false alarm error, when speech has incorrectly been detected by the system.
These errors happen on segments of speech. The durations of the segments in error are summed together, giving us a time in error. Note that the miss and false alarm cases are slightly more complex when overlapping speech happens: false alarm (respectively miss) time is counted when more (respectively fewer) speakers are present in the hypothesis than in the reference in a given time interval; the difference is not all counted as confusion.
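As an illustration of that per-interval accounting, a minimal sketch follows. It is not the official scoring code: it assumes the sets of active speakers are constant over the interval, that the number of correctly paired speakers has already been determined, and it ignores the boundary tolerance discussed below.

```python
# Minimal sketch, not the official tool: error times for one interval during
# which the active speakers do not change. `n_matched` is the number of
# reference speakers whose mapped hypothesis speaker is also active here.
def interval_errors(duration, n_ref, n_hyp, n_matched):
    miss = max(n_ref - n_hyp, 0) * duration
    false_alarm = max(n_hyp - n_ref, 0) * duration
    # Pairable speakers that are not correctly paired count as confusion.
    confusion = (min(n_ref, n_hyp) - n_matched) * duration
    return miss, false_alarm, confusion
```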
To get an error rate that can be compared between evalua-
tions with different speech durations, the time in error must be
divided by a normalization value. Multiple methods can define
that normalization value; the NIST-proposed one follows two
principles:
•The normalization value should not depend on the sys-
tem hypothesis, but only on the reference.
•An empty hypothesis should have an error rate of 1.
The first principle is useful to avoid weird system tuning
effects where increasing the divider can become more important than reducing the time in error. The second is an easy way to
ensure a decent comparability between results. In our case with
an empty hypothesis all the reference speech time ends up in the
miss category, with the other two empty. That gives us the final
DER definition as:
\[ \mathrm{DER} = \frac{\text{confusion} + \text{miss} + \text{false alarm}}{\text{total reference speech time}} \tag{1} \]
That equation leaves open the question of the mapping. Of
all the possible mappings, the one chosen is the one giving the lowest error rate.
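A minimal sketch of that final computation, assuming the error times have already been accumulated in seconds (the function name is ours, not the toolbox's):

```python
# Minimal sketch: DER as defined in Equation (1), from accumulated error times.
def der(confusion, miss, false_alarm, total_ref_speech):
    return (confusion + miss + false_alarm) / total_ref_speech

# An empty hypothesis puts all reference speech in miss and none in the other
# two categories, which by construction yields an error rate of 1.
assert der(0.0, 120.0, 0.0, 120.0) == 1.0
```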
A last point in building that metric is to take into account the
intrinsic human imprecision. It is very hard to define a precise
point in time when speech starts or stops, especially when back-
ground noise or overlapping speech is present. So some flexibil-
ity must be given to the reference speech boundaries. The NIST
approach to cope with that problem is to remove from scoring an
interval of +/- 250ms around every reference speech boundary.
While this is a good solution for broadcast news type speech
where speech turns are long, we found out that on more active
debate-type shows, where overlapped speech happens naturally,
up to 40% of the speech was removed from the evaluation. So
we had to use a different method. We simply decided that, for two speakers mapped together, there can be no miss or false alarm errors in a +/- 250ms interval around the reference speaker's speech boundaries. That method has two advantages:
•The tolerance only happens for the reference speaker
that starts or stops speaking and its associated hypoth-
esis speaker and does not influence other speakers that
are not in such a transition at the same time.
•When a reference speaker is not mapped, all of their speech time is counted as missed, avoiding surprising results for an empty hypothesis.
2.3. Algorithmic approach to the computation of the metric
The NIST tools available to compute the DER metric use a heuristic to build the best possible mapping. They work quite
well for broadcast news-type situations but sometimes fail when
a lot of overlapping speech happens. They also often fail
in cross-show conditions, where identical speakers have to be
identified when they are present in different audio files. In the
even more extreme cases of video diarization the equivalent of
overlapping speech is having multiple persons on screen. That
situation happens quite often in TV shows, and the tools just
end up running in a seemingly infinite loop. So a more robust
approach was needed.
The problem is to find a mapping that minimizes the to-
tal error. The error is made up of the miss error, the false alarm
error and the confusion error. The amount of miss error and
false alarm error is almost independent of the mapping. One
just has to find the intervals when the number of speakers in
the reference and in the hypothesis are different and add that
interval’s duration for every extra speaker. The “almost” stems
from the fuzzy frontier handling. Having two speakers associ-
ated can slightly decrease the amount in error. Similarly, the
confusion time is easy to compute when no speakers are associated between reference and hypothesis: for every interval, time in confusion is added for every potential speaker association,
or, in mathematical terms, the interval time multiplied by the
minimum of the number of speakers in the reference and in the
hypothesis. When two speakers are associated the time in con-
fusion is reduced by the amount of time in which the reference
and hypothesis speakers are both active.
So the total error can be split into a base error which is in-
dependent of the mapping and is computed when no speakers
are mapped together, and a delta error that reduces it thanks to
the mapping. Optimizing the error rate through the mapping
is equivalent to maximizing the delta error. Said delta has an
important property: it can be decomposed as a sum of individ-
ual, per-association values. If one takes a given mapping with
a number of unassociated speakers, and two more speakers are
associated together, the delta error is increased by:
•The amount of time in common between the speakers,
which is a confusion reduction.
•The amount of time when only one of the speakers
is present in the 250ms intervals around the reference
speaker speech frontiers, which is a miss or false alarm
reduction.
Neither of these two values is dependent on the rest of the mapping.
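A minimal sketch of that per-pair value, assuming each speaker is represented as a list of (start, end) segments in seconds and omitting the boundary-tolerance term for brevity:

```python
# Minimal sketch (not the paper's actual code): delta error for associating one
# reference speaker with one hypothesis speaker, boundary tolerance omitted.
def overlap(segs_a, segs_b):
    """Total time during which both speakers are active."""
    return sum(max(0.0, min(a_end, b_end) - max(a_start, b_start))
               for a_start, a_end in segs_a
               for b_start, b_end in segs_b)

def delta_error(ref_segments, hyp_segments):
    # Time in common between the two speakers: a confusion reduction.
    return overlap(ref_segments, hyp_segments)
```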
We can then reformulate the mapping problem as first computing the individual per-association delta errors, then finding a set of associations that maximizes their sum. Defining:
•$R$ and $H$ as the sets of speakers of the reference and the hypothesis.
•$\Delta E_{r,h},\ r \in R,\ h \in H$ as the delta error for the association between speaker $r$ of the reference and $h$ of the hypothesis.
•$A = \{(r_1, h_1), \ldots, (r_n, h_n)\}$ a mapping, such that $\forall (r, h) \in A,\ \forall (r', h') \in A,\ r = r' \Rightarrow h = h'$.
•$M$ the set of all possible mappings.
We are trying to compute:
\[ \text{mapping} = \operatorname*{argmax}_{A \in M} \sum_{(r,h) \in A} \Delta E_{r,h} \tag{2} \]
That computation is a good example of an Assignment Problem, which can be solved deterministically in $O(n^3)$ time with the Hungarian Algorithm. It is interesting to note that the NIST has identified that algorithm as appropriate for the mapping estimation problem but does not seem to have actually used it in its tools.
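Assuming the per-pair delta errors have already been computed (for instance with a function like delta_error above), the optimal mapping of Equation (2) can be obtained with an off-the-shelf assignment solver; the sketch below uses scipy's linear_sum_assignment, which solves the same problem the Hungarian Algorithm does.

```python
# Minimal sketch: solve Equation (2) as an assignment problem.
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_mapping(ref_speakers, hyp_speakers, delta):
    """delta[(r, h)] is the delta error gained by associating r with h."""
    gain = np.zeros((len(ref_speakers), len(hyp_speakers)))
    for i, r in enumerate(ref_speakers):
        for j, h in enumerate(hyp_speakers):
            gain[i, j] = delta.get((r, h), 0.0)
    rows, cols = linear_sum_assignment(gain, maximize=True)
    # Keep only associations that actually reduce the total error.
    return {ref_speakers[i]: hyp_speakers[j]
            for i, j in zip(rows, cols) if gain[i, j] > 0.0}
```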
Once the mapping is established, computing the final score poses no particular difficulties.
2.4. Evaluation experiments
We applied this evaluation methodology in a number of evalua-
tions, namely within Quaero, Etape and the ANR/DGA project
Repere [5] on the full identification of speakers in TV shows. A
very recent evaluation in that last project gives the best systems
a performance of 11% DER in single-show diarization and 14%
DER in cross-show diarization on a set of 3 hours of TV show extracts. These shows include debates with overlapping speech and, in addition, a mix of recurring and non-recurring speakers between shows. As a comparison, not clustering speakers between the shows gives a cross-show DER in the high 50% range.
These figures indicate that the systems are quite efficient
in their tasks, including the cross-show one. This is also con-
firmed in practice, with the diarization being one of the main
information sources for the Repere global task of fully identify-
ing speakers in multiple shows.
2.5. Conclusion
The diarization error rate is still a pertinent metric to measure the
quality of a diarization system in more complex setups includ-
ing:
•Cross-show diarization, when recurring speakers in mul-
tiple shows have to be recognized.
•Overlapping speech.
The main difficulty of implementing that metric is establishing the speaker mapping between reference speaker IDs and hypothesis speaker tags. Reformulating the problem as an assignment problem allows for a deterministic, time-bounded solution with guaranteed optimality. In addition, the narrower interpretation of the error tolerance at reference speech frontiers makes the metric usable in these new contexts, where too much speech time would otherwise be excluded from evaluation.
3. Automatic Speech Recognition
3.1. Word Error Rate
The Word Error Rate (WER) is kept as the primary metric for these evaluations. That metric basically counts the num-
ber of word deletions, insertions and substitutions in the output
of the automatic transcription system compared to a reference
transcription produced by humans.
More precisely, the word error rate can be computed as
shown in Equation 3:
\[ \mathrm{WER} = \frac{S + D + I}{N} \tag{3} \]
where:
•S is the number of substitutions,
•D is the number of deletions,
•I is the number of insertions,
•N is the number of words in the reference transcription.
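For instance, with hypothetical counts of S = 3 substitutions, D = 2 deletions and I = 1 insertion against a reference of N = 20 words, the WER would be (3 + 2 + 1)/20 = 30%.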
While simple in theory, applying this metric has some sub-
tleties, related to ambiguities in the language. In languages
where not all letters are pronounced, entire text spans can have multiple valid interpretations and hence multiple written forms (in French, for instance, this sometimes happens with genders and/or plurals). Also, some words or expressions can have multiple accepted spellings, like the English contractions. As a result, the comparison is in practice not between two sentences but between two directed acyclic graphs of words.
3.2. Word Error Rate computation
Computing the WER necessitates producing the best alignment
between the reference and hypothesis graphs where two paths
are found in the graphs and most of the words on these paths
are associated together. The associated words are either cor-
rect, where they’re identical, or substitutions where they’re not,
and the unassociated words are insertions on the hypothesis
side and deletions on the reference side. The best alignment
is defined as being the one with the lowest number of substitu-
tions+insertions+deletions, and, in case of equality, the highest
number of words in the reference graph path. Multiple alignments are almost always possible, but all those following the definition of optimality give the same final score, so choosing one of them is enough for scoring purposes, even if it may not be optimal from a diagnostic point of view.
Establishing the best alignment is a long-solved problem using dynamic programming. Both graphs are sorted in topological order and a 2D grid of scores is built using the flattened graphs as the X and Y axes, with the words placed on the transitions from one line or column to the next. Scores are computed incremen-
tally in every slot of the grid by trying multiple possible steps:
•Advance in both reference and hypothesis, giving a cor-
rect or substitution.
•Advance in the reference graph only giving a deletion.
•Advance in the hypothesis graph only, giving an inser-
tion.
The best step, i.e. the one giving the smallest number of errors or, in case of equality, the largest number of reference words, is
chosen at each grid slot. When multiple branches reach the
same slot their scores are compared to only keep the best one.
The best score, and by backtracking the best alignment, is then
present in the last slot.
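A minimal sketch of that dynamic programming, assuming plain word sequences rather than the word graphs actually used by the tools, and keeping only the error count rather than the full backtracked alignment:

```python
# Minimal sketch: edit-distance style alignment over word sequences.
def count_errors(ref, hyp):
    n, m = len(ref), len(hyp)
    # cost[i][j]: minimal errors aligning the first i reference words
    # with the first j hypothesis words.
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i                              # all deletions
    for j in range(1, m + 1):
        cost[0][j] = j                              # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,   # correct / substitution
                             cost[i - 1][j] + 1,         # deletion
                             cost[i][j - 1] + 1)         # insertion
    return cost[n][m]

# Hypothetical usage: one substitution and one deletion against 4 reference words.
print(count_errors("the cat sat down".split(), "the hat sat".split()) / 4)  # 0.5
```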
3.3. Speaker-attributed word error rate
The Word Error Rate computation we just presented requires
hypothesis words to be attributed to individual speaker turns.
When overlapping speech is not present it is naturally done by
associating hypothesis words to the speaker turn in which they
temporally fall. Words that do not fall in a speaker turn are
insertions. But that is, of course, not directly possible in the case of overlapping speech since, by definition, multiple speaker turns may exist at the same point in time.
The most immediate way to solve this issue is to ask the ASR systems to provide a speaker tag, per the diarization definition, with every word. A diarization-type mapping between speakers of the reference and speaker tags of the hypothesis is built using the overlapping-speech compatible methodology presented in Section 2. Once this mapping is established, hypothesis words are only attributed to speaker turns whose speaker matches, removing any ambiguity. A standard WER computation can then be done.
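A minimal sketch of that attribution step, using hypothetical data structures (the real tools work on richer annotation formats): hypothesis words carry a time span and a speaker tag, reference turns carry a time span, a speaker and the reference words, and `mapping` is the Section 2 speaker mapping.

```python
# Minimal sketch: attribute hypothesis words to reference turns using the
# diarization-style speaker mapping; each resulting pair can then be scored
# with a standard WER alignment such as count_errors above.
def attribute_words(ref_turns, hyp_words, mapping):
    """ref_turns: list of (start, end, speaker, ref_word_list)
       hyp_words: list of (start, end, word, speaker_tag)"""
    pairs = []
    for start, end, speaker, ref_word_list in ref_turns:
        tag = mapping.get(speaker)
        turn_hyp = [w for ws, we, w, t in hyp_words
                    if t == tag and start <= (ws + we) / 2 < end]
        pairs.append((ref_word_list, turn_hyp))
    return pairs
```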
That approach works but has two problems. The first is ask-
ing the systems to provide speaker tags. While such a request
makes perfect sense from an applicative point of view, espe-
cially when coupled with a speaker identification system, not all
laboratories have an integrated enough system to be able to pro-
vide them without massive refactoring. The second point, more
problematic, is that a speaker error is handled in an extremely
harsh manner: an otherwise correct word becomes a deletion
in the correct speaker’s turn and an insertion in the wrong one.
And this happens even in the parts where no overlapping speech is present. As a result, the importance given to the quality of the diarization probably becomes too high compared to the quality of the ASR.
3.4. Optimally speaker-attributed word error rate
To make the speaker attribution less of an issue an alternative
approach is to have the evaluation tool make the attribution it-
self. An additional dimension is added to the alignment cre-
ation, choosing which speaker the word should be associated to.
The approach stays the same: of all possible alignments, includ-
ing speaker associations, the one giving the best score is cho-
sen. Thankfully, the dynamic programming methodology can
be generalized to make the attribution sub-problem tractable.
Instead of having one hypothesis graph and one reference
graph we then have one hypothesis graph and multiple refer-
ence graphs, one for each speaker present in the time interval
evaluated. The two-dimensional grid is generalized to an n-dimensional box, where n is the number of reference speakers plus one. The possible steps are then generalized:
•Advance in one of the references and the hypothesis, giv-
ing a correct or substitution, and assign the hypothesis
word to the chosen reference speaker. This is only possi-
ble if the previously assigned word to that speaker does
not overlap the current one.
•Advance in one of the reference graphs, giving a dele-
tion.
•Advance in the hypothesis graph, giving an insertion.
In addition, the same branch folding happens when multiple branches end up in the same slot. Backtracking gives both the
alignment and the word attribution. Insertions are not assigned
to a specific reference speaker.
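A minimal sketch of that generalized search, assuming plain word sequences (one per reference speaker) instead of word graphs and ignoring the time-overlap constraint on consecutive words attributed to the same speaker; it returns only the minimal error count, whereas the real tool also backtracks the attribution:

```python
# Minimal sketch: optimal speaker attribution via memoized search over one
# position per reference speaker plus one position in the hypothesis.
from functools import lru_cache

def multi_ref_errors(refs, hyp):
    refs = tuple(tuple(r) for r in refs)
    hyp = tuple(hyp)

    @lru_cache(maxsize=None)
    def best(ref_pos, hyp_pos):
        if hyp_pos == len(hyp) and all(p == len(r) for p, r in zip(ref_pos, refs)):
            return 0
        candidates = []
        if hyp_pos < len(hyp):
            # Advance in the hypothesis only: insertion.
            candidates.append(1 + best(ref_pos, hyp_pos + 1))
        for k, r in enumerate(refs):
            if ref_pos[k] < len(r):
                advanced = ref_pos[:k] + (ref_pos[k] + 1,) + ref_pos[k + 1:]
                # Advance in reference k only: deletion.
                candidates.append(1 + best(advanced, hyp_pos))
                if hyp_pos < len(hyp):
                    # Advance in reference k and the hypothesis: correct or
                    # substitution, attributing the word to speaker k.
                    sub = 0 if r[ref_pos[k]] == hyp[hyp_pos] else 1
                    candidates.append(sub + best(advanced, hyp_pos + 1))
        return min(candidates)

    return best(tuple(0 for _ in refs), 0)
```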
3.5. Speaker-attributed with confusion
Optimal attribution works well, but explicit speaker tags are
interesting from an applicative point of view. So an evalua-
tion methodology taking these tags into account but less harsh
than directly separating the words into independent turns is useful. A cross of the two previous evaluation methodologies, pertinent for that case, can be built. First the diarization-
inspired mapping is built as for the speaker-attributed evalua-
tion. Then the optimally speaker-attributed evaluation is done
with an added error subtype to the correct/substitution case,
wrong speaker. That is, the impact on the score of advancing
on both one reference and the hypothesis includes a speaker
comparison component, where a cost is attributed to having the
wrong speaker. The value of said cost can be fixed depending
on how important the correct speaker attribution is for the application. The rest of the methodology works identically.
3.6. Evaluation experiments
These metrics were tried within the Etape evaluation and, while the results are not yet official, and hence not publishable, it could be seen that for a system at 20% WER on non-overlapping speech
the optimized, non-speaker attributed error rate was at 30-35%
which, in practice, corresponds to a roughly 100% error rate on
the overlapping speech parts.
The speaker-attributed variant in the same conditions is extremely harsh, with a 50%+ error rate, showing the damage a wrong speaker cluster can do to the non-overlapping speech.
Adding the confusion aspect dropped the error rate down to
around 40-45%, pointing in the same direction.
3.7. Conclusion
The Word Error Rate, with all its flaws, is still the main met-
ric used to assess the quality of automatic transcription sys-
tems. Applying it to overlapping speech requires first general-
izing the metric, with a possibility of extending the task to com-
bine diarization and transcription, and generalizing the align-
ment methodology to take the new information into account.
The last presented variant, the speaker-attributed word error rate with confusion estimation, seems a priori the most appropriate in an extended task setup. The optimal-attribution one is the only one usable for a traditional, non-speaker-attributed
system output. In any case systems will need to be much bet-
ter on overlapping speech sections before the pertinence of the
metrics can be really evaluated.
4. Global conclusions
We proposed generalizations of the usual metrics of diarization
and automatic speech recognition to an extended context, cross-
show speaker recognition and overlapping speech. More importantly, we described the implementation methodologies with their algorithmic underpinnings to ensure a better reproducibility of the measurement results obtained.
The associated evaluation tools will be provided as part of
the Etape corpus/evaluation package to be distributed by ELDA,
and should also soon be available under the GPL as a part of a
LNE-created toolbox of evaluation tools.
5. Acknowledgments
This work was partially funded by a combination of:
•the Quaero Programme, funded by OSEO, French State
agency for innovation
•the ANR Etape project
•the ANR/DGA Repere project
6. References
[1] D. S. Pallett, "Performance Assessment of Automatic Speech Recognizers," Journal of Research of the National Bureau of Standards, vol. 90, no. 5, pp. 371-387, Sept.-Oct. 1985.
[2] NIST, "2000 Speaker Recognition Evaluation - Evaluation Plan," 2000. [Online]. Available: http://www.itl.nist.gov/iad/mig/tests/sre/2000/spk-2000-plan-v1.0.htm
[3] Q. Yang, Q. Jin, and T. Schultz, “Investigation of cross-show
speaker diarization,” in INTERSPEECH’11, 2011, pp. 2925–2928.
[4] G. Gravier, G. Adda, N. Paulsson, M. Carré, A. Giraudel, and O. Galibert, "The ETAPE corpus for the evaluation of speech-based TV content processing in the French language," in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), N. Calzolari (Conference Chair), K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, J. Odijk, and S. Piperidis, Eds. Istanbul, Turkey: European Language Resources Association (ELRA), May 2012.
[5] J. Kahn, O. Galibert, M. Carré, A. Giraudel, P. Joly, and L. Quintard, "The REPERE Challenge: finding people in a multimodal context," in Odyssey: The Speaker and Language Recognition Workshop, Singapore, June 2013.