Methodologies for the evaluation of Speaker Diarization and Automatic Speech Recognition in the presence of overlapping speech

Olivier Galibert
Laboratoire national de métrologie et d'essais, Trappes, France
Olivier.Galibert@lne.fr
Abstract
Speaker Diarization and Automatic Speech Recognition have been a topic of research for decades. Evaluating the developed systems has been required for almost as long. Following the NIST initiatives a number of metrics have become standard to handle these evaluations, namely the Diarization Error Rate and the Word Error Rate.
The initial definitions of these metrics and, more importantly, their implementations, were designed for single-speaker speech. One of the aims of the OSEO Quaero and the ANR ETAPE projects was to investigate the capabilities of Diarization and ASR systems in the presence of overlapping speech.
Evaluating said systems required extending the metrics definitions and adapting the algorithmic approaches required for their implementation. This paper presents these extensions and adaptations and the open tools that provide them.
Index Terms: evaluation metrics, speaker diarization, automatic speech recognition, overlapping speech, robustness
1. Introduction
Two useful technologies in the handling of speech are Diarization systems, where speech segments of the same speaker have to be grouped together, and Transcription systems, where what is said has to be transcribed. Evaluating the output of such systems is helpful not only to compare the quality of systems but also to assess whether a particular approach yields better results. As a result, evaluation has always been one of the tools required to develop better systems.
NIST long ago identified that need for evaluation and has organized numerous evaluations of these technologies, among others. In doing so it established what have become the standard metrics of these domains, the Word Error Rate in 1987 [1] and the Diarization Error Rate in 2000 [2]. These metrics have their limits and alternatives are proposed from time to time, but system developers always tend to come back to these tried and true ones.
Within the Quaero [3] and Etape [4] projects we decided to evaluate diarization and transcription systems under somewhat more difficult conditions. In particular, two specifics impacted the evaluation process:
• Cross-show conditions, where speakers have to be detected when recurring in different shows;
• Overlapping speech, where multiple speakers speak simultaneously and must be both detected and transcribed.
The existing NIST tools were reaching their limits under these conditions, and we had to both generalize the metrics definitions and rethink the basis of their application to make them usable in the new context while staying comparable with previous results.
This paper therefore describes both the generalization of the high-level metrics used in these two areas and the algorithmic underpinnings of their application in real tools.
2. Speaker Diarization
2.1. The task
Speaker Diarization, also called "who spoke when", is the process of identifying, for each speaker of an input audio recording, all the regions where he or she is talking. Each temporal region containing speech should be labeled with at least one speaker tag, and segments from the same speaker shall be labeled with the same tag. Speaker tags are not identities but abstract labels. As such, diarization technology is not a final applicative system, but rather a first step towards full speaker identification or, alternatively, the segmentation and clustering step preceding an automatic speech recognizer.
2.2. Diarization Error Rate
The main metric for diarization performance measurement is the Diarization Error Rate. It was introduced by NIST in 2000 within the Speaker Recognition evaluation [2] for their then-new speaker segmentation task. The metric is computed in two steps: the first step establishes a mapping between the speaker tags provided by the system and the speaker identities found in the reference; the second step then computes the error rate using that mapping.
Computing an error rate requires defining what the errors can be. Three error types are defined in the diarization context:
• The confusion error, when the system-provided speaker tag and the reference do not match through the mapping;
• The miss error, when speech is present in the reference but no speaker is present in the hypothesis;
• The false alarm error, when speech has incorrectly been detected by the system.
These errors happen on segments of speech. The durations of the segments are summed together, giving a time in error. Note that the miss and false alarm cases are slightly more complex when overlapping speech happens: false alarm or miss time is present when more (respectively fewer) speakers are present in the hypothesis than in the reference in a given time interval; it is not all confusion.
To get an error rate that can be compared between evaluations with different speech durations, the time in error must be divided by a normalization value. Multiple methods can define that normalization value; the NIST-proposed one follows two principles:
• The normalization value should not depend on the system hypothesis, but only on the reference;
• An empty hypothesis should have an error rate of 1.
The first principle is useful to avoid strange system tuning effects where increasing the divider can become more important than reducing the time in error. The second is an easy way to ensure decent comparability between results. In our case, with an empty hypothesis all the reference speech time ends up in the miss category, with the other two empty. That gives us the final DER definition:

\text{DER} = \frac{\text{confusion} + \text{miss} + \text{false alarm}}{\text{total reference speech time}} \quad (1)
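As a minimal illustration of Equation (1), the following Python sketch (hypothetical function and variable names, not the actual LNE tool) turns summed error times into a DER:

```python
def diarization_error_rate(confusion, miss, false_alarm, total_ref_speech):
    """Compute the DER of Equation (1) from summed error times (in seconds)."""
    # The normalization value only depends on the reference, so an empty
    # hypothesis (all reference speech missed) yields a DER of exactly 1.
    return (confusion + miss + false_alarm) / total_ref_speech

# Example: 120 s of confusion, 300 s missed and 60 s of false alarm over
# 3600 s of reference speech give a DER of about 0.133 (13.3%).
print(diarization_error_rate(120, 300, 60, 3600))
```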
That equation leaves open the question of the mapping. Of all the possible mappings, the one chosen is the one giving the lowest error rate.
A last point in building that metric is to take into account the intrinsic human imprecision. It is very hard to define a precise point in time when speech starts or stops, especially when background noise or overlapping speech is present. So some flexibility must be given to the reference speech boundaries. The NIST approach to cope with that problem is to remove from scoring an interval of +/- 250 ms around every reference speech boundary.
While this is a good solution for broadcast news-type speech where speech turns are long, we found that on more active debate-type shows, where overlapped speech happens naturally, up to 40% of the speech was removed from the evaluation. So we had to use a different method. We simply decided that for two speakers mapped together there could not be any miss or false alarm errors in a +/- 250 ms interval around the reference speaker speech boundaries. That method has two advantages:
• The tolerance only applies to the reference speaker that starts or stops speaking and its associated hypothesis speaker, and does not influence other speakers that are not in such a transition at the same time;
• When a reference speaker is not mapped, all of his or her time is missed, avoiding surprising results for an empty hypothesis.
2.3. Algorithmic approach to the computation of the metric
The NIST tools available to compute the DER metric use a heuristic to build the best possible mapping. They work quite well for broadcast news-type situations but sometimes fail when a lot of overlapping speech happens. They also often fail in cross-show conditions, where identical speakers have to be identified when they are present in different audio files. In the even more extreme case of video diarization, the equivalent of overlapping speech is having multiple persons on screen. That situation happens quite often in TV shows, and the tools just end up running in a seemingly infinite loop. So a more robust approach was needed.
The problem is to find a mapping that minimizes the total error. The error is built from the miss error, the false alarm error and the confusion error. The amount of miss error and false alarm error is almost independent of the mapping. One just has to find the intervals where the numbers of speakers in the reference and in the hypothesis differ and add that interval's duration for every extra speaker. The "almost" stems from the fuzzy frontier handling: having two speakers associated can slightly decrease the amount in error. Similarly, the confusion time is easy to compute when no speakers are associated between reference and hypothesis: for every interval, time in confusion is added for every potential speaker association, or, in mathematical terms, the interval duration multiplied by the minimum of the number of speakers in the reference and in the hypothesis. When two speakers are associated, the time in confusion is reduced by the amount of time in which the reference and hypothesis speakers are both active.
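A minimal sketch of that per-interval counting, assuming the recording has already been cut into homogeneous intervals where the sets of active reference and hypothesis speakers are constant, and ignoring the fuzzy-boundary tolerance (the interval structure and function name are illustrative, not taken from the actual tools):

```python
def base_error_times(intervals):
    """Accumulate miss, false alarm and confusion time before any speaker
    mapping, following the per-interval counting described above.

    `intervals` is an iterable of (duration, n_ref, n_hyp) tuples, one per
    homogeneous interval, where n_ref / n_hyp are the numbers of reference
    and hypothesis speakers active in that interval.
    """
    miss = false_alarm = confusion = 0.0
    for duration, n_ref, n_hyp in intervals:
        miss += duration * max(0, n_ref - n_hyp)         # missing speakers
        false_alarm += duration * max(0, n_hyp - n_ref)  # extra speakers
        confusion += duration * min(n_ref, n_hyp)        # no association yet
    return miss, false_alarm, confusion
```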
So the total error can be split into a base error, which is independent of the mapping and is computed when no speakers are mapped together, and a delta error that reduces it thanks to the mapping. Optimizing the error rate through the mapping is equivalent to maximizing the delta error. Said delta has an important property: it can be decomposed as a sum of individual, per-association values. If one takes a given mapping with a number of unassociated speakers, and two more speakers are associated together, the delta error is increased by:
• The amount of time in common between the speakers, which is a confusion reduction;
• The amount of time when only one of the speakers is present in the 250 ms intervals around the reference speaker speech frontiers, which is a miss or false alarm reduction.
Neither of these two values depends on the rest of the mapping.
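The dominant term of that per-association delta error, the time in common between a reference speaker and a hypothesis speaker, can be sketched as follows; segments are hypothetical (start, end) pairs and the boundary-tolerance term described above is omitted:

```python
def overlap(seg_a, seg_b):
    """Duration common to two (start, end) segments."""
    return max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))

def delta_error(ref_segments, hyp_segments):
    """Confusion-reduction part of the delta error for associating one
    reference speaker with one hypothesis speaker: the total time during
    which both are active. The additional +/-250 ms boundary-tolerance
    term is left out of this sketch.
    """
    return sum(overlap(r, h) for r in ref_segments for h in hyp_segments)
```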
We can then reformulate the mapping problem into first computing the individual per-association delta errors, then finding a set of associations that maximizes their sum. Defining:
• R and H as the sets of speakers of the reference and the hypothesis;
• E_{r,h}, r \in R, h \in H as the delta error for the association between speaker r of the reference and h of the hypothesis;
• A = \{(r_1, h_1), \ldots, (r_n, h_n)\} as a mapping, such that \forall (r, h) \in A, \forall (r', h') \in A, r = r' \Leftrightarrow h = h';
• M as the set of all possible mappings.
We are trying to compute:

\text{mapping} = \operatorname{argmax}_{A \in M} \sum_{(r,h) \in A} E_{r,h} \quad (2)
That computation is a good example of an Assignment Problem, which can be solved deterministically in O(n^3) time with the Hungarian algorithm. It is interesting to note that NIST has identified that algorithm as appropriate for the mapping estimation problem but does not seem to have actually used it in its tools.
Once the mapping is established, computing the final score poses no particular difficulty.
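A sketch of Equation (2) solved as an assignment problem, assuming a matrix of per-association delta errors has already been built (for instance with a delta_error-like function); it uses SciPy's linear_sum_assignment for illustration, which is not necessarily the solver used in the actual tools:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_mapping(ref_ids, hyp_ids, delta):
    """Pick the one-to-one speaker mapping maximizing the summed
    per-association delta errors (Equation 2).

    `delta[i, j]` holds the delta error for associating ref_ids[i]
    with hyp_ids[j]; the matrix may be rectangular.
    """
    delta = np.asarray(delta, dtype=float)
    rows, cols = linear_sum_assignment(delta, maximize=True)
    # Keep only associations that actually reduce the total error.
    return {ref_ids[i]: hyp_ids[j] for i, j in zip(rows, cols) if delta[i, j] > 0}
```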
2.4. Evaluation experiments
We applied this evaluation methodology in a number of evaluations, namely within Quaero, Etape and the ANR/DGA project Repere [5] on the full identification of speakers in TV shows. A very recent evaluation in that last project gives the best systems a performance of 11% DER in single-show diarization and 14% DER in cross-show diarization on a set of 3 hours of TV show extracts. These shows include debates with overlapping speech, with in addition a mix of recurring and non-recurring speakers between shows. As a comparison, not clustering speakers between the shows gives a cross-show DER in the high 50% range.
These figures indicate that the systems are quite efficient at their tasks, including the cross-show one. This is also confirmed in practice, with diarization being one of the main information sources for the Repere global task of fully identifying speakers in multiple shows.
2.5. Conclusion
The Diarization Error Rate is still a pertinent metric to measure the quality of a diarization system in more complex setups, including:
• Cross-show diarization, where recurring speakers in multiple shows have to be recognized;
• Overlapping speech.
The main difficulty in implementing that metric is establishing the speaker mapping between reference speaker IDs and hypothesis speaker tags. Reformulating the problem as an assignment problem allows for a deterministic, time-bounded solution with guaranteed optimality. In addition, the narrower way the error tolerance at reference speech frontiers is interpreted makes the metric usable in these new contexts where otherwise too much speech time could not be evaluated.
3. Automatic Speech Recognition
3.1. Word Error Rate
The Word Error Rate (WER) is kept as the primary evaluation metric for this evaluation. That metric basically counts the number of word deletions, insertions and substitutions in the output of the automatic transcription system compared to a reference transcription produced by humans.
More precisely, the word error rate can be computed as shown in Equation 3:

\text{WER} = \frac{S + D + I}{N} \quad (3)
where:
• S is the number of substitutions,
• D is the number of deletions,
• I is the number of insertions,
• N is the number of words in the reference transcription.
While simple in theory, applying this metric has some subtleties related to ambiguities in the language. In languages where not all letters are pronounced, entire text spans can have multiple valid interpretations and hence multiple written forms (in French, for instance, that case sometimes happens with genders and/or plurals). Also, some words or expressions can be written in multiple ways, like the English contractions. As a result, the comparison is in practice not between two sentences but between two directed acyclic graphs of words.
3.2. Word Error Rate computation
Computing the WER necessitates producing the best alignment between the reference and hypothesis graphs, where two paths are found in the graphs and most of the words on these paths are associated together. The associated words are either correct, where they are identical, or substitutions where they are not, and the unassociated words are insertions on the hypothesis side and deletions on the reference side. The best alignment is defined as the one with the lowest number of substitutions + insertions + deletions and, in case of equality, the highest number of words in the reference graph path. Multiple alignments are almost always possible, but following the definition of optimality they give the same final score, so choosing one of them is enough for scoring purposes, even if it may not be optimal from a diagnostic point of view.
Establishing the best alignment is a long-solved problem using dynamic programming. Both graphs are sorted in topological order and a 2D grid of scores is built using the flattened graphs as the X and Y axes, putting the words on the transitions from one line or column to the next. Scores are computed incrementally in every slot of the grid by trying multiple possible steps:
• Advance in both reference and hypothesis, giving a correct or a substitution;
• Advance in the reference graph only, giving a deletion;
• Advance in the hypothesis graph only, giving an insertion.
The best step, i.e. the one giving the smallest number of errors or, in case of equality, the largest number of reference words, is chosen at each grid slot. When multiple branches reach the same slot, their scores are compared to keep only the best one. The best score, and by backtracking the best alignment, is then found in the last slot.
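A minimal sketch of that dynamic program, simplified to plain word sequences instead of the word graphs used by the real tools, and returning only the error counts (ties are broken arbitrarily, which, as noted above, does not change the final score):

```python
def align_counts(ref_words, hyp_words):
    """Word-level alignment by dynamic programming over two sequences.
    Returns (substitutions, deletions, insertions)."""
    R, H = len(ref_words), len(hyp_words)
    # cost[i][j] = minimal number of errors aligning ref[:i] with hyp[:j]
    cost = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(1, R + 1):
        cost[i][0] = i                      # only deletions possible
    for j in range(1, H + 1):
        cost[0][j] = j                      # only insertions possible
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = cost[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            dele = cost[i - 1][j] + 1
            ins = cost[i][j - 1] + 1
            cost[i][j] = min(sub, dele, ins)
    # Backtrack to recover the error counts of one optimal alignment.
    s = d = ins_count = 0
    i, j = R, H
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1]
                + (ref_words[i - 1] != hyp_words[j - 1])):
            s += ref_words[i - 1] != hyp_words[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            d += 1
            i -= 1
        else:
            ins_count += 1
            j -= 1
    return s, d, ins_count

# Example: ref = "the cat sat".split(), hyp = "the cat sat down".split()
# gives (0, 0, 1), i.e. a WER of 1/3 per Equation (3).
```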
3.3. Speaker-attributed word error rate
The Word Error Rate computation we just presented requires hypothesis words to be attributed to individual speaker turns. When overlapping speech is not present, this is naturally done by associating hypothesis words to the speaker turn in which they temporally fall; words that do not fall in a speaker turn are insertions. But that is, of course, not directly possible in the case of overlapping speech since, by definition, multiple speaker turns may exist for the same point in time.
The most immediate way to solve this issue is to ask the ASR systems to provide a speaker tag, per the diarization definition, with every word. A diarization-type mapping between speakers of the reference and speaker tags of the hypothesis is built using the overlapping-speech compatible methodology presented in Section 2. Once this mapping is established, hypothesis words are only attributed to speaker turns where the speaker matches, removing any ambiguity. A standard WER computation can then be done.
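A minimal sketch of that word-attribution step, with hypothetical Word and Turn structures and a speaker_mapping dictionary obtained with the Section 2 methodology (none of these names come from the actual evaluation tools):

```python
from collections import namedtuple

# Hypothetical structures: a hypothesis word with a time stamp and a
# speaker tag, and a reference speaker turn.
Word = namedtuple("Word", "text time speaker_tag")
Turn = namedtuple("Turn", "speaker start end")

def words_for_turn(hyp_words, turn, speaker_mapping):
    """Keep only the hypothesis words that fall inside the turn and whose
    mapped speaker tag matches the turn's reference speaker; each turn is
    then scored with a standard WER computation."""
    return [w for w in hyp_words
            if turn.start <= w.time < turn.end
            and speaker_mapping.get(w.speaker_tag) == turn.speaker]
```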
That approach works but has two problems. The first is asking the systems to provide speaker tags. While such a request makes perfect sense from an applicative point of view, especially when coupled with a speaker identification system, not all laboratories have an integrated enough system to be able to provide them without massive refactoring. The second point, more problematic, is that a speaker error is handled in an extremely harsh manner: an otherwise correct word becomes a deletion in the correct speaker's turn and an insertion in the wrong one, and this even in the parts where no overlapping speech is present. As a result, the importance given to the quality of the diarization probably becomes too high compared to that given to the quality of the ASR.
3.4. Optimally speaker-attributed word error rate
To make the speaker attribution less of an issue, an alternative approach is to have the evaluation tool make the attribution itself. An additional dimension is added to the alignment creation: choosing which speaker the word should be associated to. The approach stays the same: of all possible alignments, including speaker associations, the one giving the best score is chosen. Thankfully, the dynamic programming methodology can be generalized to make the attribution sub-problem tractable.
Instead of having one hypothesis graph and one reference graph, we then have one hypothesis graph and multiple reference graphs, one for each speaker present in the time interval evaluated. The two-dimensional grid is generalized to an n-dimensional box where n is the number of reference speakers plus one. The possible steps are then generalized:
• Advance in one of the references and the hypothesis, giving a correct or a substitution, and assign the hypothesis word to the chosen reference speaker. This is only possible if the previously assigned word for that speaker does not overlap the current one;
• Advance in one of the reference graphs, giving a deletion;
• Advance in the hypothesis graph, giving an insertion.
In addition, the same branch folding happens when multiple branches end up in the same slot. Backtracking gives both the alignment and the word attribution. Insertions are not assigned to a specific reference speaker.
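As a heavily simplified sketch of that n-dimensional alignment, the following memoized recursion treats each reference speaker's turn as a plain word sequence, ignores the temporal non-overlap constraint and the word-graph alternatives, and only returns the minimal error count; it illustrates the shape of the search space rather than the actual tool:

```python
from functools import lru_cache

def optimal_attribution_errors(ref_turns, hyp_words):
    """`ref_turns` is a list of word sequences, one per reference speaker
    active in the interval; `hyp_words` is the hypothesis word sequence.
    Returns the minimal number of substitutions + deletions + insertions."""
    n_speakers = len(ref_turns)

    @lru_cache(maxsize=None)
    def best(hyp_pos, ref_pos):
        # ref_pos is a tuple holding one position per reference speaker.
        candidates = []
        if hyp_pos < len(hyp_words):
            # Insertion: advance in the hypothesis only.
            candidates.append(1 + best(hyp_pos + 1, ref_pos))
        for k in range(n_speakers):
            if ref_pos[k] < len(ref_turns[k]):
                advanced = ref_pos[:k] + (ref_pos[k] + 1,) + ref_pos[k + 1:]
                # Deletion: advance in reference speaker k only.
                candidates.append(1 + best(hyp_pos, advanced))
                if hyp_pos < len(hyp_words):
                    # Correct or substitution: attribute the hypothesis word
                    # to reference speaker k.
                    sub = ref_turns[k][ref_pos[k]] != hyp_words[hyp_pos]
                    candidates.append(sub + best(hyp_pos + 1, advanced))
        if not candidates:          # everything has been consumed
            return 0
        return min(candidates)

    return best(0, (0,) * n_speakers)
```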
3.5. Speaker-attributed with confusion
Optimal attribution works well, but explicit speaker tags are interesting from an applicative point of view. So an evaluation methodology taking these tags into account, yet less harsh than directly separating the words into independent turns, is useful. A cross of the two previous evaluation methodologies can be built which is pertinent for that case. First, the diarization-inspired mapping is built as for the speaker-attributed evaluation. Then the optimally speaker-attributed evaluation is done with an added error subtype to the correct/substitution case: wrong speaker. That is, the impact on the score of advancing on both one reference and the hypothesis includes a speaker comparison component, where a cost is attributed to having the wrong speaker. The value of said cost can be fixed depending on how important the correct speaker repartition is for the application. The rest of the methodology works identically.
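Compared to the optimal-attribution alignment, the only local change is the cost of a correct/substitution step, which gains a penalty when the mapped hypothesis speaker tag differs from the reference speaker. A tiny sketch of such a cost function; the penalty value of 0.5 is purely illustrative and would be set per application:

```python
def step_cost(ref_word, hyp_word, ref_speaker, mapped_hyp_speaker,
              speaker_penalty=0.5):
    """Local cost of advancing on both one reference and the hypothesis.
    `mapped_hyp_speaker` is the hypothesis speaker tag after applying the
    Section 2 diarization-style mapping; `speaker_penalty` is a tunable
    cost reflecting how much a wrong speaker attribution should weigh."""
    cost = 0.0 if ref_word == hyp_word else 1.0   # correct / substitution
    if mapped_hyp_speaker != ref_speaker:
        cost += speaker_penalty                   # wrong-speaker subtype
    return cost
```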
3.6. Evaluation experiments
These metrics were tried within the Etape evaluation and, while the results are not yet official and hence not yet publishable, it could be seen that for a system at 20% WER on non-overlapping speech, the optimized, non-speaker-attributed error rate was at 30-35%, which, in practice, corresponds to a roughly 100% error rate on the overlapping speech parts.
The speaker-attributed variant in the same condition is extremely rough, with a 50%+ error rate, showing the damage a wrong speaker cluster can do to the non-overlapping speech. Adding the confusion aspect dropped the error rate down to around 40-45%, pointing in the same direction.
3.7. Conclusion
The Word Error Rate, with all its flaws, is still the main metric used to assess the quality of automatic transcription systems. Applying it to overlapping speech requires first generalizing the metric, with the possibility of extending the task to combine diarization and transcription, and then generalizing the alignment methodology to take the new information into account.
The last presented variant, speaker-attributed word error rate with confusion estimation, seems a priori the most appropriate in an extended task setup. The optimal-attribution one is the only one usable for a traditional, non-speaker-attributed system output. In any case, systems will need to be much better on overlapping speech sections before the pertinence of the metrics can be really evaluated.
4. Global conclusions
We proposed generalizations of the usual metrics of diarization and automatic speech recognition to an extended context: cross-show speaker recognition and overlapping speech. More importantly, we described the implementation methodologies with their algorithmic underpinnings to ensure better reproducibility of the measurement results obtained.
The associated evaluation tools will be provided as part of the Etape corpus/evaluation package to be distributed by ELDA, and should also soon be available under the GPL as part of an LNE-created toolbox of evaluation tools.
5. Acknowledgments
This work was partially funded by a combination of:
• the Quaero Programme, funded by OSEO, the French State agency for innovation;
• the ANR Etape project;
• the ANR/DGA Repere project.
6. References
[1] D. S. Pallett, "Performance Assessment of Automatic Speech Recognizers," J. Res. National Bureau of Standards, vol. 90, no. 5, pp. 371–387, Sept.-Oct. 1985.
[2] NIST, "2000 Speaker Recognition Evaluation - Evaluation Plan," 2000. [Online]. Available: http://www.itl.nist.gov/iad/mig/tests/sre/2000/spk-2000-plan-v1.0.htm
[3] Q. Yang, Q. Jin, and T. Schultz, “Investigation of cross-show
speaker diarization,” in INTERSPEECH’11, 2011, pp. 2925–2928.
[4] G. Gravier, G. Adda, N. Paulsson, M. Carré, A. Giraudel, and O. Galibert, "The ETAPE corpus for the evaluation of speech-based TV content processing in the French language," in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), N. Calzolari (Conference Chair), K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, J. Odijk, and S. Piperidis, Eds. Istanbul, Turkey: European Language Resources Association (ELRA), May 2012.
[5] J. Kahn, O. Galibert, M. Carré, A. Giraudel, P. Joly, and L. Quintard, "The REPERE Challenge: finding people in a multimodal context," in Odyssey - The Speaker and Language Recognition Workshop, Singapore, June 2013.