The INTERSPEECH 2017 Computational Paralinguistics Challenge:
Addressee, Cold & Snoring
Björn Schuller1,2, Stefan Steidl3, Anton Batliner2,3, Elika Bergelson4, Jarek Krajewski5,
Christoph Janott6, Andrei Amatuni4, Marisa Casillas7, Amanda Seidl8, Melanie Soderstrom9,
Anne S. Warlaumont10, Guillermo Hidalgo5, Sebastian Schnieder5, Clemens Heiser6,
Winfried Hohenhorst11, Michael Herzog12, Maximilian Schmitt2, Kun Qian6, Yue Zhang1,6,
George Trigeorgis1, Panagiotis Tzirakis1, Stefanos Zafeiriou1,13
1Department of Computing, Imperial College London, UK
2Chair of Complex & Intelligent Systems, University of Passau, Germany
3Pattern Recognition Lab, FAU Erlangen-Nuremberg, Germany
4Psychology and Neuroscience, Duke University, USA
5University of Wuppertal, Germany
6Technische Universität München, Germany
7Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
8Speech, Language, and Hearing Sciences, Purdue University, USA
9Psychology, University of Manitoba, Canada
10Cognitive and Information Sciences, University of California, Merced, USA
11Clinic for ENT Medicine, Head and Neck Surgery, Alfried Krupp Krankenhaus, Essen, Germany
12Clinic for ENT Medicine, Head and Neck Surgery, Carl-Thiem-Klinikum, Cottbus, Germany
13University of Oulu, Finland
schuller@IEEE.org
Abstract
The INTERSPEECH 2017 Computational Paralinguistics Challenge addresses three different problems for the first time in a research competition under well-defined conditions: In the Addressee sub-challenge, it has to be determined whether speech produced by an adult is directed towards another adult or towards a child; in the Cold sub-challenge, speech under cold has to be told apart from ‘healthy’ speech; and in the Snoring sub-challenge, four different types of snoring have to be classified. In this paper, we describe these sub-challenges, their conditions, and the baseline feature extraction and classifiers, which include, for the first time in the challenge series, data-learnt feature representations obtained by end-to-end learning with convolutional and recurrent neural networks, as well as bag-of-audio-words.
Index Terms: Computational Paralinguistics, Challenge, Addressee, Child-Directed Speech, Speech under Cold, Snoring
1. Introduction
In this INTERSPEECH 2017 Computational Paralinguistics Challenge (ComParE), the ninth since 2009 [1], we address three new problems within the field of Computational Paralinguistics [2] in a challenge setting:
In the Addressee (A) Sub-Challenge, speech produced by an adult has to be classified as directed either to an adult or to a child. A possible application is the monitoring of adult-child (parent-child) interaction [3]; it is well known that even infants should be exposed to rich verbal interaction.
In the Cold (C) Sub-Challenge, speech under cold has to be told apart from speech under ‘normal’ health conditions. This can enable the monitoring of call-centre and other telephone interactions in order to better predict the propagation of a cold [4, 5].
In the Snoring (S) Sub-Challenge, a four-class classification of snoring sounds has to be performed. Identification of the type of snoring [6, 7] can be highly useful for a targeted and thus successful medical treatment [8].
For all tasks, a target value/class has to be predicted for each case. Contributors can employ their own features and machine learning algorithms; standard feature sets and procedures are provided that may be used. Participants have to use predefined training/development/test splits for each sub-challenge. They may report development results obtained from the training set (preferably with the supplied evaluation setups), but have only a limited number of five trials to upload their results on the test sets, whose labels are unknown to them. Each participation must be accompanied by a paper presenting the results, which undergoes peer review and has to be accepted for the conference in order to participate in the Challenge. The organisers reserve the right to re-evaluate the findings, but will not participate in the Challenge. As evaluation measure, we employ Unweighted Average Recall (UAR), as used since the first Challenge held in 2009 [1], especially because it is more adequate for (more or less unbalanced) multi-class classifications than Weighted Average Recall (i.e., accuracy).
Section 2 describes the challenge corpora. Section 3 details the baseline experiments, metrics, and results; a short conclusion is given in Section 4.
2. Challenge Corpora
2.1. Addressee (A)
In this sub-challenge, we introduce the HomeBank Child/Adult Addressee Corpus (HB-CHAAC) (see Table 1).
Table 1: Databases: number of instances per class in the train/devel/test splits used for the Challenge; CD: child-directed, AD: adult-directed, C: cold, NC: non-cold, V: velum, O: oropharyngeal lateral walls, T: tongue base, E: epiglottis.

#     Train    Devel    Test                        Σ
HomeBank Child/Adult Addressee Corpus (HB-CHAAC)
CD    2 302    2 182    blinded during challenge
AD    1 440    1 368    blinded during challenge
Σ     3 742    3 550    3 594                       10 886
Upper Respiratory Tract Infection Corpus (URTIC)
C       970    1 011    blinded during challenge
NC    8 535    8 585    blinded during challenge
Σ     9 505    9 596    9 551                       28 652
Munich-Passau Snore Sound Corpus (MPSSC)
V       168      161    blinded during challenge
O        76       75    blinded during challenge
T         8       15    blinded during challenge
E        30       32    blinded during challenge
Σ       282      283    263                         828
The task is to differentiate between speech produced by an adult that is directed to a child (child-directed speech, CDS) and speech directed to another adult (adult-directed speech, ADS). CDS is understood to have particular acoustic-phonetic and linguistic characteristics that distinguish it from ADS and is theorised to play a critical role in promoting language development (e.g., [9] and references therein). However, to date there have been few formal attempts to discriminate these forms of speech computationally (cf. [10, 11]). Furthermore, analyses of CDS vs ADS have been restricted to highly constrained contexts. HB-CHAAC consists of a set of conversations (see below) selected from a much larger corpus of real-world child language recordings known as HomeBank [12] (homebank.talkbank.org). A set of 20 such conversations was selected from a subset of the available HomeBank recordings, drawn from the following corpora: [13], [14], and [15]. The subset of recordings that were sampled (61 homes in total across four cities in North America) featured: North American English as the primary language being spoken, typically-developing children, participants who granted permission to share the audio with the research community, and a spread of ages sampled as uniformly as possible between 2 and 24 months and across the four contributing laboratory datasets, with each child only sampled once, cf. Table 1. Recordings were collected using the LENA recording device and software [16], which provides automated identification of ‘conversational blocks’ (bouts of audio identified as speech bounded by 5 s of non-speech on either end); these were used to select the 1 220 conversations, consisting of a total of 2 523 minutes of recording. Individual adult speaker audio clips within each conversation (as identified by the LENA algorithm’s speaker diarisation) were then hand-annotated for the challenge. Three trained research assistants judged whether each clip was directed to a child (CDS) or an adult (ADS) using both acoustic-phonetic information and context (see https://osf.io/d9ac4/ for more detail). Clips deemed to be non-speech, not produced by an adult, or ambiguous between CDS and ADS were excluded using a “Junk” category. All CDS and ADS clips were additionally labelled by the research assistants as to whether the speaker was male or female. Annotators achieved high reliability in differentiating CDS/ADS (Fleiss’ kappa > .75, p < .001).
Table 2: Distribution of recordings and child age across the four contributing datasets; rec.: recordings, range: age range in months (mean).

Sub-Corpus            # of rec.   range
Bergelson Seedlings   44          6-17 (11.75)
McDivitt              7           4-19 (11.29)
VanDam2               3           5-18 (12.33)
Warlaumont            7           2-9  (3.71)
2.2. Cold (C)
For this sub-challenge, the Upper Respiratory Tract Infection Corpus (URTIC) is provided by the Institute of Safety Technology, University of Wuppertal, Germany (see also Table 1). The corpus consists of recordings of 630 subjects (382 male, 248 female; mean age 29.5 years, standard deviation 12.1 years, range 12-84 years), made in quiet rooms with a microphone/headset/hardware setup (sample rate 44.1 kHz, downsampled to 16 kHz, quantisation 16 bit).
The participants had to complete different tasks, presented to them on a computer monitor. They were asked to read out short stories, e.g., The North Wind and the Sun (widely used within the field of phonetics) and Die Buttergeschichte (‘The Butter Story’, a standard reading passage in German used in speech and language pathology). Furthermore, the participants produced voice commands as needed, e.g., for controlling driver assistance systems, and numbers from 1 to 40. Besides scripted speech, spontaneous narrative speech was recorded: subjects were asked to talk briefly about, e.g., their last weekend or their best vacation, or to describe a picture. A whole session lasted between 15 minutes and 2 hours, and the number of tasks varied across subjects. The available recordings were split into 28 652 chunks with a duration between 3 s and 10 s.
To obtain the state of health, each participant completed a one-item measure based on the German version of the Wisconsin Upper Respiratory Symptom Survey (WURSS-24) [17], assessing the symptoms of common cold. The global illness severity item (on a scale from 0 = not sick to 7 = severely sick) was binarised using a threshold at 6. According to this binary label, the chunks were divided into speaker-independent partitions (balanced w.r.t. gender, age, and experimenter) with 210 speakers in each partition. In each of the training and development partitions, 37 participants had a cold and 173 did not. The number of chunks per subject varies; the total duration is approximately 45 hours.
2.3. Snoring (S)
The Munich-Passau Snore Sound Corpus (MPSSC) is introduced for the classification of snore sounds by their excitation location within the upper airways (UA). Snoring is generated by vibrating soft tissue in the UA during inspiration in sleep. Although simple snoring is not harmful for the snorer him- or herself, it can affect the sleep quality of the bed partner and cause social disturbance. There are numerous conservative and surgical methods attempting to improve or cure snoring, many of them showing only moderate success. Key to better clinical results is a treatment exactly targeting the area in the UA where the snoring sound is generated in the individual patient. The basic material for the corpus is uncut recordings from Drug Induced Sleep Endoscopy (DISE) examinations from three medical centres, recorded between 2006 and 2015. Recording equipment, microphone type, and location differ between the medical centres, and so do the background noise characteristics.
During a DISE procedure, a flexible nasopharyngoscope is introduced into the UA while the patient is in a state of artificial sleep. Vibration mechanisms and locations can be observed while video and audio signals are recorded. DISE is an established diagnostic tool, but it has a number of disadvantages: it is time-consuming, puts the patient under strain, and cannot be performed during natural sleep. It is therefore desirable to develop alternative methods for the classification of snore sounds, e.g., based on acoustic features.
More than 30 hours of DISE recordings have been automatically screened for audio events. The extracted events were manually selected; non-snore events and events disturbed by non-static background noise (such as speech or signals from medical equipment) were discarded. The remaining snore events were classified by ear, nose, and throat experts based on findings from the video recordings. Only events with a clearly identifiable, single site of vibration and without obstructive disposition were included in the database. Four classes are defined based on the VOTE scheme, a widely used scheme distinguishing four structures that can be involved in airway narrowing and obstruction [18, 19]:
V: Velum (palate), including soft palate, uvula, and lateral velopharyngeal walls;
O: Oropharyngeal lateral walls, including palatine tonsils;
T: Tongue, including tongue base and airway posterior to the tongue base;
E: Epiglottis.
The resulting database contains audio samples (raw PCM, sample rate 16 kHz, quantisation 16 bit) of 828 snore events from 219 subjects (see Table 1). The number of events per class in the database is strongly unbalanced, with 84 % of samples from the classes V and O, 11 % E-events, and 5 % T-snores. This is in line with the likelihood of occurrence during normal sleep [20, 21]. Nevertheless, all classes are equally important for therapy decisions: snoring originating from the tongue base (T) or from the epiglottis (E) requires distinctly different treatment than velum or oropharyngeal snoring.
3. Experiments and Results
3.1. End-to-end Learning
For the first time in a ComParE challenge, we provide results using end-to-end learning (e2e) models. These deep models have had huge success in the vision community and, more recently, in speech applications such as emotion recognition [22], speaker verification [23], speech recognition [24], and further audio analysis tasks (e.g., [25]). An attractive characteristic of these models is that the optimal features for a given task can be learnt purely from the data at hand, i.e., we aim to learn the optimal features and the classifier simultaneously in a single optimisation problem. Similar to [22], we use a convolutional network to extract features from the raw time representation, followed by a recurrent network (LSTM) which performs the final classification. For training the network, we split the raw waveform into chunks of 40 ms each, as a compromise. These are fed into a convolutional network comprising a series of alternating convolution and pooling operations, which aims to find a robust representation of the original signal (cf. the participant scripts). The extracted features are subsequently fed to M LSTM layers (cf. Table 3), which compress the temporal signal into a single final hidden state of the recurrent network; this state is then used to perform the final classification (a detailed implementation of these models can be found at https://github.com/trigeorgis/ComParE2017). As these models rely purely on the statistics of the available data to learn the optimal features, we assume the available data to contain a large amount of variation. We expect that the performance of the e2e models can be improved by using smart data augmentation techniques modelling the data distribution properly.
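To make the pipeline concrete, the following is a minimal PyTorch sketch of such a convolutional-recurrent model operating on raw 40 ms waveform chunks. It is not the official challenge implementation (that is available at the repository linked above); all layer sizes, kernel widths, and the hidden dimension are placeholder assumptions chosen for illustration.

```python
# Minimal sketch of a CNN + LSTM end-to-end classifier on raw audio
# (illustrative only; layer sizes are assumptions, not the official ComParE setup).
import torch
import torch.nn as nn

class EndToEndModel(nn.Module):
    def __init__(self, n_classes, n_lstm_layers=2):
        super().__init__()
        # Convolution/pooling stack: learns a representation directly from
        # 40 ms chunks of the raw waveform (640 samples at 16 kHz).
        self.conv = nn.Sequential(
            nn.Conv1d(1, 40, kernel_size=80, stride=4), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(40, 40, kernel_size=10), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),            # one 40-dim feature vector per chunk
        )
        # Recurrent part: M LSTM layers compress the chunk sequence
        # into a final hidden state used for classification.
        self.lstm = nn.LSTM(40, 64, num_layers=n_lstm_layers, batch_first=True)
        self.out = nn.Linear(64, n_classes)

    def forward(self, x):
        # x: (batch, n_chunks, samples_per_chunk), raw audio split into 40 ms chunks
        b, t, s = x.shape
        feats = self.conv(x.reshape(b * t, 1, s)).reshape(b, t, -1)
        _, (h, _) = self.lstm(feats)
        return self.out(h[-1])                  # logits from the last hidden state

model = EndToEndModel(n_classes=4, n_lstm_layers=2)   # e.g., Snoring: V/O/T/E
logits = model(torch.randn(8, 50, 640))               # 8 segments of 50 chunks each
```

In the baseline, the number of LSTM layers M is varied between 2 and 3 (cf. Table 3).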
3.2. ComParE Acoustic Feature Set
The official baseline feature set is the same as that used in the four previous editions of the INTERSPEECH ComParE challenges [26, 27, 28, 29]. This feature set contains 6 373 static features resulting from the computation of various functionals over low-level descriptor (LLD) contours. The configuration file is IS13_ComParE.conf, which is included in the 2.1 public release of openSMILE [30, 31]. A full description of the feature set can be found in [32].
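For illustration, extracting this feature set for a single file typically amounts to one call of openSMILE's SMILExtract command-line tool with the above configuration. The Python wrapper below is a minimal sketch under that assumption; the file names and paths are placeholders, not part of the challenge package.

```python
# Minimal sketch: extract the 6 373-dimensional ComParE functionals for one file
# with openSMILE's SMILExtract tool (paths and file names are placeholder assumptions).
import subprocess

def extract_compare_functionals(wav_path, arff_path,
                                config="config/IS13_ComParE.conf",
                                smile_bin="SMILExtract"):
    """Run openSMILE; one ARFF instance is appended per input file."""
    subprocess.run([smile_bin,
                    "-C", config,    # ComParE acoustic feature set configuration
                    "-I", wav_path,  # input audio
                    "-O", arff_path  # ARFF output (appended per call)
                    ], check=True)

extract_compare_functionals("train_0001.wav", "features_train.arff")
```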
3.3. Bag-of-Audio-Words
In addition to the default ComParE feature set, where functionals (statistics) are applied to the acoustic LLDs, we provide bag-of-audio-words (BoAW) features. BoAW has already been applied successfully to, e.g., acoustic event detection [33], speech-based emotion recognition [34], and the classification of snore sounds [35]. Audio chunks are represented as histograms of acoustic LLDs, after quantisation based on a codebook. One codebook is learnt for the 65 LLDs from the ComParE feature set and one for the 65 deltas of these LLDs. In Table 3, results are given for different codebook sizes. Codebook generation is done by random sampling from the LLDs in the training data. When fusing training and development data for the final model, the codebook is learnt again from the fused data. The LLDs have been extracted with the openSMILE toolkit [31]; the BoAW features have been computed using openXBOW [36].
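As an illustration of these steps, the NumPy sketch below builds a codebook by random sampling of training LLD frames, assigns each frame to several of its closest codewords, and accumulates a normalised histogram per audio chunk. It is a simplified stand-in for what openXBOW does in the baseline; the codebook size, the number of assignments, and the normalisation here are assumptions chosen for illustration.

```python
# Illustrative bag-of-audio-words pipeline (NumPy sketch; openXBOW is used
# for the actual baseline). Frame-level LLDs -> codebook -> histogram per chunk.
import numpy as np

rng = np.random.default_rng(0)

def build_codebook(train_llds, size=500):
    # Codebook generation by random sampling of LLD frames from the training data.
    idx = rng.choice(len(train_llds), size=size, replace=False)
    return train_llds[idx]

def boaw_histogram(chunk_llds, codebook, n_assign=10):
    # For each frame, assign the n_assign closest codewords, then count occurrences.
    dists = np.linalg.norm(chunk_llds[:, None, :] - codebook[None, :, :], axis=-1)
    nearest = np.argsort(dists, axis=1)[:, :n_assign]
    hist = np.bincount(nearest.ravel(), minlength=len(codebook)).astype(float)
    return hist / hist.sum()   # simple normalisation; other scalings are possible

# Example with random stand-in data: 65 ComParE LLDs per frame
train_llds = rng.standard_normal((10000, 65))
codebook = build_codebook(train_llds, size=500)
chunk = rng.standard_normal((300, 65))          # one audio chunk, 300 frames
features = boaw_histogram(chunk, codebook)      # 500-dimensional BoAW representation
```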
3.4. Basics for the Challenge Baselines
The primary evaluation measure for the sub-challenges (all being classification tasks) is Unweighted Average Recall (UAR). The motivation for considering unweighted rather than weighted average recall (‘conventional’ accuracy) is that it remains meaningful for highly unbalanced distributions of instances among classes (as is the case for the Snoring sub-challenge).
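For reference, UAR is simply the unweighted mean of the per-class recalls; a minimal implementation could look as follows.

```python
# Unweighted Average Recall (UAR): mean of the per-class recalls,
# independent of how many instances each class has.
import numpy as np

def uar(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Example: always predicting the majority class of a four-class task
# yields only 25 % UAR, however unbalanced the data are.
print(uar(["V", "V", "O", "T", "E"], ["V", "V", "V", "V", "V"]))  # 0.25
```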
For the sake of transparency and reproducibility of the baseline computation, we use open-source implementations of the data mining algorithms (WEKA 3, revision 3.8.1 [37]) for functionals and BoAW; in line with previous years, the machine learning paradigm chosen is Support Vector Machines (SVM), in particular WEKA's SVM implementation with linear kernels. In all tasks, Sequential Minimal Optimisation (SMO [38]) as implemented in WEKA was used as the training algorithm.
Features were scaled to zero mean and unit standard deviation (option -N 1 for WEKA's SMO), using the parameters from the training set (when multiple folds were used for development, the parameters were calculated on the training set of each fold). For all tasks, the complexity parameter C was optimised during the development phase.
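The official baselines are computed with WEKA's SMO as described above; purely for illustration, an analogous pipeline in scikit-learn (standardisation with training-set parameters followed by a linear SVM whose complexity C is tuned on the development set against UAR) might look like the sketch below. This is a stand-in under those assumptions, not the challenge code, and the complexity grid is only an example.

```python
# Illustrative scikit-learn analogue of the baseline training/evaluation loop
# (the official baselines use WEKA's SMO with a linear kernel; this is not that code).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.metrics import recall_score

def tune_complexity(X_train, y_train, X_dev, y_dev,
                    complexities=(1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1)):
    # Scale to zero mean / unit standard deviation with training-set parameters only.
    scaler = StandardScaler().fit(X_train)
    X_train_s, X_dev_s = scaler.transform(X_train), scaler.transform(X_dev)
    results = {}
    for c in complexities:
        clf = LinearSVC(C=c).fit(X_train_s, y_train)
        # recall_score with average="macro" is exactly UAR
        results[c] = recall_score(y_dev, clf.predict(X_dev_s), average="macro")
    return results

# Toy usage with random stand-in data (6 373 ComParE functionals per instance)
rng = np.random.default_rng(0)
X_tr, y_tr = rng.standard_normal((100, 6373)), rng.integers(0, 2, 100)
X_dv, y_dv = rng.standard_normal((50, 6373)), rng.integers(0, 2, 50)
print(tune_complexity(X_tr, y_tr, X_dv, y_dv))
```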
Results for late fusion are also reported. We fused the predictions of the 3-layer e2e model and the ComParE functionals and BoAW models that performed best on the respective development partitions. For the fusion of two models, for each instance, the label with the highest confidence was chosen. For the fusion of all three models, we selected the final prediction according to two different rules: the label with the highest sum of confidences (conf.) and a majority vote (maj.).
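These fusion rules can be sketched as follows, assuming each model provides either a single (label, confidence) pair or a per-class confidence score for an instance; the variable names and toy values are placeholders for illustration.

```python
# Late-fusion rules as described above, sketched for one instance.
from collections import Counter

def fuse_two(pred_a, conf_a, pred_b, conf_b):
    # Two models: take the label predicted with the highest confidence.
    return pred_a if conf_a >= conf_b else pred_b

def fuse_confidence_sum(per_class_confidences):
    # Three models, rule "conf.": label with the highest summed confidence.
    totals = Counter()
    for conf in per_class_confidences:       # one dict {label: confidence} per model
        totals.update(conf)
    return totals.most_common(1)[0][0]

def fuse_majority(predictions):
    # Three models, rule "maj.": majority vote over the predicted labels.
    return Counter(predictions).most_common(1)[0][0]

print(fuse_two("C", 0.8, "NC", 0.6))                    # "C"
print(fuse_confidence_sum([{"C": 0.6, "NC": 0.4},
                           {"C": 0.3, "NC": 0.7},
                           {"C": 0.8, "NC": 0.2}]))     # "C"
print(fuse_majority(["C", "NC", "C"]))                  # "C"
```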
Table 3: Results for the three sub-challenges. The official baselines for test are marked with an asterisk (*). Dev: Development. M: number of LSTM layers in end-to-end (e2e) learning. C: complexity parameter of the SVM. N: codebook size of Bag-of-Audio-Words (BoAW), splitting the input into two codebooks (ComParE-LLDs/ComParE-LLD-Deltas), with 10 assignments per frame and optimised complexity parameter of the SVM. UAR: Unweighted Average Recall.

UAR [%]                 Addressee        Cold             Snoring
                        Dev     Test     Dev     Test     Dev     Test
M       e2e: CNN + LSTM
2                       59.8    60.1     59.1    60.0     37.0    37.9
3                       60.9    59.1     58.6    59.6     40.3    40.3
C       ComParE functionals + SVM
10^-6                   55.8    65.8     62.9    63.9     29.3    48.4
10^-5                   60.5    67.7     64.0    70.2     31.1    51.4
10^-4                   61.8    67.6     61.7    66.5     40.6    58.5*
10^-3                   59.4    64.6     58.1    61.9     39.2    55.6
10^-2                   57.4    60.9     58.8    59.5     39.2    55.6
10^-1                   57.4    59.6     60.0    58.4     39.2    55.6
N       ComParE BoAW + SVM
125/125                 63.2    67.5     55.9    62.8     43.8    48.7
250/250                 61.4    66.6     62.8    66.5     46.6    49.9
500/500                 62.4    68.2     63.9    66.7     44.2    51.2
1000/1000               62.2    67.2     64.2    67.3     42.8    50.0
2000/2000               63.4    67.7     64.1    67.3     41.0    48.3
4000/4000               63.4    68.2     63.8    67.2     39.8    48.2
8000/8000               63.3    68.3     64.0    69.7     36.6    47.8
Models  Late fusion
e2e+func                66.3    69.0     62.6    64.8     38.9    55.8
e2e+BoAW                67.8    68.4     62.7    62.5     45.1    46.0
func+BoAW               62.8    68.7     64.2    70.1     42.1    52.4
All (conf.)             66.4    70.2*    66.1    70.7     43.5    53.0
All (maj.)              64.0    68.0     65.2    71.0*    43.4    55.6
Each sub-challenge package includes scripts that allow participants to reproduce the baselines and to perform the testing in a reproducible and automatic way (including pre-processing, model training, model evaluation, and scoring by the competition measure and further measures).
3.5. Baselines
This year, we introduced several new approaches: besides the usual ComParE features plus SVM, we employ e2e learning and BoAW plus SVM; additionally, we present different late fusion results. By doing so, we as organisers effectively have more than five trials available, while at the same time opening new avenues of research for the participants. This comes at a cost: the results given in Table 3 reveal a dilemma. If we followed “the rules of the game” and took those results for test that correspond to the best results on Dev(elopment), we would end up with challenge baselines that are markedly below other test results shown in the table. Participants could then surpass the official baseline simply by repeating and/or slightly modifying the procedures leading to the better results in Table 3. To avoid this, we decided to choose the best test result for each sub-challenge as the official baseline. These results are still obtained with non-optimised standard procedures; thus, there is ample room for surpassing these figures.
As can be seen in Table 3, the baseline is UAR = 70.2 % for the Addressee sub-challenge, UAR = 71.0 % for the Cold sub-challenge, and UAR = 58.5 % for the Snoring sub-challenge. All but two of the 60 classifications (20 for each of the three tasks) show the expected gain on Test in comparison to Dev, due to the increased training set (Train plus Dev). This is most pronounced for Snoring when functionals are employed (up to > 20 % absolute difference). Note that the four classes in this task display a highly unbalanced distribution, and that we use UAR, which means that misclassifying a few cases (due to different acoustic properties of a few subjects), or modelling such cases better in a sparse class, influences UAR to a larger extent than weighted average recall. Two factors might be responsible for the mismatch between best Dev and best Test: (1) across the 20 classifications for each task, this might happen once simply by chance, cf. the Addressee and Cold tasks; (2) the marked difference that we only observe for Snoring might also be due to the unbalanced distribution between classes.
4. Conclusion
This year's challenge is new in several respects: besides three new tasks (Addressee, Cold, and Snoring), all of them highly relevant for applications, we introduced several new procedures. Both e2e learning and BoAW are less knowledge-based, because they either do not need the usual extraction of features but only the time signal (e2e), or they do not need a lexicon but generate a quasi-word representation themselves (BoAW). The learning procedures employed for functionals and BoAW are standard: competitive, but not optimised, and intentionally kept generic across all tasks in order to provide transparent and easily reproducible processing steps. The inclusion of e2e follows the evolution of other machine learning tasks, where deep learning has led to large gains. Here, however, it is not competitive by itself. Pre-training with large corpora and domain adaptation techniques could change this, but this was not considered here, to maintain transparency and consistency in the baselines.
For all computation steps, scripts are provided that can, but need not, be used by the participants. We expect participants to obtain considerably better performance by employing novel (combinations of) procedures and features, including ones tailored to the particular tasks. Beyond the tasks featured in this challenge series, there remains a broad variety of further information conveyed in the acoustics of speech and in the spoken words themselves that has not been dealt with, either at all or in a well-defined competition framework. Much of this, however, bears great application potential and remains to be investigated more closely.
5. Acknowledgements
This research has received funding from the EU's Framework Programme HORIZON 2020 Grants No. 115902 (RADAR-CNS) and No. 645378 (ARIA-VALUSPA), the EU's 7th Framework Programme ERC Starting Grant No. 338164 (iHEARu), as well as from SSHRC Insight Grant #435-2015-0628, ERC Advanced Grant INTERACT (269484), NIH DP5-OD019812, NSF SBE-1539129, and NSF BCS-1529127. Further, the support of the EPSRC Centre for Doctoral Training in High Performance Embedded and Distributed Systems (HiPEDS, Grant Reference EP/L016796/1) is gratefully acknowledged. The authors thank the research assistants who provided the HB-CHAAC labels and Kelsey Dyck for help in developing the labelling protocol, as well as the sponsors of the Challenge: audEERING GmbH and the Association for the Advancement of Affective Computing (AAAC). The responsibility lies with the authors.
6. References
[1] B. Schuller, A. Batliner, S. Steidl, and D. Seppi, "Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first Challenge," Speech Communication, Special Issue on Sensing Emotion and Affect – Facing Realism in Speech Processing, vol. 53, pp. 1062–1087, 2011.
[2] B. Schuller and A. Batliner, Computational Paralinguistics – Emotion, Affect, and Personality in Speech and Language Processing. Chichester, UK: Wiley, 2014.
[3] M. L. Rowe, "Child-directed speech: relation to socioeconomic status, knowledge of child development and child vocabulary skill," Journal of Child Language, vol. 35, no. 1, pp. 185–205, 2008.
[4] P. J. Birrell, G. Ketsetzis, N. J. Gay, B. S. Cooper, A. M. Presanis, R. J. Harris, A. Charlett, X.-S. Zhang, P. J. White, R. G. Pebody et al., "Bayesian modeling to unmask and predict influenza A/H1N1pdm dynamics in London," Proceedings of the National Academy of Sciences, vol. 108, no. 45, pp. 18238–18243, 2011.
[5] D. Lazer, R. Kennedy, G. King, and A. Vespignani, "The parable of Google Flu: traps in big data analysis," Science, vol. 343, no. 6176, pp. 1203–1205, 2014.
[6] T. Mikami, Y. Kojima, M. Yamamoto, and M. Furukawa, "Recognition of breathing route during snoring for simple monitoring of sleep apnea," in Proceedings of SICE Annual Conference. IEEE, 2010, pp. 3433–3434.
[7] D. L. Herath, U. R. Abeyratne, and C. Hukins, "HMM-based snorer group recognition for sleep apnea diagnosis," in Proceedings of EMBC. IEEE, 2013, pp. 3961–3964.
[8] C. Janott, B. Schuller, and C. Heiser, "Acoustic information in snoring noises," HNO, vol. 65, pp. 107–116, 2017.
[9] M. Soderstrom, "Beyond babytalk: Re-evaluating the nature and content of speech input to preverbal infants," Developmental Review, vol. 27, pp. 501–532, 2007.
[10] A. Batliner, B. Schuller, S. Schaeffler, and S. Steidl, "Mothers, Adults, Children, Pets – Towards the Acoustics of Intimacy," in Proceedings of ICASSP, Las Vegas, NV, 2008, pp. 4497–4500.
[11] S. Schuster, S. Pancoast, M. Ganjoo, M. C. Frank, and D. Jurafsky, "Speaker-independent detection of child-directed speech," in Spoken Language Technology Workshop (SLT), South Lake Tahoe, CA, 2014, pp. 366–371.
[12] M. VanDam, A. Warlaumont, E. Bergelson, A. Cristia, M. Soderstrom, P. De Palma, and B. MacWhinney, "HomeBank: An online repository of daylong child-centered audio recordings," Seminars in Speech and Language, vol. 37, no. 2, pp. 128–142, 2016.
[13] E. Bergelson, "Bergelson Seedlings HomeBank corpus," 2016, doi:10.21415/T5PK6D.
[14] K. McDivitt and M. Soderstrom, "McDivitt HomeBank corpus," 2016, doi:10.21415/T5KK6G.
[15] A. S. Warlaumont and G. M. Pretzer, "Warlaumont HomeBank corpus," 2016, doi:10.21415/T54S3C.
[16] C. R. Greenwood, K. Thiemann-Bourque, D. Walker, J. Buzhardt, and J. Gilkerson, "Assessing children's home language environments using automatic speech recognition technology," Communication Disorders Quarterly, vol. 32, pp. 83–92, 2011.
[17] B. Barrett, R. L. Brown, M. P. Mundt, G. R. Thomas, S. K. Barlow, A. D. Highstrom, and M. Bahrainian, "Validation of a short form Wisconsin Upper Respiratory Symptom Survey (WURSS-21)," Health and Quality of Life Outcomes, vol. 7, p. 76, 2009.
[18] N. Charakorn and E. Kezirian, "Drug-Induced Sleep Endoscopy," Otolaryngol Clin North Am., vol. 49, pp. 1359–1372, 2016.
[19] E. Kezirian, W. Hohenhorst, and N. de Vries, "Drug-induced sleep endoscopy: the VOTE classification," Eur Arch Otorhinolaryngol., vol. 268, pp. 1233–1236, 2011.
[20] N. Hessel and N. de Vries, "Diagnostic work-up of socially unacceptable snoring. II. Sleep endoscopy," Eur Arch Otorhinolaryngol., vol. 259, pp. 158–161, 2002.
[21] J. A. Fiz and R. Jane, "Snoring Analysis. A Complex Question," J Sleep Disor: Treat Care, vol. 1, pp. 1–3, 2012.
[22] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou, "Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network," in Proceedings of ICASSP. IEEE, 2016, pp. 5200–5204.
[23] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, "End-to-end text-dependent speaker verification," in Proceedings of ICASSP. IEEE, 2016, pp. 5115–5119.
[24] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," 2016.
[25] P. Smaragdis, "End-to-end music transcription using a neural network," Journal of the Acoustical Society of America, vol. 140, pp. 3038–3038, 2016.
[26] B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani, F. Weninger, F. Eyben, E. Marchi, M. Mortillaro, H. Salamin, A. Polychroniou, F. Valente, and S. Kim, "The INTERSPEECH 2013 Computational Paralinguistics Challenge: Social Signals, Conflict, Emotion, Autism," in Proceedings of INTERSPEECH, Lyon, France, 2013, pp. 148–152.
[27] B. Schuller, S. Steidl, A. Batliner, J. Epps, F. Eyben, F. Ringeval, E. Marchi, and Y. Zhang, "The INTERSPEECH 2014 Computational Paralinguistics Challenge: Cognitive & Physical Load," in Proceedings of INTERSPEECH, Singapore, 2014, pp. 427–431.
[28] B. Schuller, S. Steidl, A. Batliner, S. Hantke, F. Hönig, J. R. Orozco-Arroyave, E. Nöth, Y. Zhang, and F. Weninger, "The INTERSPEECH 2015 Computational Paralinguistics Challenge: Degree of Nativeness, Parkinson's & Eating Condition," in Proceedings of INTERSPEECH, Dresden, Germany, 2015, pp. 478–482.
[29] B. Schuller, S. Steidl, A. Batliner, J. Hirschberg, J. K. Burgoon, A. Baird, A. Elkins, Y. Zhang, E. Coutinho, and K. Evanini, "The INTERSPEECH 2016 Computational Paralinguistics Challenge: Deception, Sincerity & Native Language," in Proceedings of INTERSPEECH, 2016, pp. 2001–2005.
[30] F. Eyben, M. Wöllmer, and B. Schuller, "openSMILE – The Munich Versatile and Fast Open-Source Audio Feature Extractor," in Proceedings of ACM Multimedia. Florence, Italy: ACM, 2010, pp. 1459–1462.
[31] F. Eyben, F. Weninger, F. Groß, and B. Schuller, "Recent developments in openSMILE, the Munich open-source multimedia feature extractor," in Proceedings of ACM MM, Barcelona, Spain, 2013, pp. 835–838.
[32] F. Eyben, Real-time Speech and Music Classification by Large Audio Feature Space Extraction, ser. Springer Theses. Switzerland: Springer International Publishing, 2015.
[33] H. Lim, M. J. Kim, and H. Kim, "Robust sound event classification using LBP-HOG based bag-of-audio-words feature representation," in Proceedings of INTERSPEECH, Dresden, Germany, 2015, pp. 3325–3329.
[34] M. Schmitt, F. Ringeval, and B. Schuller, "At the border of acoustics and linguistics: Bag-of-audio-words for the recognition of emotions in speech," in Proceedings of INTERSPEECH, San Francisco, USA, 2016, pp. 495–499.
[35] M. Schmitt, C. Janott, V. Pandit, K. Qian, C. Heiser, W. Hemmert, and B. Schuller, "A bag-of-audio-words approach for snore sounds' excitation localisation," in Proceedings of ITG Speech Communication, Paderborn, Germany, 2016, pp. 230–234.
[36] M. Schmitt and B. W. Schuller, "openXBOW – Introducing the Passau Open-Source Crossmodal Bag-of-Words Toolkit," preprint arXiv:1605.06778, 2016.
[37] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: An update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.
[38] J. Platt, "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods," in Advances in Large Margin Classifiers, A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, Eds. Cambridge, MA: MIT Press, 1999, pp. 61–74.