Speaker discrimination and classification in breath noises by human listeners

Raphael Werner, Jürgen Trouvain, and Bernd Möbius

Language Science and Technology, Saarland University, Saarbrücken
Audible breath noises are frequent companions to speech, occurring roughly every 3 to 4 seconds [1, 2], and may also be present outside of speech during effortful actions [3]. Being a vital function, breathing is arguably less affected by speakers trying to disguise their voice, and neural networks have shown promising results on speaker identification based on breath noises [4, 5]. However, breathing has remained largely untapped for forensic purposes, with few exceptions (e.g. [6]). In this paper we investigate the potential of breath noises for speaker discrimination and classification by human listeners.
We annotated breath noises in dyadic conversations [7]. For high comparability and since they are most frequent around speech [8], we here use 5 audible oral (and probably simultaneously nasal) inhalations each from 6 younger (age range: 20–29; 3m, 3f) and 6 older (age range: 59–65; 3m, 3f) speakers. These noises were then used as stimuli in two tasks: 1) Discrimination task: participants heard two breath noises (separated by 500 ms of silence; 14 pairs per participant) and were asked whether they were produced by the same speaker or not. We also recorded participants' confidence on a 5-point Likert scale. 2) Speaker classification task: participants listened to one breath noise at a time (20 noises per participant) and were asked whether the breath noise was produced by a young vs. old and a male vs. female speaker and how confident they were in each of these answers. We recruited and paid 33 participants (22 f, 10 m, 1 other; age range: 20–71, median: 31), who reported wearing headphones in a quiet environment and having no hearing difficulties, via Prolific [9] and ran the experiment on Labvanced [10].
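The abstract does not detail how the 14 discrimination pairs were assembled. As a minimal sketch of one plausible design, assuming a balancing scheme over the same/different speaker, age, and sex combinations (the speaker inventory and all names below are illustrative, not the authors' actual materials), a trial list could be built like this:

```python
import itertools
import random

# Illustrative speaker inventory matching the paper's design:
# 6 younger and 6 older speakers, 3 male and 3 female per age group.
speakers = [
    {"id": f"{age}_{sex}{i}", "age": age, "sex": sex}
    for age in ("young", "old")
    for sex in ("m", "f")
    for i in (1, 2, 3)
]

def pair_type(a, b):
    """Label a speaker pair by same/different speaker, age, and sex."""
    if a["id"] == b["id"]:
        return "same_speaker"
    age = "same_age" if a["age"] == b["age"] else "diff_age"
    sex = "same_sex" if a["sex"] == b["sex"] else "diff_sex"
    return f"{age}_{sex}"

# Group all candidate speaker pairings by type ...
by_type = {}
for a, b in itertools.combinations_with_replacement(speakers, 2):
    by_type.setdefault(pair_type(a, b), []).append((a["id"], b["id"]))

# ... and sample a fixed number of trials per type, so that every
# combination (including different age with same sex) is covered.
random.seed(1)  # reproducible trial list
trials = []
for t in sorted(by_type):
    trials.extend(random.sample(by_type[t], 2))
random.shuffle(trials)
```

At the speaker level this yields five pair types (same speaker, plus the four age-sex combinations); in the actual experiment each trial would additionally draw two concrete breath-noise recordings, with same-speaker trials using two different noises from that speaker.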
The discrimination task was answered correctly at 64.3 % (sd: 11.8 %), with the lowest results for combinations of different age and same sex. In speaker classification, the speaker's age group was identified correctly at a rate of 50.2 % (sd: 9.1 %), essentially at chance for this binary decision, whereas for sex it was 66.7 % (sd: 13.5 %), clearly above chance. Confidence did not differ much between tasks or between sex and age in the classification task.
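As a small worked example of how such scores are computed (the response log below is invented for illustration; it is not the study's data), accuracy is taken per participant and then summarized by its mean and standard deviation across participants:

```python
from statistics import mean, stdev

# Invented toy responses: (given_answer, correct_answer) per trial,
# grouped by participant.
responses = {
    "p01": [("same", "same"), ("diff", "same"), ("diff", "diff")],
    "p02": [("same", "diff"), ("diff", "diff"), ("same", "same")],
    "p03": [("same", "same"), ("same", "same"), ("diff", "same")],
}

# Per-participant accuracy: share of trials answered correctly.
accuracy = {
    pid: sum(given == correct for given, correct in trials) / len(trials)
    for pid, trials in responses.items()
}

# Group-level summary in the form reported above (mean and sd).
print(f"mean: {mean(accuracy.values()):.1%}, sd: {stdev(accuracy.values()):.1%}")
```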
The results in both tasks suggest that sex differences are more perceivable than age differences. This general direction seems to be in line with regular speech [11], even though we used only ingressive, unphonated noises here and speaker age was a very coarse-grained distinction between two separate groups. That sex but not age was perceivable may be related to vocal tract length, which differs by sex [12, pp. 25–26]. Age differences may thus be audible when comparing children, whose vocal tracts are still growing, to adults. Although not very high, these numbers suggest that breath noises may be usable for forensic applications to some extent, especially given that each individual breath noise used here was only 300 to 1000 ms long. Including breathing patterns, rather than just one or two noises, may help uncover speaker-specific characteristics [13].
These findings have implications for naturalistic synthetic speech and how breath noises need to be matched to the artificial speaker to be perceived as natural. For forensic purposes, they indicate to what extent breath noises may be exploitable for speaker classification and discrimination tasks. It should be borne in mind, however, that all stimuli used here were recorded under the same setup and are thus highly comparable, whereas in real-world forensic applications many factors may complicate comparisons.
References
[1] Amélie Rochet-Capellan and Susanne Fuchs. The interplay of linguistic structure and breathing in German spontaneous speech. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pages 2014–2018, 2013.

[2] Laura Lund Kuhlmann and Jenny Iwarsson. Effects of Speaking Rate on Breathing and Voice Behavior. Journal of Voice, 2021.

[3] Jürgen Trouvain and Khiet P. Truong. Prosodic characteristics of read speech before and after treadmill running. In Interspeech 2015, pages 3700–3704. ISCA, 2015.

[4] I Sense You by Breath: Speaker Recognition via Breath Biometrics. IEEE Transactions on Dependable and Secure Computing, 17(2):306–319, 2020.

[5] Wenbo Zhao, Yang Gao, and Rita Singh. Speaker identification from the sound of the human breath. CoRR, abs/1712.00171, 2017.

[6] Miriam Kienast and Florian Glitza. Respiratory sounds as an idiosyncratic feature in speaker recognition. In Proceedings of the 15th ICPhS, pages 1607–1610, Barcelona, 2003.

[7] R. J. J. H. van Son, Wieneke Wesseling, Eric Sanders, and Henk van den Heuvel. The IFADV corpus: A free dialog video corpus. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), pages 501–508, 2008.

[8] Rosemary A. Lester and Jeannette D. Hoit. Nasal and oral inspiration during natural speech breathing. Journal of Speech, Language, and Hearing Research, 57(3):734–742, 2014.

[9] Prolific, 2014. Accessed: 17/05/2022.

[10] Holger Finger, Caspar Goeke, Dorena Diekamp, Kai Standvoß, and Peter König. LabVanced: A Unified JavaScript Framework for Online Studies. In 2017 International Conference on Computational Social Science (IC2S2), 2017.

[11] Michael Jessen. Speaker classification in forensic phonetics and acoustics. In Christian Müller, editor, Speaker Classification I: Fundamentals, Features, and Methods, pages 180–204. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007.

[12] Kenneth N. Stevens. Acoustic Phonetics, volume 30. MIT Press, 2000.

[13] Hélène Serré, Marion Dohen, Susanne Fuchs, Silvain Gerber, and Amélie Rochet-Capellan. Speech breathing: variable but individual over time and according to limb movements. Annals of the New York Academy of Sciences, 1505(1):142–155, 2021.
P&P 18, 6–7 October 2022, Bielefeld. Licensed under CC BY 4.0. doi:10.11576/pundp2022-1050