Article

Inferring speakers’ physical attributes from their voices

Authors:
Robert M. Krauss, Robin Freyberg, Ezequiel Morsella

Abstract

Two experiments examined listeners’ ability to make accurate inferences about speakers from the nonlinguistic content of their speech. In Experiment I, naïve listeners heard male and female speakers articulating two test sentences, and tried to select which of a pair of photographs depicted the speaker. On average they selected the correct photo 76.5% of the time. All performed at a level that was reliably better than chance. In Experiment II, judges heard the test sentences and estimated the speakers’ age, height, and weight. A comparison group made the same estimates from photographs of the speakers. Although estimates made from photos are more accurate than those made from voice, for age and height the differences are quite small in magnitude—a little more than a year in age and less than a half inch in height. When judgments are pooled, estimates made from photos are not uniformly superior to those made from voices.


... Existing evaluations of the face-voice matching task in humans [21,25,39] adopted two-alternative forced-choice (2AFC) evaluation. Thus, we also used it for comparison with previous work. ...
... Previous work in cognitive science may not be suitable for determining human performance, because its motivation was to reveal whether human accuracy on the face-voice matching task is above chance, rather than to specify the accuracy. We show the human performance reported in [21,25] in Table 3. Mavica et al. [25] used the same experimental settings as [39], except for the dataset, and achieved an accuracy of 57.0%, which is 4% lower than the results in [39]. The 95% confidence intervals for the results of [25,39] indicate that the number of trials is too small to determine human performance; Smith et al. conducted only 360 trials (12 trials per participant × 30 participants), and Mavica et al. conducted 1,600 trials (64 per participant × 25 participants). ...
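The argument above turns on how wide the 95% confidence intervals around these accuracies are, given the trial counts. As a rough illustration (not the analysis used in the cited papers), the following Python sketch computes normal-approximation intervals; the 61% figure for Smith et al. is only an assumption inferred from the "4% lower" remark.

import math

def wald_ci(p_hat, n, z=1.96):
    # Normal-approximation 95% confidence interval for a binomial proportion.
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

# Trial counts taken from the citation context above; the Smith et al. accuracy
# is an assumption used purely for illustration.
for label, p, n in [("Mavica et al.", 0.570, 1600), ("Smith et al. (assumed)", 0.610, 360)]:
    lo, hi = wald_ci(p, n)
    print(f"{label}: {p:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")

With only 360 trials, the interval spans roughly ±5 percentage points, which is why a point estimate from such a sample pins down human performance only loosely.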
... Face-voice matching accuracy might be improved by using body images or motions. Krauss et al. [21] used whole-body images for the matching task. Table 3: 2AFC evaluation results of face-voice matching. In the "Train" column, we show the dataset used for training the face-voice matching model in Figure 1, where "M" denotes a set of male samples and "F" denotes a set of female samples. ...
Conference Paper
Face-voice matching is the task of finding the correspondence between faces and voices. Many studies in cognitive science have confirmed the human ability to perform face-voice matching. Such an ability is useful for creating natural human-machine interaction systems and in many other applications. In this paper, we propose a face-voice matching model that learns cross-modal embeddings between face images and voice characteristics. We constructed a novel FVCeleb dataset which consists of face images and utterances from 1,078 persons. These persons were selected from the MS-Celeb-1M face image dataset and the VoxCeleb audio dataset. In a two-alternative forced-choice matching task with an audio input and two face-image candidates of the same gender, our model achieved 62.2% and 56.5% accuracy on the FVCeleb dataset and a subset of the GRID corpus, respectively. These results are very similar to human performance reported in cognitive science studies.
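The abstract above describes scoring a two-alternative forced-choice task from learned cross-modal embeddings. The sketch below is not the authors' implementation; it only illustrates, with random stand-in embeddings, how 2AFC accuracy can be computed by picking whichever candidate face embedding is closer (by cosine similarity) to the voice embedding.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def two_afc_accuracy(voice_emb, face_emb_true, face_emb_foil):
    # A trial counts as correct when the voice embedding is closer to the
    # true face embedding than to the foil embedding.
    correct = 0
    for v, f_true, f_foil in zip(voice_emb, face_emb_true, face_emb_foil):
        correct += cosine(v, f_true) > cosine(v, f_foil)
    return correct / len(voice_emb)

# Toy example with random 128-d embeddings standing in for model outputs.
rng = np.random.default_rng(0)
v = rng.normal(size=(100, 128))
f_true = v + 0.5 * rng.normal(size=(100, 128))   # correlated with the voice
f_foil = rng.normal(size=(100, 128))             # unrelated foil
print(two_afc_accuracy(v, f_true, f_foil))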
... From the field of social psychology we know that listeners constantly make inferences about speakers based on the (non-linguistic) content of speech, engaging in what is called person or speaker perception (Krauss & Pardo, 2006). Listener attributions may range from social status (Brown, Strong, & Rencher, 1975) and emotion (Scherer, 2003) to metacognitive states (Brennan & Williams, 1995) and even to physical properties of a speaker (Krauss, Freyberg, & Morsella, 2002). Nevertheless, it is as yet unknown how the fluency characteristics of native speech contribute to the perception of a native speaker's fluency level. ...
... From the literature on social psychology (Brown et al., 1975;Krauss & Pardo, 2006), we know that listeners assess the speech of others on an everyday basis. People make attributions about speakers' social status, background and even physical properties (Krauss et al., 2002;Krauss & Pardo, 2006). Our results show that individual differences between native speakers in their production of disfluencies carry consequences for listeners' perceptions of a native speaker's fluency level. ...
... This raises the question whether the similarity of native and non-native fluency perception also applies when listeners assess fluency in the broad sense (for instance, in fluency assessment without any instructions on what comprises fluency). This question is very much relevant for everyday situations in which interlocutors in a conversation draw inferences about the other (native or non-native) speaker's social status (Brown et al., 1975), emotion (Scherer, 2003), physical properties (Krauss et al., 2002), metacognitive state (Brennan & Williams, 1995), fluency level (e.g., Chapter 2 and 3), etc. Listeners' considerations in these spontaneous, uncontrolled situations have been under-investigated in the literature and future studies may find ways of tapping listeners' underlying deliberations in these situations. Until that time, it is uncertain whether our conclusions about native and non-native fluency in the narrow sense generalize to situations without clearly formulated fluency assessment instructions. ...
... Research has shown that human listeners can draw inferences about body characteristics of a speaker based solely on hearing the target's voice [42,55,69]. In [42], voice-based estimates of the waist-to-hip ratio (WHR) of female speakers predicted the speakers' actual WHR, and the estimated shoulder-to-hip ratio (SHR) of male speakers predicted the speakers' actual SHR measurements. ...
... In [42], voice-based estimates of the waist-to-hip ratio (WHR) of female speakers predicted the speakers' actual WHR, and the estimated shoulder-to-hip ratio (SHR) of male speakers predicted the speakers' actual SHR measurements. In another study, human evaluators estimated the body height and weight of strangers from a voice recording almost as well as they did from a photograph [55]. ...
... While there is a rich and growing body of research to support the above statement, it has to be acknowledged that many of the studies cited in this paper achieved their classification results under ideal laboratory conditions (e.g., scripted speech, high quality microphones, close-capture recordings, no background noise) [10,20,30,36,55,60,70,82,94,107], which may raise doubt about the generalizability of their inference methods. Also, while impressive accuracies have been reached, it should not be neglected that nearly all of the mentioned approaches still exhibit considerable error rates. ...
Chapter
Full-text available
Internet-connected devices, such as smartphones, smartwatches, and laptops, have become ubiquitous in modern life, reaching ever deeper into our private spheres. Among the sensors most commonly found in such devices are microphones. While various privacy concerns related to microphone-equipped devices have been raised and thoroughly discussed, the threat of unexpected inferences from audio data remains largely overlooked. Drawing from literature of diverse disciplines, this paper presents an overview of sensitive pieces of information that can, with the help of advanced data analysis methods, be derived from human speech and other acoustic elements in recorded audio. In addition to the linguistic content of speech, a speaker's voice characteristics and manner of expression may implicitly contain a rich array of personal information, including cues to a speaker's biometric identity, personality, physical traits, geographical origin, emotions, level of intoxication and sleepiness, age, gender, and health condition. Even a person's socioeconomic status can be reflected in certain speech patterns. The findings compiled in this paper demonstrate that recent advances in voice and speech processing induce a new generation of privacy threats.
... Such a crossmodal matching task typically consists of four sequential phases per trial: All stimuli (a reference stimulus in one modality prior to two sequentially presented comparison stimuli in the other modality) are presented one after another, followed by a final decision phase (typically a two-alternative forced choice). The sequence of modalities can be varied (visual or auditory reference stimulus first), and the stimuli consisted either of single words (e.g., Lachs, 1999;Lachs and Pisoni, 2004a,b) or sentences (e.g., Krauss et al., 2002;Smith et al., 2016b). Usually, such a design yielded only chance performance when static pictures of faces were used (but see Mavica and Barenholtz, 2013, Experiment 2, for a notable exception), but above-chance performance when dynamic faces were used as stimuli. ...
... Importantly, however, studies using simultaneous presentation of at least the comparison stimuli revealed clear above-chance matching performance for static faces (Krauss et al., 2002;Mavica and Barenholtz, 2013;Smith et al., 2016b, Experiment 3). Among these studies, Krauss et al. (2002) presented whole bodies of persons (instead of faces), and only Mavica and Barenholtz (2013, Experiment 1) used a completely simultaneous setting, in which the voice stimulus and two visual face stimuli were presented at once. ...
Article
Full-text available
Previous research has demonstrated that humans are able to match unfamiliar voices to corresponding faces and vice versa. It has been suggested that this matching ability might be based on common underlying factors that have a characteristic impact on both faces and voices. Some researchers have additionally assumed that dynamic facial information might be especially relevant to successfully match faces to voices. In the present study, static and dynamic face-voice matching ability was compared in a simultaneous presentation paradigm. Additionally, a procedure (matching additionally supported by incidental association learning) was implemented which allowed for reliably excluding participants that did not pay sufficient attention to the task. A comparison of performance between static and dynamic face-voice matching suggested a lack of substantial differences in matching ability, suggesting that dynamic (as opposed to mere static) facial information does not contribute meaningfully to face-voice matching performance. Importantly, this conclusion was not merely derived from the lack of a statistically significant group difference in matching performance (which could principally be explained by assuming low statistical power), but from a Bayesian analysis as well as from an analysis of the 95% confidence interval (CI) of the actual effect size. The extreme border of this CI suggested a maximally plausible dynamic face advantage of less than four percentage points, which was considered way too low to indicate any theoretically meaningful dynamic face advantage. Implications regarding the underlying mechanisms of face-voice matching are discussed.
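The reasoning above relies on the 95% confidence interval of the effect size (the difference between dynamic and static matching accuracy) rather than on a significance test alone. A minimal Python sketch of such an interval for a difference between two independent proportions follows; the accuracies and trial counts are hypothetical, not the study's data.

import math

def diff_ci(p1, n1, p2, n2, z=1.96):
    # Wald 95% CI for the difference between two independent proportions.
    d = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return d - z * se, d + z * se

# Hypothetical accuracies: dynamic vs. static face-voice matching.
lo, hi = diff_ci(0.60, 800, 0.58, 800)
print(f"dynamic minus static: 95% CI [{lo:.3f}, {hi:.3f}]")
# The upper bound is the largest dynamic-face advantage still compatible with the data.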
... This prediction was indeed corroborated by studies which involve a simultaneous display of the visual comparison stimuli (note that this procedure is not feasible for auditory comparison stimuli, since simultaneous presentation of two audio tracks will severely hamper the processing of individual voice features): Here, above-chance matching performance was found, and this result was replicated across different research groups (Krauss, Freyberg, & Morsella 2002;Smith et al., 2016a, Exp. 3). Note that some of these studies had specific characteristics: For example, Krauss et al. (2002) displayed not only faces, but complete bodies of model persons. Mavica and Barenholtz (2013, Experiment 1) introduced a special design where not only the two comparison stimuli, but also the reference stimulus were presented simultaneously. ...
... The sequence of modalities can be varied (visual or auditory reference stimulus first), and the stimuli consisted either of single words (e.g., Lachs, 1999;Lachs & Pisoni, 2004) or sentences (e.g., Smith et al., 2016a;Krauss, Freyberg, & Morsella, 2002). Usually, such a design yielded only chance performance when static pictures of faces were used (but see Mavica & Barenholtz, 2013, Experiment 2, for a notable exception), but above-chance performance when dynamic faces were used as stimuli. ...
Thesis
Full-text available
The present thesis addresses cognitive processing of voice information. Based on general theoretical concepts regarding mental processes it will differentiate between modular, abstract information processing approaches to cognition and interactive, embodied ideas of mental processing. These general concepts will then be transferred to the context of processing voice-related information in the context of parallel face-related processing streams. One central issue here is whether and to what extent cognitive voice processing can occur independently, that is, encapsulated from the simultaneous processing of visual person-related information (and vice versa). In Study 1 (Huestegge & Raettig, in press), participants are presented with audio-visual stimuli displaying faces uttering digits. Audiovisual gender congruency was manipulated: There were male and female faces, each uttering digits with either a male or female voice (all stimuli were AV- synchronized). Participants were asked to categorize the gender of either the face or the voice by pressing one of two keys in each trial. A central result was that audio-visual gender congruency affected performance: Incongruent stimuli were categorized slower and more error-prone, suggesting a strong cross-modal interaction of the underlying visual and auditory processing routes. Additionally, the effect of incongruent visual information on auditory classification was stronger than the effect of incongruent auditory information on visual categorization, suggesting visual dominance over auditory processing in the context of gender classification. A gender congruency effect was also present under high cognitive load. Study 2 (Huestegge, Raettig, & Huestegge, in press) utilized the same (gender-congruent and -incongruent) stimuli, but different tasks for the participants, namely categorizing the spoken digits (into odd/even or smaller/larger than 5). This should effectively direct attention away from gender information, which was no longer task-relevant. Nevertheless, congruency effects were still observed in this study. This suggests a relatively automatic processing of cross-modal gender information, which eventually affects basic speech-based information processing. Study 3 (Huestegge, subm.) focused on the ability of participants to match unfamiliar voices to (either static or dynamic) faces. One result was that participants were indeed able to match voices to faces. Moreover, there was no evidence for any performance increase when dynamic (vs. mere static) faces had to be matched to concurrent voices. The results support the idea that common person-related source information affects both vocal and facial features, and implicit corresponding knowledge appears to be used by participants to successfully complete face-voice matching. Taken together, the three studies (Huestegge, subm.; Huestegge & Raettig, in press; Huestegge et al., in press) provided information to further develop current theories of voice processing (in the context of face processing). On a general level, the results of all three studies are not in line with an abstract, modular view of cognition, but rather lend further support to interactive, embodied accounts of mental processing.
... For conversion from face to voice, we put a focus on subjective impressions of face and voice. In [8][9][10][11][12], it was implied that, based on subjective impressions of the face of a person, human subjects can imagine the voice quality of that person. It is very natural, however, that the real voice quality of that person is different from the imagined voice quality. ...
... In Eq. (8), Θ can be initialized by pCCA parameters θ which can also be calculated by using Eq. (8), and θ can be initialized by deterministic values proposed in [21]. ...
... Faces and voices provide a range of overlapping information, including cues to attractiveness, masculinity, femininity, and health (Collins & Missing, 2003; Saxton et al., 2006; Smith et al., 2016a). Several studies have consequently demonstrated that it is possible to match unfamiliar faces and voices across modality with low, but above chance, accuracy (Krauss et al., 2002; Mavica & Barenholtz, 2013; Smith et al., 2016a, 2016b; Stevenage et al., 2017). Overall, performance is more consistent when matching voices to dynamic faces compared with static faces (Kamachi et al., 2003; Smith et al., 2016b). ...
... Previous face-voice matching studies (e.g., Krauss et al., 2002; Mavica & Barenholtz, 2013; Smith et al., 2016a, 2016b; Stevenage et al., 2017) have sampled between-person variability, presenting several identities across multiple trials. ...
Article
Full-text available
Unimodal and cross-modal information provided by faces and voices contribute to identity percepts. To examine how these sources of information interact, we devised a novel audio-visual sorting task in which participants were required to group video-only and audio-only clips into two identities. In a series of three experiments, we show that unimodal face and voice sorting were more accurate than cross-modal sorting: While face sorting was consistently most accurate, followed by voice sorting, cross-modal sorting was at chance level or below. In Experiment 1, we compared performance in our novel audio-visual sorting task to a traditional identity matching task, showing that unimodal and cross-modal identity perception were overall moderately more accurate than the traditional identity matching task. In Experiment 2, separating unimodal from cross-modal sorting led to small improvements in accuracy for unimodal sorting, but no change in cross-modal sorting performance. In Experiment 3, we explored the effect of minimal audio-visual training: Participants were shown a clip of the two identities in conversation prior to completing the sorting task. This led to small, nonsignificant improvements in accuracy for unimodal and cross-modal sorting. Our results indicate that unfamiliar face and voice perception operate relatively independently with no evidence of mutual benefit, suggesting that extracting reliable cross-modal identity information is challenging.
... To determine whether AV speech produced faster detection for each facial condition, we evaluated the difference between response times in the AV mode minus the fastest unisensory mode as per the fixed favored dimension model for multidimensional stimuli (e.g., Biederman & Checkosky, 1970;Mordkoff & Yantis, 1993;Stevenson et al., 2014). Both the dynamic and static faces were viewed as multidimensional AV stimuli because individuals can accurately match unfamiliar voices to both dynamic and static unfamiliar faces well above chance; this pattern of results indicates that voices share source-identity information with both types of faces (Krauss, Freyberg, & Morsella, 2002;Mavica & Barenholtz, 2013;H. Smith, Dunn, Baguley, & Stacey, 2016a, 2016b; but see Lachs & Pisoni, 2004). ...
... Performance in 6- to 7-year-olds did not show any influence of either type of face, but performance in 8- to 10-year-olds revealed the minimization of attentional lapses by AV static facial input, an effect that may reflect the simultaneous or correlated onsets interacting to produce a more emphatic onset-alerting signal. As noted previously, voices share source-identity information with both the dynamic and static faces (Krauss et al., 2002; Mavica & Barenholtz, 2013; H. Smith et al., 2016a ...
Article
Purpose: Successful speech processing depends on our ability to detect and integrate multisensory cues, yet there is minimal research on multisensory speech detection and integration by children. To address this need, we studied the development of speech detection for auditory (A), visual (V), and audiovisual (AV) input. Method: Participants were 115 typically developing children clustered into age groups between 4 and 14 years. Speech detection (quantified by response times [RTs]) was determined for 1 stimulus, /buh/, presented in A, V, and AV modes (articulating vs. static facial conditions). Performance was analyzed not only in terms of traditional mean RTs but also in terms of the faster versus slower RTs (defined by the 1st vs. 3rd quartiles of RT distributions). These time regions were conceptualized respectively as reflecting optimal detection with efficient focused attention versus less optimal detection with inefficient focused attention due to attentional lapses. Results: Mean RTs indicated better detection (a) of multisensory AV speech than A speech only in 4- to 5-year-olds and (b) of A and AV inputs than V input in all age groups. The faster RTs revealed that AV input did not improve detection in any group. The slower RTs indicated that (a) the processing of silent V input was significantly faster for the articulating than static face and (b) AV speech or facial input significantly minimized attentional lapses in all groups except 6- to 7-year-olds (a peaked U-shaped curve). Apparently, the AV benefit observed for mean performance in 4- to 5-year-olds arose from effects of attention. Conclusions: The faster RTs indicated that AV input did not enhance detection in any group, but the slower RTs indicated that AV speech and dynamic V speech (mouthing) significantly minimized attentional lapses and thus did influence performance. Overall, A and AV inputs were detected consistently faster than V input; this result endorsed stimulus-bound auditory processing by these children.
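The quartile-based analysis described above (1st vs. 3rd quartiles of the RT distributions as proxies for optimal detection vs. attentional lapses) can be summarized in a few lines of pandas. The data frame below is entirely hypothetical and only shows the shape of such a summary, not the study's data.

import pandas as pd

# Hypothetical trial-level data: one detection RT (ms) per trial,
# with presentation mode (A, V, AV) and an age-group label.
df = pd.DataFrame({
    "age_group": (["4-5"] * 2 + ["8-10"] * 2) * 50,
    "mode": ["A", "AV", "V", "AV"] * 50,
    "rt_ms": [350 + 1.5 * i for i in range(200)],
})

# Mean RT plus the 1st and 3rd quartiles per age group and mode,
# mirroring the "faster" vs. "slower" RT regions described above.
summary = (df.groupby(["age_group", "mode"])["rt_ms"]
             .agg(mean="mean",
                  q1=lambda s: s.quantile(0.25),
                  q3=lambda s: s.quantile(0.75)))
print(summary)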
... Focusing on the transcripts also eliminates other information that the auditory channel would have included, such as the tone, emphasis, and volume of vocal output, which may be associated with perceptions of and attributions about the speakers (Hochschild, 1983;Krauss et al., 2002). ...
... However, isolating elements of the presentations to text transcripts neglects the possibility that having a fuller set of auditory information could be most relevant to evaluators. For example, the auditory channel could include additional information that people might value beyond the content of the business plans, such as which parts of the pitch entrepreneurs emphasized through their voices and thus valued more (Hochschild, 1983;Krauss et al., 2002). In addition, although the use of the same sections from each pitch across conditions controlled for some potential confounds, it could be that those sections were unequally informative in terms of content, resulting in distortions in participant choices toward the visual. ...
Article
Entrepreneurs and investors often deem substantive content to be particularly important as they evaluate the potential value of business propositions. Yet across 12 studies and 1,855 participants using live entrepreneurial pitch competitions, silent videos—but not sound recordings, video-with-sound recordings, or pitch transcriptions—best allowed both experts and novices to identify the original investors' selections of winning entrepreneurial pitches. These results suggest that people’s judgment may be highly influenced by visual information. Further, people do not seem to fully recognize how much visual information factors into their decisions, such that they neglect the more substantive metrics that they explicitly cite and value as core to their decisions. The findings highlight the power of dynamic visual cues—including gestures, facial expressions, and body language—and demonstrate that visible passion can dominate the content of business propositions in entrepreneurial pitch competitions.
... Voices are sometimes described as 'auditory faces', as they carry a wealth of information related to physical and personal characteristics of others [6]. Humans accurately estimate characteristics such as age, weight and height through listening to a voice alone [7,8]. Krauss et al. [7] showed that people match vocal to facial identity in pictures with above 75% accuracy, and estimation of personal characteristics from voices is as accurate as when inspecting photographs. ...
... It is plausible that negative associations originating from past relationships could make people more vigilant and prejudiced towards specific kinds of voices. People seem to be accurate at judging the physical characteristics of others from listening to their voices alone, which might be associated with the connection of voices and physical characteristics learnt through past relationships [8]-for example, by learning to associate more negative and dominant features with certain types of voices. In psychopathology, life experiences have been shown to be strongly associated with the content of hallucinated voices, particularly in relation to childhood trauma (for a review, see [62]). ...
Article
Full-text available
People rapidly make first impressions of others, often based on very little information: minimal exposure to faces or voices is sufficient for humans to make up their mind about the personality of others. While there has been considerable research on voice personality perception, much less is known about its relevance to hallucination-proneness, despite auditory hallucinations being frequently perceived as personified social agents. The present paper reports two studies investigating the relation between voice personality perception and hallucination-proneness in non-clinical samples. A voice personality perception task was created, in which participants rated short voice recordings on four personality characteristics, relating to dimensions of the voice's perceived Valence and Dominance. Hierarchical regression was used to assess contributions of Valence and Dominance voice personality ratings to hallucination-proneness scores, controlling for paranoia-proneness and vividness of mental imagery. Results from Study 1 suggested that high ratings of voices as dominant might be related to high hallucination-proneness; however, this relation seemed to be dependent on reported levels of paranoid thinking. In Study 2, we show that hallucination-proneness was associated with high ratings of voice dominance, and this was independent of paranoia and imagery abilities scores, both of which were found to be significant predictors of hallucination-proneness. Results from Study 2 suggest an interaction between the gender of participants and the gender of the voice actor, where only ratings of own-gender voices on Dominance characteristics are related to hallucination-proneness scores. These results are important for understanding the perception of characterful features of voices and its significance for psychopathology.
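Since the abstract specifies a hierarchical regression (voice Valence and Dominance ratings entered after the control variables paranoia-proneness and imagery vividness), the following sketch shows the general form of such an analysis with statsmodels. All variable names and the simulated data are placeholders, not the study's data.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical participant-level data.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "paranoia": rng.normal(size=100),
    "imagery": rng.normal(size=100),
    "valence": rng.normal(size=100),
    "dominance": rng.normal(size=100),
})
df["hallucination_proneness"] = (0.4 * df["paranoia"] + 0.2 * df["dominance"]
                                 + rng.normal(scale=0.8, size=100))

# Step 1: control variables only; Step 2: add the voice personality ratings.
step1 = sm.OLS(df["hallucination_proneness"],
               sm.add_constant(df[["paranoia", "imagery"]])).fit()
step2 = sm.OLS(df["hallucination_proneness"],
               sm.add_constant(df[["paranoia", "imagery", "valence", "dominance"]])).fit()
print(f"R2 step 1 = {step1.rsquared:.3f}, R2 step 2 = {step2.rsquared:.3f}, "
      f"delta R2 = {step2.rsquared - step1.rsquared:.3f}")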
... But while the voice does not strictly identify a person, it nevertheless remains an important channel of information about a person's identity: from someone's voice we readily project their gender, sometimes their age or even their height, as well as their origin or social class (Krauss et al., 2002). Thus, in a study by Thomas Shipp and Harry Hollien (1969), the authors observe that people are able to estimate a speaker's age to within five years from a single sentence. ...
Thesis
Some systems built with machine learning, through their data and the unexamined assumptions they encapsulate, contribute to reproducing social inequalities, feeding a discourse on the "biases of artificial intelligence". This thesis aims to contribute to the collective reflection on the biases of automatic systems by questioning the existence of gender bias in automatic speech recognition (ASR) systems. Thinking about the impact of such systems requires articulating the notions of bias (relating to how the system and its data are constituted) and of discrimination, which is defined in each country's legislation. A system is considered discriminatory when it treats people differently on the basis of criteria regarded as breaking the social contract. In France, sex and gender identity are among the 23 criteria protected by law. After a theoretical reflection on the notions of bias, in particular predictive bias (or performance bias) and selection bias, we propose a set of experiments to try to understand the links between selection bias in the training data and the predictive bias of the system. We base our work on the study of an HMM-DNN system trained on French-language media corpora and of an end-to-end system trained on audiobooks in English. We observe that a large gender selection bias in the training data contributes only partially to the predictive bias of the ASR system, but that the latter nevertheless emerges when the speech data combine different speaking situations and different speaker roles. This work also led us to question the representation of women in the data and, more generally, to rethink the links between theoretical conceptions of gender and ASR systems.
... In addition, we make our code and results available online. ... Voice is considered to be one of the unique pieces of biometric information that has been widely used in various IoT applications. It is a rich resource that discloses several possible states of a speaker, such as emotional state [20], confidence and stress levels, physical condition [15,20,21], age [9], gender, and personal traits. For example, Mairesse et al. [11] proposed classification, regression and ranking models to learn the Big Five personality traits of a speaker. ...
Preprint
Voice-controlled devices and services have become very popular in the consumer IoT. Cloud-based speech analysis services extract information from voice inputs using speech recognition techniques. Service providers can thus build very accurate profiles of users' demographic categories, personal preferences, emotional states, etc., and may therefore significantly compromise their privacy. To address this problem, we have developed a privacy-preserving intermediate layer between users and cloud services to sanitize voice input directly at edge devices. We use CycleGAN-based speech conversion to remove sensitive information from raw voice input signals before regenerating neutralized signals for forwarding. We implement and evaluate our emotion filtering approach using a relatively cheap Raspberry Pi 4, and show that performance accuracy is not compromised at the edge. In fact, signals generated at the edge differ only slightly (~0.16%) from cloud-based approaches for speech recognition. Experimental evaluation of the generated signals shows that identification of the emotional state of a speaker can be reduced by ~91%.
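The preprint above routes raw audio through a sanitization step at the edge before forwarding it to the cloud. The sketch below is only a toy stand-in for that idea: it substitutes a simple pitch shift for the CycleGAN-based voice conversion the authors actually use, and the function name and file paths are assumptions for illustration.

import librosa
import soundfile as sf

def sanitize(in_path: str, out_path: str, n_steps: float = 2.0) -> None:
    # Toy edge "sanitizer": load raw speech, apply a placeholder voice
    # transformation (a pitch shift here; the cited work uses CycleGAN-based
    # conversion), and write the neutralized signal for forwarding.
    y, sr = librosa.load(in_path, sr=None)
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    sf.write(out_path, y_shifted, sr)

# Hypothetical usage on a local recording before it leaves the device.
# sanitize("raw_voice.wav", "sanitized_voice.wav")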
... Studying the aforementioned features, as well as the general perception of the voice independently of its acoustic properties, numerous studies have established that the human voice plays an important role in revealing a speaker's characteristics. For example, the voice can be an indicator of sex (with as much as 96% accuracy) (Lass, Hughes, Bowyer, Waters, & Bourne, 1976), of body size and height (Feinberg, Jones, Little, Burt, & Perrett, 2005; Pisanski et al., 2016; Puts et al., 2012), of age (Collins & Missing, 2003; Krauss, Freyberg, & Morsella, 2002; Feinberg et al., 2005), of emotional state (Banse & Scherer, 1996), and even of facial (a)symmetry (Hill et al., 2017). Furthermore, people are successful at recognizing the identity of others on the basis of the voice. ...
Article
Full-text available
The "Cognitive Science" conference was held at the Jožef Stefan Institute in Ljubljana on 11 October 2018 as part of the 21st International Multiconference "Information Society". The "Cognitive Science" conference was organized by the Slovenian Society for Cognitive Science. The aim of the conference was to connect experts from the various disciplines concerned with cognition and to enable the exchange of diverse and challenging ideas.
... Several kinds of sensitive information have been extracted from voice input, such as emotions [26] and health state [20,24]. For example, the age, height, and weight of a speaker can be predicted based solely on hearing his or her voice [11]. Further, the physical strength of individuals, especially men, can be assessed based only on hearing the sound of their voice [25]. ...
Preprint
Voice-enabled interactions provide more human-like experiences in many popular IoT systems. Cloud-based speech analysis services extract useful information from voice input using speech recognition techniques. The voice signal is a rich resource that discloses several possible states of a speaker, such as emotional state, confidence and stress levels, physical condition, age, gender, and personal traits. Service providers can build a very accurate profile of a user's demographic category and personal preferences, and may thereby compromise privacy. To address this problem, a privacy-preserving intermediate layer between users and cloud services is proposed to sanitize the voice input. It aims to maintain utility while preserving user privacy. It achieves this by collecting real-time speech data and analyzing the signal to ensure privacy protection prior to sharing this data with service providers. Specifically, the sensitive representations are extracted from the raw signal using transformation functions and then wrapped via voice conversion technology. Experimental evaluation based on emotion recognition to assess the efficacy of the proposed method shows that identification of the sensitive emotional state of the speaker is reduced by ~96%.
Article
Full-text available
Previous studies have shown that the human voice has an important role in communicating different traits, by implying a speaker's sex, age, physical height, etc. Studies have also found correlations between various vocal characteristics and perceived personality traits. For example, there is evidence that higher pitch is positively related to perceived femininity, while lower pitch is related to perceived dominance. The aim of the present study was to investigate those relationships between voice and personality, by focusing on women's self-reports of masculinity, femininity, dominance and affiliation. 48 women were recorded three times during vowel /a/ production. After acoustic analysis, it was found that voice pitch was not related to personality traits. On the contrary, pitch variability was negatively related to masculinity, and positively to femininity. Furthermore, shimmer was positively, and harmonics-to-noise ratio negatively, related to self-reported masculinity. Further regression analyses confirmed the contribution of pitch variability and shimmer in explaining individual differences in masculinity. Besides the interpretation of the results in the context of previous findings, we discuss possible directions for future research in order to improve research methodology.
... Apart from conveying traits such as gender or personality, hearing someone's voice for the first time can also make us form a mental image of what that person might look like. For example, it is a common experience to observe "a speaker whose voice is familiar ... and being surprised by that person's appearance" [22]. This confusion, caused by the mismatch between expectation and reality, may adversely affect people's ability to establish common ground with robots since, according to Kiesler, a key ingredient for achieving common ground is "to create in people's minds an appropriate ...
Conference Paper
Full-text available
It is well established that a robot's visual appearance plays a significant role in how it is perceived. Considerable time and resources are usually dedicated to help ensure that the visual aesthetics of social robots are pleasing to users and helps facilitate clear communication. However, relatively little consideration is given to how the voice of the robot should sound, which may have adverse effects on acceptance and clarity of communication. In this study, we explore the mental images people form when they hear robots speaking. In our experiment, participants listened to several voices, and for each voice they were asked to choose a robot, from a selection of eight commonly used social robot platforms, that was best suited to have that voice. The voices were manipulated in terms of naturalness, gender, and accent. Results showed that a) participants seldom matched robots with the voices that were used in previous HRI studies, b) the gender and naturalness vocal manipulations strongly affected participants' selection, and c) the linguistic content of the utterances spoken by the voices does not affect people's selection. This finding suggests that people associate voices with robot pictures, even when the content of spoken utterances was unintelligible. Our findings indicate that both a robot's voice and its appearance contribute to robot perception. Thus, giving a mismatched voice to a robot might introduce a confounding effect in HRI studies. We therefore suggest that voice design should be considered more thoroughly when planning spoken human-robot interactions.
... These accents are often perceived as associated with a lower socioeconomic status (Giles & Billings, 2004; Ohama et al., 2000). It is possible that speakers with different accents may share the same grammar, syntax, and lexicon but still sound different in their usage of language, thus leading to different evaluations by listeners (Giles, 1970; Fuertes et al., 2011; Krauss, Freyberg, & Morsella, 2002). Accents have been used to evaluate personality types, variations in language use, compliance gaining, social decision making, and other listener behavioral reactions towards accented speakers (for a review, see Giles & Billings, 2004). ...
Thesis
Full-text available
The present study used social identity theory as a framework for examining the evaluation of non-standard accented speakers from India and Nigeria whose first language is English. Social identity theory explains one's awareness that he/she is a member of a certain social group and that such group membership is of value to the individual. Accordingly, the study investigated how social identity influences listeners' perceptions of non-standard accented speakers' status, solidarity, and dynamism, and also whether Standard American English (SAE), Indian, and Nigerian accents are perceived differently by listeners. A 3 (SAE, Indian accented English, and Nigerian accented English) × 2 (introduction and no introduction) design was employed. 115 participants from an urban university in the United States completed an online survey. Participants were randomly assigned to listen to one of six speech samples in experimental conditions (SAE, Indian accent, Nigerian accent, SAE with introduction, Indian accent with introduction, and Nigerian accent with introduction). It was found that the SAE, Indian, and Nigerian accents were not evaluated significantly differently in perceived status and dynamism. However, the three accents were evaluated differently in perceived solidarity: the Indian and Nigerian accents were rated higher on solidarity than SAE. Social identity did not play a significant role in the evaluation of the accents. The implications of this study are discussed in terms of accent attractiveness, interpersonal contact, stereotypes, and language attitudes.
... The voice can convey a great deal of information about a speaker, such as gender (Mullenix, Johnson, Topcudurgun, & Farnsworth, 1995), age (Ringel & Chodzko-Zajko, 1987;Zäske & Schweinberger, 2011), height and weight (Krauss, Freyberg, & Morsella, 2002), emotions (Skuk & Schweinberger, 2013a), social status (Harms, 1961), and personality traits (Zuckerman & Driver, 1989). Voice attractiveness refers to the attractiveness of a person's voice (Zuckerman & Driver, 1989). ...
Article
Objective The aim of the present study was to explore whether people consider their own voice to be more attractive than the voices of others and whether the self-enhancement bias for one's own voice could be generalised to other variants of the self-voice. Method Two experiments were conducted. In Experiment 1, female and male participants were asked to rate the attractiveness of three types of audio recordings (numbers, vowels, words) from same-sex participants. In Experiment 2, the participants were instructed to rate the attractiveness of six types of audio signals: their own original voice, their recorded voice, a "pitch+20 Hz" audio recording, a "pitch−20 Hz" audio recording, a "loudness+10 dB" audio recording, and a "loudness−10 dB" audio recording. The participants also rated the similarity between the given audio signals and their own voices. Results Experiment 1 showed that the participants rated their own audio recordings as more attractive than others rated those recordings, and they rated their own audio recordings as more attractive than those of others. Experiment 2 revealed that the participants rated the recorded voices and the "loudness+/−10 dB" audio recordings as more attractive than, and as more similar to their own voices than, the "pitch+/−20 Hz" audio recordings. Conclusions The present study demonstrates that people evaluate their own voices as more attractive than the voices of others and that the self-enhancement bias of voice attractiveness can be generalised to similar and familiar versions of the self-voice.
... First impressions play a fundamental role in life as they guide our thoughts, affect subsequent behaviours, and, in turn, influence decisions towards a person [1,2]. The human voice is one of the main sources providing first impressions of a speaker's identity, such as gender, race, age, and vocation [3][4][5][6][7][8][9][10], or physical attributes like height and weight, physical strength, or health and fertility [11][12][13][14][15][16]. Furthermore, largely based on non-verbal vocal information (such as pitch and intonation) rather than verbal content (i.e. ...
Article
Full-text available
It has previously been shown that first impressions of a speaker’s personality, whether accurate or not, can be judged from short utterances of vowels and greetings, as well as from prolonged sentences and readings of complex paragraphs. From these studies, it is established that listeners’ judgements are highly consistent with one another, suggesting that different people judge personality traits in a similar fashion, with three key personality traits being related to measures of valence (associated with trustworthiness), dominance, and attractiveness. Yet, particularly in voice perception, limited research has established the reliability of such personality judgements across stimulus types of varying lengths. Here we investigate whether first impressions of trustworthiness, dominance, and attractiveness of novel speakers are related when a judgement is made on hearing both one word and one sentence from the same speaker. Secondly, we test whether what is said, thus adjusting content, influences the stability of personality ratings. 60 Scottish voices (30 females) were recorded reading two texts: one of ambiguous content and one with socially-relevant content. One word (~500 ms) and one sentence (~3000 ms) were extracted from each recording for each speaker. 181 participants (138 females) rated either male or female voices across both content conditions (ambiguous, socially-relevant) and both stimulus types (word, sentence) for one of the three personality traits (trustworthiness, dominance, attractiveness). Pearson correlations showed personality ratings between words and sentences were strongly correlated, with no significant influence of content. In short, when establishing an impression of a novel speaker, judgments of three key personality traits are highly related whether you hear one word or one sentence, irrespective of what they are saying. This finding is consistent with initial personality judgments serving as elucidators of approach or avoidance behaviour, without modulation by time or content. All data and sounds are available on OSF (osf.io/s3cxy).
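The consistency result above rests on Pearson correlations between per-speaker ratings obtained from words and from sentences. A minimal sketch of that computation is shown below; the rating values are invented for illustration only.

from scipy.stats import pearsonr

# Hypothetical per-speaker mean trait ratings: one score obtained from a single
# word and one from a full sentence spoken by the same speaker.
word_ratings = [3.2, 4.1, 5.0, 2.8, 4.6, 3.9, 5.3, 4.4]
sentence_ratings = [3.5, 4.0, 4.8, 3.1, 4.9, 3.7, 5.1, 4.2]

r, p = pearsonr(word_ratings, sentence_ratings)
print(f"r = {r:.2f}, p = {p:.3f}")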
... It is also unclear whether hands evoke stereotypical connotations as is the case for voices. Upon hearing an unfamiliar voice, listeners create a mental image of the speaker's physical appearance (Krauss et al., 2002), and imagine the physical appearance of computer-generated voices (McGinn & Torre, 2019). Moreover, voice identity perception studies demonstrated that listeners attribute traits (trustworthiness or aggression) to unfamiliar voices after brief exposures (100-400 ms; Mileva & Lavan, 2022). ...
Article
Full-text available
Observing someone perform an action automatically activates neural substrates associated with executing that action. This covert response, or automatic imitation , is measured behaviourally using the stimulus–response compatibility (SRC) task. In an SRC task, participants are presented with compatible and incompatible response–distractor pairings (e.g., an instruction to say “ba” paired with an audio recording of “da” as an example of an incompatible trial). Automatic imitation is measured as the difference in response times (RT) or accuracy between incompatible and compatible trials. Larger automatic imitation effects have been interpreted as a larger covert imitation response. Past results suggest that an action’s biological status affects automatic imitation: Human-produced manual actions show enhanced automatic imitation effects compared with computer-generated actions. Per the integrated theory for language comprehension and production, action observation triggers a simulation process to recognize and interpret observed speech actions involving covert imitation. Human-generated actions are predicted to result in increased automatic imitation because the simulation process is predicted to engage more for actions produced by a speaker who is more similar to the listener. We conducted an online SRC task that presented participants with human and computer-generated speech stimuli to test this prediction. Participants responded faster to compatible than incompatible trials, showing an overall automatic imitation effect. Yet the human-generated and computer-generated vocal stimuli evoked similar automatic imitation effects. These results suggest that computer-generated speech stimuli evoke the same covert imitative response as human stimuli, thus rejecting predictions from the integrated theory of language comprehension and production.
... Apart from conveying traits such as gender or personality, hearing someone's voice for the first time can also make us form a mental image of how that person might look like. For example, it is a common experience to observe "a speaker whose voice is familiar ... and being surprised by that person's appearance" [22]. This confusion, caused by the mismatch between expectation and reality, may adversely effect the ability to for people to establish common ground with robots since, according to Kiesler, a key ingredient for achieving common ground is "to create in people's minds an appropriate mental model of the robot automatically" [21]. ...
... Voices are caused by physical visual structures (i.e., the vocal tract) and provide information about the visual characteristics of the speaker. For example, fundamental frequency (i.e., pitch), formant frequencies and vocal-tract resonance (i.e., timbre) map well to, and are predictive of, structural form cues, including face identity (Krauss, Freyberg and Morsella, 2002; Ives, Smith and Patterson, 2005; Smith and Patterson, 2005; Smith et al., 2005; Ghazanfar et al., 2007; Mavica and Barenholtz, 2013; Kim et al., 2019; Oh et al., 2019). This non-arbitrary coupling of sensory information is reflected at the neural level: the face-benefit for auditory-only voice-identity recognition has been shown to be mediated by responses in the fusiform face area (FFA) (von Kriegstein et al., 2008); ... in noise, the face-benefit for voice-identity recognition might rely on complementary dynamic face-identity cues processed in the pSTS-mFA, rather than the FFA. ...
Preprint
Full-text available
Recognising the identity of voices is a key ingredient of communication. Visual mechanisms support this ability: recognition is better for voices previously learned with their corresponding face (compared to a control condition). This so-called ‘face-benefit’ is supported by the fusiform face area (FFA), a region sensitive to facial form and identity. Behavioural findings indicate that the face-benefit increases in noisy listening conditions. The neural mechanisms for this increase are unknown. Here, using functional magnetic resonance imaging, we examined responses in face-sensitive regions while participants recognised the identity of auditory-only speakers (previously learned by face) in high (SNR -4 dB) and low (SNR 4 dB) levels of auditory noise. We observed a face-benefit in both noise levels, for most participants (16 of 21). In high-noise, the recognition of face-learned speakers engaged the right posterior superior temporal sulcus motion-sensitive face area (pSTS-mFA), a region implicated in the processing of dynamic facial cues. The face-benefit in high-noise also correlated positively with increased functional connectivity between this region and voice-sensitive regions in the temporal lobe in the group of 16 participants with a behavioural face-benefit. In low-noise, the face-benefit was robustly associated with increased responses in the FFA and to a lesser extent the right pSTS-mFA. The findings highlight the remarkably adaptive nature of the visual network supporting voice-identity recognition in auditory-only listening conditions.
... As early as 1931, Pear found in a study of 4,000 radio listeners that they could estimate the age of nine speakers unknown to them in radio broadcasts fairly accurately. Other studies reached similar results (Krauss et al., 2002; Winkler, 2009; Harnsberger et al., 2010). Masaki and Seiji (2008) were able to show that the voice, especially in men, initially up to about ...
Preprint
Police officers conduct conversations: with citizens, with colleagues, with superiors and staff, on the telephone and in direct contact. They pursue various purposes in doing so: they calm, inform, instruct, ask, demand, or explain. As the carrier of verbal communication, the voice is thus one of the most important operational tools of the police. Nevertheless, the voice as an operational tool has so far received little attention in policing. Characteristics of the voice and of the manner of speaking influence the "image" that conversation partners form of one another, and hence also how police officers are perceived. Even when speakers can see each other, vocal characteristics feed into assessments of personality and of competence, strength, assertiveness, and trustworthiness. Vocal characteristics are particularly important for judging the emotional or cognitive state of the other person. In general, pleasant-sounding voices are attributed more positive traits, more competence, higher attractiveness, and more trustworthiness. At the same time, the voices of police officers on duty are exposed to strain. Speaking takes place under unfavourable conditions: in heat, cold, and dust, against noise or wind. This contribution shows that the voice can be kept healthy, capable, and pleasant-sounding through "voice care" in the sense of maintaining health (e.g., not smoking), through favourable posture, and through suitable voice exercises. Finally, it is discussed to what extent the voice should be addressed in future police training and continuing education. This points to a possible subject area for future research in police science.
... For example, vocal attractiveness has been found to be associated with a wide array of characteristics (e.g., body size, health, fertility; Zuckerman & Driver, 1989). Although not necessarily accurate, listeners quickly form impressions regarding the physical characteristics (Krauss et al., 2002) or personality traits (Ko et al., 2006;Markel et al., 1972) of speakers based only on their voices. For example, heterosexual women tend to find deeper voices to be more attractive when considering male targets (Hughes & Miller, 2015). ...
Article
Full-text available
Puberphonia refers to a vocal disorder that involves the persistence of a high-pitched voice beyond the age at which vocal maturation is expected to have occurred. We considered the romantic signaling function of the voice by examining whether puberphonia impacted the ratings of romantic desirability and of perceived attractiveness for short-term and long-term relationships that female raters provided for male targets. We also wanted to examine whether perceptions of these targets (e.g., low levels of masculinity, high levels of femininity) would play a role in the association that puberphonia had with perceived romantic desirability and attractiveness for relationships. Participants were 1,732 heterosexual women who listened to an audio recording of a male target reading a neutral passage either before (pre-treatment) or after (post-treatment) receiving voice therapy to correct his puberphonia. Female raters provided their perceptions of the target after listening to the audio recording. The results revealed that compared with post-treatment targets, pre-treatment targets were rated as being lower in their levels of masculinity, self-esteem, extraversion, and emotional stability but higher in their levels of femininity and agreeableness. In addition, participants rated the pre-treatment targets to be less romantically desirable than post-treatment targets. Perceptions of the targets (e.g., masculinity, self-esteem) mediated the association that puberphonia treatment had with romantic desirability and attractiveness for relationships. Results provide further evidence that vocal characteristics such as pitch may serve as important signals in interpersonal interactions and that men with puberphonia were viewed as less romantically desirable than other individuals, in part, because they are perceived as possessing relatively low levels of masculinity and self-esteem.
... speaking) facial stimuli, but that performance is less likely to be above chance using static faces: For studies contrasting face-voice matching accuracy for dynamic and static faces, some have found that only dynamic face-voice matching is above chance level (Kamachi, Hill, Lander, & Vatikiotis-Bateson, 2003; Lachs & Pisoni, 2004). Others have shown that face-voice matching using static faces is also above chance (Krauss et al., 2002; Mavica & Barenholtz, 2013; Stevenage et al., 2017), particularly when matching procedures have a low memory load (Smith et al., 2016b). Such studies have observed numerical (but not statistical) disadvantages for static faces when compared to matching accuracy for dynamic faces (Smith et al., 2016a, 2016b; Huestegge, 2019). ...
Article
Full-text available
Previous studies have shown that face-voice matching accuracy is more consistently above chance for dynamic (i.e. speaking) faces than for static faces. This suggests that dynamic information can play an important role in informing matching decisions. We initially asked whether this advantage for dynamic stimuli is due to shared information across modalities that is encoded in articulatory mouth movements. Participants completed a sequential face-voice matching task with (1) static images of faces, (2) dynamic videos of faces, (3) dynamic videos where only the mouth was visible, and (4) dynamic videos where the mouth was occluded, in a well-controlled stimulus set. Surprisingly, after accounting for random variation in the data due to design choices, accuracy for all four conditions was at chance. Crucially, however, exploratory analyses revealed that participants were not responding randomly, with different patterns of response biases being apparent for different conditions. Our findings suggest that face-voice identity matching may not be possible with above-chance accuracy but that analyses of response biases can shed light upon how people attempt face-voice matching. We discuss these findings with reference to the differential functional roles for faces and voices recently proposed for multimodal person perception.
... • Some of the speaker's physical characteristics, such as height or weight (Bruckert et al., 2006; Krauss et al., 2002); ...
Thesis
Full-text available
This thesis examines the impact of dysphonia along three main lines: the representation of one's own voice, the transmission of the message, and the perception of the speaker by others. We draw on two populations of female primary school teachers (PE), one of 709 teachers surveyed online and the other of 61 teacher speakers recorded under controlled conditions. Based on an expert perceptual evaluation on the GRBAS scale, our speakers were categorized into two groups of 37 controls and 24 mildly dysphonic speakers. Beyond the substantial vocal complaints and impaired quality of life affecting both populations, we observe an effect of pupils' age on the prevalence of voice disorders. Analysis of our speakers' productions during quiet reading or in front of a noisy class suggests that teachers use adaptation strategies in their professional practice that may be affected by dysphonia. Dysphonia also appears to affect the transmission of information to pupils aged 7 to 10, since longer reaction times were recorded when decoding the voicing contrast in a word-identification task when the instruction was produced by a dysphonic speaker. Finally, following an initial free categorization task, the attribution of personality traits by a panel of naïve listeners based solely on the teachers' voices reveals vocal profiles associated with more or less positive representations. The moderate agreement observed between the perceived degree of voice disorder and the expert evaluation of dysphonia appears to be linked to naïve listeners' positive perception of roughness.
... The moment we start to speak, we automatically reveal information about our biological, psychological, and social status. Research has demonstrated that characteristics, such as a person's gender, age, affect, and their membership in social or ethnic groups, can be inferred from the voice only, even if the person was previously unknown to the judge (Giles et al., 1979;Eagly and Wood, 1982;Kohlberg et al., 1987;Krauss et al., 2002;Pinker, 2003;Tiwari and Tiwari, 2012;Smith et al., 2016). ...
Article
Full-text available
The growing popularity of speech interfaces goes hand in hand with the creation of synthetic voices that sound ever more human. Previous research has been inconclusive about whether anthropomorphic design features of machines are more likely to be associated with positive user responses or, conversely, with uncanny experiences. To avoid detrimental effects of synthetic voice design, it is therefore crucial to explore what level of human realism human interactors prefer and whether their evaluations may vary across different domains of application. In a randomized laboratory experiment, 165 participants listened to one of five female-sounding robot voices, each with a different degree of human realism. We assessed how much participants anthropomorphized the voice (by subjective human-likeness ratings, a name-giving task and an imagination task), how pleasant and how eerie they found it, and to what extent they would accept its use in various domains. Additionally, participants completed Big Five personality measures and a tolerance of ambiguity scale. Our results indicate a positive relationship between human-likeness and user acceptance, with the most realistic sounding voice scoring highest in pleasantness and lowest in eeriness. Participants were also more likely to assign real human names to the voice (e.g., “Julia” instead of “T380”) if it sounded more realistic. In terms of application context, participants overall indicated lower acceptance of the use of speech interfaces in social domains (care, companionship) than in others (e.g., information & navigation), though the most human-like voice was rated significantly more acceptable in social applications than the remaining four. While most personality factors did not prove influential, openness to experience was found to moderate the relationship between voice type and user acceptance such that individuals with higher openness scores rated the most human-like voice even more positively. Study results are discussed in the light of the presented theory and in relation to open research questions in the field of synthetic voice design.
... The apparent links between voice and impressions of speakers' characteristics have been observed for much of human history, at least since Ancient Greece. Previous studies on these links suggest that humans rely on non-verbal vocal information to judge speakers' sex [2], age [3], body size [4], physical strength [5], attractiveness [6], personality [7], professional competence and success [8], among other aspects, though not always with accurate results [9,10]. ...
... The human voice conveys important information about the speaker; listeners can often make adequate judgments regarding the speakers' gender, age, physical attributes and emotional state based on the voice (Coleman, 1976;Hartman & Danhauer, 1976;Krauss, Freyberg, & Morsella, 2002;Lass, Hughes, Bowyer, Waters, & Bourne, 1976;Sauter, Eisner, Ekman, & Scott, 2010;Scherer, 1995;Yogo, Tsutsui, Ando, Hashi, & Yamada, 2000). Likewise, listeners may use vocal cues, such as vocal quality, to make assumptions about the speakers' personality (Aronovitch, 1976;Belin, Boehme, & McAleer, 2017;Lass, Ruscello, Bradshaw, & Blankenship, 1991;Page & Balloun, 1978;Waaramaa, Lukkarila, Järvinen, Geneid, & Laukkanen, 2021). ...
Article
Objective: People with dysphonia are judged more negatively than peers with normal vocal quality. This preliminary study aims to (1) investigate correlations between both auditory-perceptual and objective measures of vocal quality of dysphonic and non-dysphonic speakers and attitudes of listeners, and (2) discover whether these attitudes towards people with dysphonia vary for different types of stimuli: auditory (A) stimuli and combined auditory-visual (AV) stimuli. Visual (V) stimuli were included as a control condition. Method: Ten judges with no experience in the evaluation of dysphonia were asked to rate A, AV and V stimuli of 14 different speakers (10 dysphonic and 4 non-dysphonic speakers). Cognitive attitudes, evaluation of voice characteristics and behavioral attitudes were examined. Pearson and Spearman correlation coefficients were calculated to examine correlations between attitude scores and both Dysphonia Severity Index (DSI) values and perceptual vocal quality, as assessed by a speech-language pathologist (PVQSLP) or by the judges (PVQjudge). Linear mixed model (LMM) analyses were conducted to investigate differences between speakers and stimuli conditions. Results: Statistically significant correlations were found between both perceptual and objective measures of vocal quality and mean attitude scores for A and AV stimuli, indicating increasingly negative attitudes with increasing dysphonia severity. Fewer statistically significant correlations were found for the combined AV stimuli than for A stimuli, and no significant correlations were found for V stimuli. LMM analyses revealed significant group effects for several cognitive attitudes. Conclusion: Generally, people with dysphonia are judged more negatively by listeners than peers without dysphonia. However, the findings of this study suggest a positive influence of visual cues on the judges' cognitive and behavioral attitudes towards dysphonic speakers. Further research is needed to investigate the significance of this influence.
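To make the correlational part of this design concrete, here is a minimal sketch (not the authors' analysis code) of how Pearson and Spearman coefficients could be computed between per-speaker vocal quality measures and mean listener attitude scores; all values below are hypothetical placeholders.

import numpy as np
from scipy.stats import pearsonr, spearmanr

# One value per speaker (the study had 14 speakers); every number here is invented.
dsi = np.array([4.2, 3.8, 1.1, 0.5, -0.3, 2.9, 4.5, 1.8, 0.2, 3.1, 4.0, 2.2, -1.0, 3.6])   # objective measure
pvq = np.array([0, 0, 2, 2, 3, 1, 0, 1, 3, 1, 0, 1, 3, 0])                                  # perceptual rating (0 = normal)
attitude = np.array([4.1, 4.0, 2.8, 2.6, 2.1, 3.5, 4.3, 3.2, 2.0, 3.4, 4.2, 3.3, 1.9, 3.9]) # mean listener attitude

r, p = pearsonr(dsi, attitude)         # continuous objective measure -> Pearson
rho, p_rho = spearmanr(pvq, attitude)  # ordinal perceptual rating -> Spearman
print(f"DSI vs attitude: r = {r:.2f}, p = {p:.3f}")
print(f"PVQ vs attitude: rho = {rho:.2f}, p = {p_rho:.3f}")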
... This robust benefit of voice-face learning is likely underpinned by the perceptual system's sensitivity to the common-cause static identity cues available in both sensory streams. Voices are caused by physical visual structures (i.e., the vocal tract), and static voice properties including vocal-tract resonance (i.e., timbre) and fundamental frequency (i.e., pitch) provide information about the visual structural characteristics of the speaker, including face-identity (Krauss, Freyberg and Morsella, 2002; Ives, Smith and Patterson, 2005; Smith and Patterson, 2005; Smith et al., 2005; Ghazanfar et al., 2007; Mavica and Barenholtz, 2013; Kim et al., 2019; Oh et al., 2019). These causal cross-modal relationships are rapidly acquired (Shams and Seitz, 2008; von Kriegstein et al., 2008), facilitating subsequent auditory-only recognition processing at a speaker-specific level (Blank, Kiebel and von Kriegstein, 2015). ...
Preprint
Full-text available
Perception of human communication signals is often more robust when there is concurrent input from the auditory and visual sensory modality. For instance, seeing the dynamic articulatory movements of a speaker, in addition to hearing their voice, can help with understanding what is said. This is particularly evident in noisy listening conditions. Even in the absence of concurrent visual input, visual mechanisms continue to be recruited to optimise auditory processing: auditory-only speech and voice-identity recognition is superior for speakers who have been previously learned with their corresponding face, in comparison to an audio-visual control condition; an effect termed the “face-benefit”. Whether the face-benefit can assist in maintaining robust perception in noisy listening conditions, in a similar manner to concurrent visual input, is currently unknown. Here, in two behavioural experiments, we explicitly examined this hypothesis. In each experiment, participants learned a series of speakers’ voices together with their corresponding dynamic face, or a visual control image depicting the speaker’s occupation. Following learning, participants listened to auditory-only sentences spoken by the same speakers and were asked to recognise the content of the sentences (i.e., speech recognition, Experiment 1) or the identity of the speaker (i.e., voice-identity recognition, Experiment 2) in different levels of increasing auditory noise (SNR +4 dB to -8 dB). For both speech and voice-identity recognition, we observed that for participants who showed a face-benefit, the benefit increased with the degree of noise in the auditory signal (Experiment 1, 2). Taken together, these results support an audio-visual model of human auditory communication and suggest that the brain has developed a flexible system to deal with auditory uncertainty – learned visual mechanisms are recruited to enhance the recognition of the auditory signal.
... (Abercrombie, 1967; Laver, 1968; Levi, 2021) (Bakir, 2016; Müller, 2006) to gather evidence from more than one source. Several studies have examined listeners' ability to identify speaker gender from the voice (Preston & Niedzielski, 2010), some with children (Foulkes et al., 2010) and most with adults (Krauss et al., 2002) (Brown et al., 2021). (Fant, 1956). ...
Article
Full-text available
This study is an attempt to examine the hypothesis that, compared to plains, emphatics negatively affect the bleed-through of indexical information associated with the gender of male and female talkers for Arabic listeners. In a preliminary experiment, Arabic male and female participants were asked to identify the consonant type of the auditorily presented CV(V) and (V)VC fragments/monosyllables as either plains or emphatics. The participants were generally able to identify both types of consonants successfully and accurately. In the main experiment, Arabic speakers and non-Arabic speakers (both male and female) were asked to classify the gender of the male and female Arabic native talkers from the stimuli used in the first experiment. The results of this second task varied between non-Arabic speakers, who surprisingly committed fewer errors in classifying talker gender, and Arabic speakers, who committed more errors. The errors committed by the Arabic speakers were observed in the emphatics condition, especially when the talker was female. These two patterns can be taken as collective evidence that the acoustic characteristics of emphatics may bias listeners’ perception of indexical information towards the male voice, which makes talker gender classification more difficult, especially in critical scenarios such as in forensic phonetic casework. Keywords: acoustics, emphatics, forensic phonetics, indexical information, gender classification
... In Experiment 2, speech stereotypicality activated expectations about phenotypicality, such that strongly stereotypical Black voices were associated with more phenotypically Black faces, and weakly stereotypical Black voices were associated with less phenotypically Black faces. While some physical characteristics do influence how speakers sound, such as height and weight (Krauss et al., 2002) or agerelated changes to the structure of the vocal tract (Caruso et al., 1995), many of the associations between voice and appearance are informed by the surrounding cultural context. Given that there is no inherent reason why more stereotypical-sounding Black speakers should also look more stereotypically Black, the results from Experiment 2 suggest that the linguistic cues that listeners have associated with being more stereotypically Black activate expectations about the speaker's appearance. ...
Article
Full-text available
Black Americans who are perceived as more racially phenotypical—that is, who possess more physical traits that are closely associated with their race—are more often associated with racial stereotypes. These stereotypes, including assumptions about criminality, can influence how Black Americans are treated by the legal system. However, it is unclear whether other forms of racial stereotypicality, such as a person’s way of speaking, also activate stereotypes about Black Americans. We investigated the links between speech stereotypicality and racial stereotypes (Experiment 1) and racial phenotype bias (Experiment 2). In Experiment 1, participants listened to audio recordings of Black speakers and rated how stereotypical they found the speaker, the likely race and nationality of the speaker, and indicated which adjectives the average person would likely associate with this speaker. In Experiment 2, participants listened to recordings of weakly or strongly stereotypical Black American speakers and indicated which of two faces (either weakly or strongly phenotypical) was more likely to be the speaker’s. We found that speakers whose voices were rated as more highly stereotypical for Black Americans were more likely to be associated with stereotypes about Black Americans (Experiment 1) and with more stereotypically Black faces (Experiment 2). These findings indicate that speech stereotypicality activates racial stereotypes as well as expectations about the stereotypicality of an individual’s appearance. As a result, the activation of stereotypes based on speech may lead to bias in suspect descriptions or eyewitness identifications.
Article
Reactions to earnings calls are sensitive to subtle features of managers' speech, but little is known about the effect of nonnative accents in this setting. Nonnative-accented CEOs may avoid holding calls in English for fear of investors' negative stereotypes. However, theory indicates that stereotypes from the CEO position and nonnative accents conflict, and that the process of reconciling conflicting stereotypes requires effortful processing. We use a series of four experiments to test each link of the causal chain that we hypothesize based on this theory. We demonstrate that motivated investors reconcile conflicting stereotypes by inferring exceptional qualities, such as hard work and determination, that positively affect their impressions of nonnative-accented CEOs and, hence, of the company as an investment. We also show that, because bad news stimulates effortful processing, investors receiving bad (versus good) news are more likely to form a positive image of nonnative-accented CEOs and their companies. Data Availability: Contact the authors.
Article
Full-text available
Listeners are able to very approximately estimate speakers’ ages, with a mean estimation error of around ten years. Interestingly, accuracy varies considerably, depending on a number of social aspects of both speaker and listener, including age, gender and native language or language variety. The present study considers the effects of four factors on age perception. It investigates whether there is a main effect of speakers’ native language (Arabic, Korean and Mandarin) even when speaking a second language, English. It also investigates a particular speaker-listener relationship, namely the degree of linguistic familiarity. Linguistic familiarity was expected to be greater between Mandarin and Korean than between Mandarin or Korean and Arabic. In addition, it considers the effect of the acoustic cues of mean fundamental frequency (F0) and speech rate on age estimates. Fifteen Arabic-accented, fifteen Korean-accented and twenty Mandarin-accented English speakers participated as listeners. They heard audio stimuli produced by forty-eight speakers, equally distributed between native Arabic, Korean and Mandarin speakers, reading a short passage in English. Listeners were instructed to estimate speakers’ ages in years. Listeners’ age estimates and reaction times were recorded. Results indicate a significant main effect of speaker native language on perceived age such that Mandarin speakers were estimated to be younger than Arabic speakers. There was also a significant effect of linguistic familiarity on age estimation accuracy. Age estimates were more accurate with greater linguistic familiarity, i.e., native Korean and Mandarin listeners estimated ages of speakers of their own native languages more accurately than native Arabic speakers’ ages and vice versa. In terms of acoustic cues, mean F0 and speech rate were significant predictors of age estimation. These effects suggest that in perception, age may be marked not only by biological changes that occur over the lifetime, but also by language-specific socio-cultural features.
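Since the study reports mean fundamental frequency and speech rate as significant predictors of age estimates, a minimal regression sketch along those lines follows; it is not the study's analysis, and the predictor and response values are hypothetical.

import numpy as np

# Per-speaker values, all hypothetical: mean F0 in Hz, speech rate in syllables/s,
# and the mean age (years) that listeners assigned to that speaker.
mean_f0 = np.array([110.0, 125.0, 180.0, 210.0, 145.0, 200.0, 95.0, 160.0])
speech_rate = np.array([4.8, 5.1, 3.9, 4.2, 5.6, 3.5, 4.0, 5.0])
perceived_age = np.array([48.0, 41.0, 30.0, 24.0, 35.0, 27.0, 55.0, 33.0])

# Ordinary least squares: perceived_age ~ intercept + mean_f0 + speech_rate
design = np.column_stack([np.ones(len(mean_f0)), mean_f0, speech_rate])
coef, *_ = np.linalg.lstsq(design, perceived_age, rcond=None)
intercept, b_f0, b_rate = coef
print(f"perceived age ~ {intercept:.1f} + {b_f0:.3f} * F0 + {b_rate:.1f} * rate")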
Article
Current research shows that listeners are generally accurate at estimating speakers' age from their speech. This study investigates the effect of speaker first language and the role played by such speaker characteristics as fundamental frequency and speech rate. In this study English and Japanese first language speakers listened to English- and Japanese-accented English speech and estimated the speaker's age. We find the highest correlation between real and estimated speaker age for English listeners listening to English speakers, followed by Japanese listeners listening to both English and Japanese speakers, with English listeners listening to Japanese speakers coming last. We find that Japanese speakers are estimated to be younger than the English speakers by English listeners, and that both groups of listeners estimate male speakers and speakers with a lower mean fundamental frequency to be older. These results suggest that listeners rely on sociolinguistic information in their speaker age estimations and language familiarity plays a role in their success.
Chapter
Speech is one of the most important modes to communicate and interact in human–human interaction (HHI). It contains semantic and pragmatic meaning, often in an underspecified and indirect way, by referencing to situational and world knowledge.
Article
With the development and increasing deployment of smart home devices, voice control supports comfortable end user interactions. However, potential end users may refuse to use Voice-controlled Digital Assistants (VCDAs) because of privacy concerns. To address these concerns, some manufacturers provide limited privacy-preserving mechanisms for end users; however, these mechanisms are seldom used. We herein provide an analysis of privacy threats resulting from the utilization of VCDAs. We further analyze how existing solutions address these threats considering the principles of the European General Data Protection Regulation (GDPR). Based on our analysis, we propose directions for future research and suggest countermeasures for better privacy protection.
Article
Full-text available
Previous western studies revealed a two-dimensional model (valence and dominance) in voice impressions. To explore the cross-cultural validity of this model, the present study recruited Chinese participants to evaluate other people’s personality from recordings of Chinese vocal greeting word “Ni Hao”. Principal Component Analysis (PCA) with Varimax Rotation and Parallel Analysis was used to investigate the dimensions underlying personality judgments. The results also revealed a two-dimensional model: approachability and capability. The approachability dimension was similar to the valence dimension reported in a previous study. It indicated that the approachability/valence dimension has cross-cultural commonality. Unlike the dimension of dominance which was closely related to aggressiveness, the dimension of capability emphasized the social aspects of capability such as intellectuality, social skills, and tenacity. In addition, the acoustic parameters that were used to infer the personality of speakers, as well as the relationship between vocal attractiveness and the personality dimensions of voice, were also partially different from the findings in Western culture.
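As an illustration of the dimensional analysis described here, the sketch below runs a plain PCA with a hand-rolled Varimax rotation on placeholder rating data. It is not the authors' pipeline: the parallel analysis step that selects the number of components is omitted, and the two-component solution is simply assumed.

import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    # Standard Varimax rotation of a (variables x components) loading matrix.
    p, k = loadings.shape
    rotation = np.eye(k)
    last_objective = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        grad = loadings.T @ (rotated ** 3 - (gamma / p) * rotated @ np.diag(np.sum(rotated ** 2, axis=0)))
        u, s, vt = np.linalg.svd(grad)
        rotation = u @ vt
        objective = s.sum()
        if objective - last_objective < tol:
            break
        last_objective = objective
    return loadings @ rotation

rng = np.random.default_rng(0)
ratings = rng.normal(size=(60, 10))                 # 60 voices x 10 trait scales (placeholder data)
z = (ratings - ratings.mean(axis=0)) / ratings.std(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(z, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

n_components = 2                                    # assumed here; chosen by parallel analysis in the study
loadings = eigvecs[:, :n_components] * np.sqrt(eigvals[:n_components])
print(varimax(loadings).round(2))                   # rotated loadings, one row per trait scale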
Article
Full-text available
Recognising the identity of voices is a key ingredient of communication. Visual mechanisms support this ability: recognition is better for voices previously learned with their corresponding face (compared to a control condition). This so-called ‘face-benefit’ is supported by the fusiform face area (FFA), a region sensitive to facial form and identity. Behavioural findings indicate that the face-benefit increases in noisy listening conditions. The neural mechanisms for this increase are unknown. Here, using functional magnetic resonance imaging, we examined responses in face-sensitive regions while participants recognised the identity of auditory-only speakers (previously learned by face) in high (SNR −4 dB) and low (SNR +4 dB) levels of auditory noise. We observed a face-benefit in both noise levels, for most participants (16 of 21). In high-noise, the recognition of face-learned speakers engaged the right posterior superior temporal sulcus motion-sensitive face area (pSTS-mFA), a region implicated in the processing of dynamic facial cues. The face-benefit in high-noise also correlated positively with increased functional connectivity between this region and voice-sensitive regions in the temporal lobe in the group of 16 participants with a behavioural face-benefit. In low-noise, the face-benefit was robustly associated with increased responses in the FFA and to a lesser extent the right pSTS-mFA. The findings highlight the remarkably adaptive nature of the visual network supporting voice-identity recognition in auditory-only listening conditions.
Chapter
Full-text available
The purpose of this short paper is to provide an overview of possible directions for researching the extralinguistic layer of speech. Based on the ideas of Lyons (communicative and informative dimensions of speech) and Laver (linguistic, paralinguistic and extralinguistic layers), three types of research areas and applications are outlined. Research areas include finding correlations between (1) speech parameters and speaker traits, (2) actual and perceived traits of speakers, and (3) perceived speaker traits and speech parameters. Type (1) research results suggest applications in medical diagnosis and forensic speaker identification. Type (2) results can be used to analyse stereotypes in speaker perception (halo effect), where the perceived voice quality influences attitudes (communication style, prejudices, etc.) towards the speaker. Raising awareness about such stereotypes may help improve human relations. Type (3) research may help understand which speech parameters listeners use in forming their impressions about the speaker. These impressions often serve as a basis for decisions, for example, when customers make different decisions because they hear different voices in ads. Research results also suggest that certain speech parameters even influence electors' choices.
Article
Extralegal factors such as accent status, race and age may affect how someone is perceived in courtrooms. Even eyewitnesses who are not on trial may be rated less favorably as a result of such features. The current study measured accent status, race and age with 254 participants listening to oral witness statements. Results indicate eyewitnesses with higher-status accents were rated more favorably than those with lower-status accents and younger black eyewitnesses were rated higher than older black witnesses. White eyewitnesses were more favorably rated than black witnesses although this was qualified by results suggesting anti-norm deviance. The findings provide the criminal justice system with reasons to question how interactions among witness characteristics and with observer characteristics may influence court decisions.
Article
Full-text available
The article focuses on the speaker evaluation experiment conducted in the spring of 2016 in Vilnius schools with Russian as the language of instruction. The aim of the experiment was to reveal the students’ subconscious attitudes (evaluations) and determine whether the four speaking styles of Vilnius, which had been distinguished conventionally for the purposes of the research, were recognized by the respondents and what social meanings the styles were associated with. The same experiment was conducted in 2014 in Vilnius schools with Lithuanian as the language of instruction. The study proved the hypothesis that there was a clear hierarchy of the speech styles differentiated by the variants of /i/, /u/, /i + R/, /u + R/ of different duration used in a stressed position. The styles are socially significant to ethnic Lithuanian school students and function as markers of social personality types associated with different personality traits, professions and ethnicity. This year’s experiment is based on the assumption that the social stigma created by standardization ideology and associated with Slavic speakers has affected the subconscious attitudes of students from Russian schools so much that Vilnius speech styles will evoke to them similar associations to those of the students of Lithuanian origin; in other words, phonetic variants which distinguish the styles are likely to identify the same social types of speakers. The research has proved the initial hypothesis. The style Kam+GalSL used by Vilnius city dwellers of Slavic origin tends to be perceived as revealing a Slavic background but does not serve as a marker of high social status and high professional competence. Therefore, even though the participants of the experiment attend Russian schools, their linguistic attitudes are not lingo-centric, namely, they are involved in the same field of social meanings as the Lithuanian school students (such social meanings as non-Lithuanian, less educated, having a poorer job are chosen when reflecting on the Slavic pronunciation). Therefore, the respondents may apply the same ideological scheme on the subconscious level while evaluating the speech of a group to which they belong according to the distinguished features of stimuli. Additional social meanings of this style include otherness (weird), poor communicational skills (poor speaker), low social status and working-class professions indicating meanings (laborer, janitor, market dealer). It seems that the variability of duration in stressed /i/, /u/, /i + R/, /u + R/, which is typical of Lithuanian city dwellers in Vilnius, acquires a different value among Russians speakers in Vilnius. The Kam speaking style, originating from a dialect and distinguished by phonetic variants, is associated with a lower social value in comparison with the styles Kam+GalLT and Neu, which include strongly stigmatized phonetic variants, associated with the speech of Vilnius city dwellers. Both styles Kam+GalLT and Neu are associated with a social type of a speaker of high social status, substantial income, leading positions and high professional competence; however, their sub-types of association are different. Representatives of the Kam speaking style are characterized as provincial, of lower status, working-class professions and representatives of the services area.
Article
In this article, we discuss the threat to privacy created by the passive data collection of body-worn cameras (BWCs), along with opportunities to mitigate this risk. Furthermore, we argue that the use case of BWCs at work will stimulate the development of solutions that prevent the collection of data that could infringe upon the privacy of the wearer. Finally, we discuss the desirable properties of privacy-enhancing technologies (PETs) for BWCs.
Article
This article describes two experiments investigating listeners’ accuracy in estimation of speaker age as well as the listeners’ confidence that their estimates were correct. In Experiment 1, listeners made age estimates based on spontaneous speech. In Experiment 2, the estimates were based on read speech. The purpose of the study was to explore differences in accuracy and confidence depending on speech material, speaker characteristics (gender and age) and listener gender. Another purpose was to examine the realism in the listeners’ confidence ratings in estimations of spontaneous versus read speech. No differences in accuracy or confidence were found due to speech material type. Although accuracy was higher in estimates of male speakers, confidence was higher in estimates of female speakers, effects that were also dependent on speaker age. Possible acoustic and linguistic explanations behind the age and gender effects are discussed. As the correlation between confidence and accuracy was weak, it was concluded that confidence should not be relied on as an indicator of accuracy in estimation of speaker age.
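As a concrete illustration of the weak confidence-accuracy relationship reported above, the sketch below correlates confidence ratings with estimation accuracy (defined here as negative absolute error). It is not the study's analysis code, and all values are hypothetical.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-trial values: the speaker's real age, the listener's estimate,
# and the listener's confidence on a 7-point scale.
true_age   = np.array([23, 34, 45, 56, 67, 29, 41, 62, 50, 38])
estimate   = np.array([28, 30, 40, 60, 58, 35, 45, 55, 49, 44])
confidence = np.array([5, 4, 6, 3, 5, 6, 4, 2, 7, 5])

accuracy = -np.abs(estimate - true_age)   # higher values = more accurate estimates
r, p = pearsonr(confidence, accuracy)
print(f"confidence-accuracy correlation: r = {r:.2f}, p = {p:.3f}")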
Article
Voice modulation is important when navigating social interactions-tone of voice in a business negotiation is very different from that used to comfort an upset child. While voluntary vocal behavior relies on a cortical vocomotor network, social voice modulation may require additional social cognitive processing. Using functional magnetic resonance imaging, we investigated the neural basis for social vocal control and whether it involves an interplay of vocal control and social processing networks. Twenty-four healthy adult participants modulated their voice to express social traits along the dimensions of the social trait space (affiliation and competence) or to express body size (control for vocal flexibility). Naïve listener ratings showed that vocal modulations were effective in evoking social trait ratings along the two primary dimensions of the social trait space. Whereas basic vocal modulation engaged the vocomotor network, social voice modulation specifically engaged social processing regions including the medial prefrontal cortex, superior temporal sulcus, and precuneus. Moreover, these regions showed task-relevant modulations in functional connectivity to the left inferior frontal gyrus, a core vocomotor control network area. These findings highlight the impact of the integration of vocal motor control and social information processing for socially meaningful voice modulation.
Article
Full-text available
Discusses M. Rokeach's (see PA, Vol. 35:734) belief theory of prejudice which states that racial prejudice is the result of the anticipation of belief differences. The unidirectional causal relationship implied is criticized as oversimplified. Research supporting the belief theory is examined, with conceptual and experimental deficiences noted. A new formulation is proposed which emphasizes mutual causality between racial prejudice and anticipated belief differences. 2 studies were conducted with 56 white women's club members and 120 white undergraduates. Belief communications were presented as tape-recorded interviews or speeches, with the race and social class of the communicator 1st having been manipulated. The interrelationships between communicator's race, specific communication topic, and S's prejudice level on the dimensions of felt similarity to the communicator support the mutual causation formulation. (26 ref.) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Full-text available
PsyScope is an integrated environment for designing and running psychology experiments on Macintosh computers. The primary goal of PsyScope is to give both psychology students and trained researchers a tool that allows them to design experiments without the need for programming. PsyScope relies on the interactive graphic environment provided by Macintosh computers to accomplish this goal. The standard components of a psychology experiment—groups, blocks, trials, and factors—are all represented graphically, and experiments are constructed by working with these elements in interactive windows and dialogs. In this article, we describe the overall organization of the program, provide an example of how a simple experiment can be constructed within its graphic environment, and discuss some of its technical features (such as its underlying scripting language, timing characteristics, etc.). PsyScope is available for noncommercial purposes free of charge and unsupported to the general research community. Information about how to obtain the program and its documentation is provided.
Article
Full-text available
Most theories of spoken word identification assume that variable speech signals are matched to canonical representations in memory. To achieve this, idiosyncratic voice details are first normalized, allowing direct comparison of the input to the lexicon. This investigation assessed both explicit and implicit memory for spoken words as a function of speakers' voices, delays between study and test, and levels of processing. In 2 experiments, voice attributes of spoken words were clearly retained in memory. Moreover, listeners were sensitive to fine-grained similarity between 1st and 2nd presentations of different-voice words, but only when words were initially encoded at relatively shallow levels of processing. The results suggest that episodic memory traces of spoken words retain the surface details typically considered as noise in perceptual systems.
Article
Conducted 2 studies on speech samples from 32 male college students. In Exp I it was shown that the average voice fundamental frequency of Ss was higher when lying than when telling the truth. In Exp II, judges rated the truthfulness of 64 true and false utterances either from an audiotape that had been electronically filtered to render the semantic content unintelligible or from an unfiltered tape. The truthfulness ratings of judges who heard the content-filtered tape were negatively correlated with fundamental frequency, whereas for the unfiltered condition, truthfulness ratings were uncorrelated with pitch. Although ratings made under the 2 conditions did not differ in overall accuracy, accuracy differences were found that depended on how an utterance had been elicited originally. (PsycINFO Database Record (c) 2006 APA, all rights reserved).
Article
"In a study of social distance of college students with respect to various social objects, a factorial design with two levels of value of race, social class, religion, and nationality was employed and analyses of variance were computed on social distance scores. For white Ss race and social class were found to be more important determinants of social distance than religion or nationality… . The data are interpreted in terms of a theory of prejudice that employs conformity, cognitive dissonance, and insecurity as its main constructs." (31 ref.) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
In order to induce stress in an experimental subject, a task involving the addition of numbers under time pressure was developed. The subject was required to read six meters and to announce the sum of his readings, together with a test phrase. By controlling the duration of the meter display, the experimenter could vary the level of stress induced in the subject. For each of 10 subjects, numerous verbal responses were obtained while the subject was under stress and while he was relaxed. Contrasting responses containing the same test phrase were assembled into paired‐comparison listening tests. Listeners could identify the stressful responses of some subjects with better than 90% accuracy and of others only at chance level. The test phrases from contrasting responses were analyzed with respect to level and fundamental frequency, and spectrograms of these test phrases were examined. The results indicate that task‐induced stress can produce a number of characteristic changes in the acoustic speech signal. Most of these changes are attributable to modifications in the amplitude, frequency, and detailed waveform of the glottal pulses. Other changes result from differences in articulation. Although the manifestations of stress varied considerably from subject to subject, the test phrases of most subjects exhibited some consistent effects.
Article
The importance of one's speech as an indicator of social status has received little attention in America. Several writers of books on social stratification have even suggested that the speech differences between members of upper and lower classes are very subtle and inconsequential. This article presents research evidence to the contrary, based primarily on three research projects conducted by the present author and two by other researchers. The findings suggest that persons' social status is revealed by their voice—even when content-free speech is used, e.g., counting from one to 20. Persons speaking one regional dialect of American English can identify the social status of persons speaking different dialects. The research also attempts to isolate the various speech qualities which reveal one's social status and to investigate the ability of speakers to disguise these qualities.
Article
Tape recordings of telephone conversations of Consolidated Edison’s system operator (SO) and his immediate superior (CSO), beginning an hour before the 1977 New York blackout, were analyzed for indications of psychological stress. (SO was responsible for monitoring and switching power loads within the Con Ed network.) Utterances from the two individuals were analyzed to yield several pitch and amplitude statistics. To assess the perceptual correlates of stress, four groups of listeners used a seven‐point scale to rate the stress of SO and CSO from either randomized vocal utterances or transcripts of the randomized utterances. Results indicated that whereas CSO’s vocal pitch increased significantly with increased situational stress, SO’s pitch decreased. Listener ratings of stress from the voice were positively related to average pitch. It appears that listeners’ stereotype of psychological stress includes elevated pitch and amplitude levels, as well as their increased variability.
Article
Studied the effects of the physiological condition (PC) of the speaker on the rating of age provided by the listener by evaluating 30 males (aged 25–35 yrs, 45–55 yrs, and 65–75 yrs) in terms of PC. Speech recordings were made as Ss produced sustained vowel sounds and read a standard passage. 58 female and 2 male listeners with normal hearing (aged 20–35 yrs) rated the sustained vowels or the speech samples for age of the speaker. For vowels alone there was no statistically significant relationship between age ratings and actual age. For sentence material the relationship between perceived and actual age was significant for those in poorer PC; for these readers, phonation was not stable and spectral noise was apparent. Results indicate that there is evidently some reliability in the ability of the listener to gauge the S's age based on connected speech, particularly if the S is in poor PC. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
80 Black and 74 White college students assigned traits, from a list of 80, to the Black lower class, Black middle class, White lower class, and White middle class. Each S rated the 5 or fewer traits that he or she had chosen as being most typical of the respective race–class groups from –5 (unfavorable) to +5 (favorable) for the given groups. Ss also assigned themselves to 1 of 4 classes: lower class, working class, middle class, or upper class. On the basis of these judgments, the Ss within each racial group were classified as perceiving themselves to be above or below the median of their own race's distribution. White Ss assigned more favorable characteristics to the middle than to the lower class and did not rate Blacks lower than Whites. Black Ss made a similar, but smaller, social class distinction and, in addition, generally perceived Blacks more favorably than Whites. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Re-evaluated the data by N. J. Lass et al (see record 1981-07022-001) to determine whether listeners are able to identify accurately the heights and weights of speakers when presented with only their voices. This claim cannot be maintained; although the mean of estimated heights and weights for a group of male speakers was larger than that for a group of female speakers, there is no correlation between estimated size and actual size for individual speakers within their data set. However, the speakers' voices contain features that are incorrectly used by listeners as indicators of body height and weight. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Ten experiments were carried out in attempting to discover the accuracy of listeners' judgments of various personal characteristics from the voice. Two were regular radio broadcasts, while six others were simulated broadcasts in the laboratory. The final two compared the natural voice with the same voice heard through a loudspeaker. Judgments were made in all cases by a matching technique, and only such traits were used as could be determined by some objective test or criterion. The results for physical characteristics of age, height, complexion, physical appearance in photographs, appearance in person, and handwriting all showed consistent small but positive relationships between ratings and actual traits. Judgments on vocation, political preference, extroversion, ascendance, and dominant interest showed more general agreement than did judgments for the physical traits, but they were not more generally correct. Listeners to the actual voice gave results about 7% better than listeners to the radio voice. Free descriptions by listeners and actual summary sketches showed that more accurate judgments were made in terms of such total pictures. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Investigates the social importance of the individual's speech style, discussing "linguistic norms" with reference to a variety of cultures and research sources. Endogenous and exogenous factors in speech style are discussed, and a tentative theory to explain speech modification is proposed. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
The primary purpose of this paper is to examine age-related changes in the mechanisms associated with two major components of communication: speech and voice. A secondary purpose is to discuss the impact of these changes on nonspeech functions (e.g., chewing and swallowing) associated with the orofacial mechanism (the same mechanism involved in articulation of sounds and syllables). Central to these perspectives is a focus on communication behavior as a critical tool for the older adult in the process of life adjustment. Knowledge of gerontological communication behaviors is important to physical therapists, occupational therapists and other professionals who must maximize successful communication with older clients.
Article
Two studies on speech samples from 32 male college students are reported. In the first, it was shown that the average voice fundamental frequency of the subjects was higher when lying than when telling the truth. In the second, judges rated the truthfulness of 64 true and false utterances either from an audiotape that had been electronically filtered to render the semantic content unintelligible or from an unfiltered tape. The truthfulness ratings of the judges who heard the content-filtered tape were negatively correlated with fundamental frequency, whereas for the unfiltered condition, truthfulness ratings were uncorrelated with pitch. Although ratings made under the two conditions did not differ in overall accuracy, accuracy differences were found that depended on how an utterance had been elicited originally.
Article
The purpose of this investigation was to determine if listeners were capable of speaker height and weight identifications from recorded speech samples. A standard prose passage was recorded by 30 speakers, 15 females and 15 males. A master tape containing the randomly arranged recorded readings of all speakers was played to a group of 30 subjects for speaker height and weight identification purposes. All subjects participated in two experimental sessions. In one session they were asked to determine the height of each of the speakers on the tape, and in another session weight judgments were made. The order of presentation of the height and weight tasks was randomized so that 15 subjects made height judgments first while 15 subjects made weight judgments first. A multiple choice response sheet containing four choices for the judgment of height and weight for each speaker was provided. Results indicate that the subjects were capable, with slightly better than chance guessing accuracy, of identifying the heights of male and female speakers and the weights of male speakers when presented with only their recorded speech samples. Implications of these findings and suggestions for future research are discussed.
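Because the judgments here were four-alternative multiple choice, "better than chance" means exceeding 25% correct. The sketch below shows one way such a comparison could be tested with a one-sided binomial test; the trial counts are hypothetical and this is not the original analysis.

from scipy.stats import binomtest  # requires scipy >= 1.7

n_trials = 900    # hypothetical: 30 listeners x 30 speakers
n_correct = 270   # hypothetical number of correct four-alternative judgments

# One-sided test against the 25% guessing level of a four-alternative task.
result = binomtest(n_correct, n_trials, p=0.25, alternative="greater")
print(f"accuracy = {n_correct / n_trials:.1%}, p = {result.pvalue:.4f}")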
Article
Voice quality variations include a set of voicing sound source modifications ranging from laryngealized to normal to breathy phonation. Analysis of reiterant imitations of two sentences by ten female and six male talkers has shown that the potential acoustic cues to this type of voice quality variation include: (1) increases to the relative amplitude of the fundamental frequency component as open quotient increases; (2) increases to the amount of aspiration noise that replaces higher frequency harmonics as the arytenoids become more separated; (3) increases to lower formant bandwidths; and (4) introduction of extra pole zeros in the vocal-tract transfer function associated with tracheal coupling. Perceptual validation of the relative importance of these cues for signaling a breathy voice quality has been accomplished using a new voicing source model for synthesis of more natural male and female voices. The new formant synthesizer, KLSYN88, is fully documented here. Results of the perception study indicate that, contrary to previous research which emphasizes the importance of increased amplitude of the fundamental component, aspiration noise is perceptually most important. Without its presence, increases to the fundamental component may induce the sensation of nasality in a high-pitched voice. Further results of the acoustic analysis include the observations that: (1) over the course of a sentence, the acoustic manifestations of breathiness vary considerably--tending to increase for unstressed syllables, in utterance-final syllables, and at the margins of voiceless consonants; (2) on average, females are more breathy than males, but there are very large differences between subjects within each gender; (3) many utterances appear to end in a "breathy-laryngealized" type of vibration; and (4) diplophonic irregularities in the timing of glottal periods occur frequently, especially at the end of an utterance. Diplophonia and other deviations from perfect periodicity may be important aspects of naturalness in synthesis.
Article
This paper describes some further attempts to identify and measure those parameters in the speech signal that reflect the emotional state of a speaker. High‐quality recordings were obtained of professional “method” actors reading the dialogue of a short scenario specifically written to contain various emotional situations. Excerpted portions of the recordings were subjected to both quantitative and qualitative analyses. A comparison was also made of recordings from a real‐life situation, in which the emotions of a speaker were clearly defined, with recordings from an actor who simulated the same situation. Anger, fear, and sorrow situations tended to produce characteristic differences in contour of fundamental frequency, average speech spectrum, temporal characteristics, precision of articulation, and waveform regularity of successive glottal pulses. Attributes for a given emotional situation were not always consistent from one speaker to another.
Article
Previous correlational studies have found no relationship between speaker height, weight and speaking fundamental frequency, although it has often been claimed that listeners can correctly identify the height, weight, and bodily build of speakers and that voice pitch is one of the cues used. In this study various social factors were controlled for, and contrasting samples of speech from each subject were analysed. Twelve men and 15 women, drawn from a socially homogeneous group, were asked to read two passages and to phonate the vowel /a:/ at "their lowest attainable pitch." The median speaking fundamental frequency from both passages was calculated and a measure of basal F0 was obtained from the phonation of /a:/. In contrast to other studies, a relationship was found between speaker height and median speaking fundamental frequency, but no relationship was found between speaker weight and F0. The correlation between median speaking fundamental frequency and height was significant only in the male sample and in one passage. Physical and social interpretations for these findings are discussed.
Article
The relationship between age-related changes in body physiology and certain acoustic characteristics of voice was studied in a sample of 48 men representing three chronological age groupings (25-35, 45-55, and 65-75) and two levels of physical condition (good and poor). A fundamental frequency analysis program (SEARP) was used to measure mean fundamental frequency, jitter, shimmer, and phonation range from samples of connected speech and sustained vowel production. Subjects in good physical condition produced maximum duration vowel phonation with significantly less jitter and shimmer and had larger phonation ranges than did subjects of similar chronological ages who were in poor physical condition. These differences were most apparent in the productions of the elderly subjects. While chronological aging is undoubtedly a contributor to such changes in the acoustic characteristics of voice, these results suggest that age-related changes in body physiology, or physiological aging, also must be considered.
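For readers unfamiliar with these acoustic measures, the sketch below computes mean F0, local jitter and local shimmer from per-cycle period and amplitude values. The numbers are invented and the code is a generic illustration, not the SEARP program used in the study.

import numpy as np

# Hypothetical per-cycle measurements from a sustained vowel:
# glottal period in milliseconds and peak amplitude (arbitrary units).
periods = np.array([8.33, 8.41, 8.30, 8.38, 8.35, 8.44, 8.29, 8.36])
amplitudes = np.array([0.82, 0.80, 0.83, 0.79, 0.81, 0.84, 0.80, 0.82])

mean_f0 = 1000.0 / periods.mean()                                         # Hz
jitter = np.mean(np.abs(np.diff(periods))) / periods.mean() * 100         # local jitter, %
shimmer = np.mean(np.abs(np.diff(amplitudes))) / amplitudes.mean() * 100  # local shimmer, %

print(f"mean F0 = {mean_f0:.1f} Hz, jitter = {jitter:.2f}%, shimmer = {shimmer:.2f}%")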
Article
Ambiguity in the interpretation of much current research on interpersonal perception is held to be the result of certain methodological confusions. Several of these conceptual problems are described and suggestions for dealing with them presented. 23 references.
Article
The hypothesis that specific impressions are determined by voice qualities was tested. Specific impressions were defined as ratings on the Potency and Activity factors of the semantic differential. 10 schizophrenics and 11 nonschizophrenics were recorded reading the same passage. Es heard the voices in random order and made (a) semantic-differential ratings and (b) judgments of the readers they perceived as being schizophrenic. 3 of 4 predictions concerning the effect of voice qualities were confirmed. The results indicate the validity of the hypothesis that specific impressions of a speaker's physical characteristics and demeanor are determined by that speaker's voice qualities, and that adjective pairs representing the Potency and Activity factors are sensitive to these differences.
Article
This paper seeks to disentangle some of the many effects which contribute to social perception scores, and to identify separately measurable components." The components of the Accuracy (with which the judge perceives Others) score and of the Assumed Similarity (between the judge and another person) score are discussed in the text and formulated mathematically in an appendix. Illustrations are provided of applications of the model, for the practical use of judgments in the clinic, the school, and elsewhere. Understanding and use of social perception data will be enhanced by "careful subdivision of global measures" and by more explicit theory in order to reduce the investigator's "measures to the genuinely relevant components." 34 references.
Language: The sociocultural context
  • G R Guy
Guy, G. R. (1988). Language and social class. In F. J. Newmeyer (Ed.), Linguistics: The Cambridge survey, Vol. IV: Language: The sociocultural context (pp. 37–63). Cambridge, UK: Cambridge University Press.
Judging personality from voice
  • G W Allport
  • H Cantril
Allport, G. W., & Cantril, H. (1934). Judging personality from voice. Journal of Social Psychology, 5, 37–55.
Effects of aging on speech and voice. Physical and Occupational Therapy in Geriatrics
  • A Caruso
  • P Mueller
  • B B Shadden
Caruso, A., Mueller, P., & Shadden, B. B. (1995). Effects of aging on speech and voice. Physical and Occupational Therapy in Geriatrics, 13, 63–80.
Speech style and social evaluation
  • H Giles
  • N F Powsland
Giles, H., & Powsland, N. F. (1975). Speech style and social evaluation. New York: Academic Press.