Proceedings on Privacy Enhancing Technologies; 2022 (1):6–27
Jacob Leon Kröger*, Leon Gellrich, Sebastian Pape, Saba Rebecca Brause, and Stefan Ullrich
Personal information inference from voice
recordings: User awareness and privacy concerns
Abstract: Through voice characteristics and manner
of expression, even seemingly benign voice recordings
can reveal sensitive attributes about a recorded speaker
(e. g., geographical origin, health status, personality).
We conducted a nationally representative survey in the
UK (n = 683, 18–69 years) to investigate people’s aware-
ness about the inferential power of voice and speech
analysis. Our results show that – while awareness levels
vary between diﬀerent categories of inferred informa-
tion – there is generally low awareness across all par-
ticipant demographics, even among participants with
professional experience in computer science, data min-
ing, and IT security. For instance, only 18.7% of par-
ticipants are at least somewhat aware that physical and
mental health information can be inferred from voice
recordings. Many participants have rarely (28.4%) or
never (42.5%) even thought about the possibility of
personal information being inferred from speech data.
After a short educational video on the topic, partici-
pants express only moderate privacy concern. However,
based on an analysis of open text responses, uncon-
cerned reactions seem to be largely explained by knowl-
edge gaps about possible data misuses. Watching the
educational video lowered participants’ intention to use
voice-enabled devices. In discussing the regulatory im-
plications of our ﬁndings, we challenge the notion of
“informed consent” to data processing. We also argue
that inferences about individuals need to be legally rec-
ognized as personal data and protected accordingly.
Keywords: privacy, voice recording, speech, microphone,
voice assistant, smart speaker, inference attack
Received 2021-05-31; revised 2021-09-15; accepted 2021-09-16.
*Corresponding Author: Jacob Leon Kröger: Weizen-
baum Institute for the Networked Society, Technische Univer-
sität Berlin, Germany, E-mail: email@example.com
Leon Gellrich: Universität Potsdam, Germany
Sebastian Pape: Goethe Universität, Frankfurt, Germany
Saba Rebecca Brause, Stefan Ullrich: Weizenbaum Insti-
tute for the Networked Society, TU Berlin, Germany
1 Introduction
Microphones are found everywhere in today’s
technology-based society. They are embedded not only
into phones, tablets, laptops, electronic toys, cameras,
wearables, and car dashboards but also into intercoms,
baby monitors, remote controls, smart speakers, and
all sorts of other smart home devices. Recordings from
these omnipresent microphones, which contain voice
commands, memos, and private conversations, are often
accessible to a range of diﬀerent parties. For example,
microphones in mobile devices are regularly accessed
by device manufacturers, platform providers, installed
apps, and even by third-party software components
completely invisible to the ordinary user . Private
calls and voice messages can, if not properly encrypted,
be intercepted by instant messaging providers, network
operators, videoconferencing services, and other inter-
mediaries [54, 81, 100]. Similarly, audio data uploaded
to a cloud storage system, social networking site, or
media-sharing platform can be accessible not only to
the audience intended by the user but also to the respec-
tive service and/or infrastructure provider . More
recently, the soaring popularity of voice-enabled de-
vices  and the COVID-19 pandemic with increased
rates of virtual meetings and voice/video calls  have
substantially added to the volume of speech data avail-
able to corporations.
Beyond their legitimate processing purposes, these
organizations may use personal information extracted
from voice recordings for malicious ends or pass it on to
other parties with unknown intentions, potentially ex-
posing users to the risk of discrimination, invasive adver-
tising, extortion, identity theft, and other types of fraud
and abuse [15, 22]. As shown in recent work, attackers
can build a model of a victim’s voice using only a limited
number of voice samples in order to fool voice-based user
authentication systems or to mimic the victim in speech
contexts (e. g., leaving fake voice messages or posting
fake statements on the Internet) . The use of speech
data for unauthorized and unexpected purposes is not
limited to hackers and organized crime groups, but also
practiced by government agencies [81, 100] and technology companies, including major multinational corporations [54, 88].
There is extensive research on the privacy implica-
tions of microphone-equipped devices, with numerous
studies looking into user perceptions and concerns (cf.
Sect. 2.1). However, existing work in this ﬁeld has al-
most exclusively focused on linguistic content, i.e., what
a person says in a voice recording. An equally serious,
yet largely neglected, privacy threat is posed by the
fact that sensitive personal information can be inferred
from how a person speaks. In fact, certain patterns and
correlations in recorded speech can be much more re-
vealing than the words themselves (cf. Sect. 2.2). As we
show in a recent literature review , a speaker’s voice
characteristics and manner of expression can implic-
itly contain information about his or her geographical
origin, age, personality, emotions, level of sleepiness,
physical and mental health condition, and more. So
far, little is known about users’ perceptions of the risks
associated with these possible inferences (cf. Sect. 2.3).
Contribution. To examine privacy concerns related to
voice recordings, we conducted a survey of 683 Inter-
net users in the UK. Our survey is, to our knowledge,
the first to focus specifically on the privacy impacts of audio-based inference attacks. Our main findings are as follows:
– Our results show a widespread lack of awareness
about the inferential power of voice and speech anal-
ysis, with varying levels of awareness for diﬀerent
types of inferences. (Sect. 5.1)
– While we observe diﬀerences in awareness across de-
mographic groups, these diﬀerences are not large.
Even participants with professional experience in
the ICT ﬁeld score low on awareness. (Sect. 5.2)
– Participants’ reactions to audio-based inferences are
quite evenly distributed between worried and un-
worried. An analysis of open text responses oﬀers
insight into their reasoning. (Sect. 5.3)
– Participants’ intention to use voice-controlled vir-
tual assistants signiﬁcantly decreased after a short
educational video on the topic. (Sect. 5.4)
Outline. The paper proceeds as follows. First, Sect. 2
reviews related literature. Then, we describe our re-
search goals (Sect. 3) and methodology (Sect. 4). In
Sect. 5, we present the study results, a discussion of
which follows in Sect. 6. We reﬂect on the limitations
of our study in Sect. 7, before we conclude the paper in Sect. 8.
2 Related work
2.1 Privacy perceptions and concerns
about audio recording
There are a variety of studies on users’ perceptions re-
garding the privacy impacts of microphone-equipped
devices. Aspects that have been investigated include
people’s mental models for a privacy/utility trade-
oﬀ [12, 50, 52], concerns about always-listening de-
vices [12, 13, 35, 52, 55, 102], user trust in providers
of voice-based services [50, 52], concerns about voice
data being shared with third parties or used for other
purposes than stated [60, 91], and concerns around
microphone-equipped devices getting hacked and other
forms of unauthorized access [52, 55, 102].
Studies have also investigated how the context of
use aﬀects the acceptability of audio recordings (e. g., at
work  or in public ). Moorthy and Vu , for ex-
ample, investigated the usage of voice-controlled virtual
assistants and found that participants preferred to use
such devices in private locations to avoid drawing em-
barrassing attention and being overheard by strangers.
Another line of research focuses on users’ awareness
and use of privacy controls for microphone-equipped de-
vices, and the willingness to pay for security and privacy
features [21, 50, 51, 60, 62]. Recent studies suggest that
users are poorly informed about potential privacy risks
and therefore not particularly concerned about embed-
ded microphones in their devices [50, 51, 55, 61, 102].
2.2 Sensor-based inference of personal information
It has long been known that sensor data from consumer
devices can be analyzed to extrapolate patterns and
draw sensitive inferences about the user. The mining of
data to illegitimately gain knowledge about a person is
referred to as an “inference attack” . Some published
review articles summarize data categories that can be
inferred from IoT and mobile sensor data [37, 87, 90],
video game data , accelerometer data , and eye
tracking data [40, 53].
In a recent literature review, we have examined the
wealth of information that can be extracted from voice
recordings . Through the lens of advanced data anal-
ysis, the voice and manner of expression of a recorded
speaker may reveal information about his or her geographical origin [5, 32], gender¹ and age [34, 83], mental
health [4, 73], physical health [17, 18], level of intoxica-
tion [9, 86], moods and emotions [36, 95], sleepiness and
fatigue [17, 86], and personality traits [78, 85].
For example, researchers have used speech errors
and irregularities (e. g., number of false and unintelligi-
ble words, interrupts, hesitations) and rhythmicity fea-
tures for distinguishing alcoholized from non-alcoholized
speech [9, 86], used voice hoarseness and sounds like
coughs and sniﬄes for the detection of sore throats and
ﬂu infections [18, 33], and used voice pitch variations
and speech energy levels for automatic emotion recog-
nition (e. g., anger, compassion, disgust, happiness, sur-
prise) [33, 36, 95]. A large variety of features, including
speaking rate, loudness, spectral features and character-
istics of linguistic expression, has been applied for the
inference of personality traits [78, 85].
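To make these feature types concrete, the sketch below computes two of the simplest low-level descriptors named above, frame-wise RMS energy (a loudness measure) and zero-crossing rate (a crude voicing/pitch proxy), from a synthetic tone standing in for a voice recording. This is an illustrative NumPy-only example, not the pipeline of any cited study; the function name and parameters are our own, and real systems combine far richer spectral and prosodic feature sets.

```python
import numpy as np

def frame_features(signal, frame_len=1024, hop=512):
    """Per-frame RMS energy and zero-crossing rate -- two of the
    low-level acoustic features fed into speaker-attribute models."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))                   # loudness proxy
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)   # voicing/pitch proxy
        feats.append((rms, zcr))
    return np.array(feats)

# Synthetic stand-in for a voice recording: a 200 Hz tone plus noise.
sr = 16000
t = np.arange(sr) / sr
signal = 0.5 * np.sin(2 * np.pi * 200 * t) + 0.01 * np.random.randn(sr)

feats = frame_features(signal)
print(feats.shape)  # (30, 2): one (energy, zcr) pair per frame
```

A classifier for, say, emotion or intoxication would be trained on statistics of such frame-level features rather than on the raw waveform.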
While such algorithmic predictions are of course not
always correct and can also be signiﬁcantly impaired
by situational factors (e. g., ambient noise, reverbera-
tion, microphone quality), remarkable accuracies have
already been reported. Polzehl , for instance, reached
85.2% accuracy in classifying speakers into ten diﬀerent
personality classes along the Big Five traits (openness to
experience, conscientiousness, extraversion, agreeable-
ness, neuroticism). From voice energy features, Sadjadi
et al.  estimated the age of speakers with a mean ab-
solute error of 4.7 years. In computational paralinguis-
tics challenges, researchers have achieved up to 91% ac-
curacy in speech-based intoxication detection . And,
while mental health insights can also be derived from
acoustic characteristics (e. g., monotone speech, glot-
tal features) , Bedi et al.  analyzed semantic co-
herence and speech complexity to automatically detect
emergent psychosis in test subjects, reaching the classi-
ﬁcation accuracy of traditional clinical interviews.
Findings from such experimental studies usually re-
fer to speciﬁc experimental setups and demographic
groups (e. g., “prediction of major depression in adoles-
cents” ) and are subject to limitations (e. g., labora-
tory conditions, limited sample size). Therefore, while
providing evidence that speech-based inferences in their
respective ﬁeld are possible in principle, their speciﬁc
¹ Note that people’s personal experience and internal under-
standing of their gender can vary from the gender socially as-
signed to them based on reproductive organs, anatomy, chromo-
somes, etc. Most existing work in the ﬁeld under investigation
focuses on inferring people’s self-reported gender.
ﬁndings are not necessarily generalizable to all people
and all real-life situations.
It should be noted, on the other hand, that data con-
trollers with access to speech data (e. g., large tech cor-
porations) can be much better equipped in terms of bud-
get, technical expertise and training data than the re-
searchers cited above, meaning that the risk of undesired
inferences from audio data is likely bigger in real life
than it appears based on published results. Since data
analysis methods are often subject to non-disclosure
agreements, the most advanced know-how in voice and
speech analysis arguably rests within the industry and is
not publicly available . It can also be assumed that
the variety and eﬀectiveness of audio-based inference
attacks will further increase with the rising popularity
of voice-enabled devices  and continuing advances
in computing technologies and audio analysis methods
(e. g., feature optimization  and deep learning ap-
proaches ). Therefore, despite the remaining tech-
nological challenges and limitations, such attacks pose a
real and growing threat to consumer privacy that needs
to be taken seriously and thoroughly investigated.
Existing products and features, such as voice-
analytics tools for hiring assessment , call cen-
ters  and for illness detection based on smart speaker
voice commands , indicate that companies intend to
use speech data to draw sensitive inferences about users.
2.3 User awareness and perceptions about
sensor-based inference attacks
There is little research on users’ knowledge of the in-
ferential power of mobile and IoT sensor data. A few
previous studies have investigated people’s privacy con-
cerns associated with inferences that can be drawn from
physiological sensors (e. g., ECG and respiration) ,
smartphone motion sensors [16, 61], in-home sensing
applications [14, 103], and IoT systems for companies
(e. g., sensors for room occupancy monitoring and en-
ergy consumption tracking) . Existing research indi-
cates a widespread lack of awareness about sensor-based
inference attacks [16, 61, 103]. Mehrnezhad et al. 
found that smartphone users are not aware that vari-
ous mobile sensors can be exploited to infer a personal
identiﬁcation number (PIN) typed on the touchscreen.
In general, the researchers observed very low levels of
understanding about the existence and functioning of
embedded sensors. When presented with examples of
personal data inference, users’ privacy concerns tend
to increase [16, 61, 80]. However, among the partici-
pants surveyed by Crager et al. , perceptions about
inference attacks signiﬁcantly varied between diﬀerent
demographic groups. In contrast to our study, none of
these research eﬀorts focuses speciﬁcally on audio data.
And in existing studies on privacy concerns about voice
recordings, threats posed by inferential analytics have
been largely overlooked [50, 51, 55, 64].
3 Research goals
When the words spoken in a voice recording directly
contain private information (e. g., expression of political
opinion, credit card details, health-related information),
the sensitivity of the recording is obvious. What has not
been suﬃciently explored in previous research, however,
is people’s awareness of the wealth of information that
can be inferred from a speaker’s voice characteristics
and manner of expression (cf. Sect. 2). To ﬁll the identi-
ﬁed research gap, this study examines users’ awareness
and concerns regarding audio-based inference attacks,
the vulnerability of diﬀerent demographic groups to this
privacy threat, and whether informing people on this is-
sue changes their intentions towards using microphone-
equipped devices. In examining these issues, we focus on
eight types of audio-based inferences that have emerged
from our literature search (cf. Sect. 2.2). For analysis,
we group these inferences into three clusters: inferences
about demographics (DEM), short- and medium-term
states (STATE), and physical and psychological traits
(TRAIT). DEM includes geographical origin, gender
and age. STATE includes level of intoxication, moods
and emotions, sleepiness and fatigue. TRAIT includes
mental health, physical health, and personality traits.
Note that this paper makes no claim to be exhaustive.
For instance, inferences could also be drawn from back-
ground sounds in voice recordings (e. g., media, urban
or animal sounds)  or ultrasonic signals , which
can be privacy-sensitive but were not included as they
do not directly relate to users’ voice and speech charac-
teristics. We pose the following research questions:
RQ-1: How aware are people that personal informa-
tion can be inferred from voice recordings?
Since humans tend to perform quite well in estimat-
ing certain speaker attributes based on voice features
in everyday life (e.g., age, gender, emotions, intoxica-
tion) [49, 65, 84], they may better understand the possi-
bility of automated inferences in these areas (compared
to diagnosing a speciﬁc mental disorder based on speech
features, for example). Human perception of voice iden-
tity is particularly related to a speaker’s age and gen-
der [6, 48, 65]. Thus, we postulate:
H1: The level of awareness is higher for DEM inferences
than for STATE inferences, and lowest for TRAIT inferences.
RQ-2: How does the level of awareness differ across demographic groups?
This question aims to identify at-risk populations by
correlating awareness levels with participant demo-
graphics. All sorts of domain-related knowledge, tech-
nical understanding and privacy experience could assist
in understanding the possibilities of modern data ana-
lytics. Advanced age, on the other hand, has been asso-
ciated with lower degrees of ICT literacy . We also
explore the relationship between awareness and partici-
pants’ income, gender, and disposition to value privacy.
For lack of clear indications in the literature, these were
tested without directional hypothesis. We postulate:
H2.1: Awareness is positively correlated with previous privacy experience, general privacy awareness,
innovativeness (i. e., a person’s tendency to be a
technology pioneer), general level of education, and
with professional experience in data protection law,
computer science, data mining, and IT security.
H2.2: Awareness and age are negatively correlated.
H2.3: There are relationships between awareness and
income, gender, and disposition to value privacy.
RQ-3: What concerns do people have about the infer-
ence of personal information from voice recordings?
Physical and psychological traits are more stable over
time and may thus reveal more about a person’s life and
character than temporary state variables. By contrast,
basic demographic information is widely perceived as
relatively non-sensitive  and is often already entered
during sign-up to a digital service. Thus, we postulate:
H3: The level of concern is higher for TRAIT inferences
than for STATE inferences, and lowest for DEM inferences.
RQ-4: How do people’s usage intentions for voice-
enabled devices change when they are informed about this issue?
For this question, while audio data can be recorded with
any kind of microphone-equipped device, we decided
to focus on voice-controlled virtual assistants (VCVAs).
This choice was made with regard to the soaring pop-
ularity of services like Amazon Alexa and Apple’s Siri
across the globe  and because user voice commands
are usually available to the service providers in unen-
crypted form [54, 88], enabling them to screen the audio
for revealing patterns and correlations. Based on pre-
vious research suggesting that privacy concerns are an
important factor aﬀecting the adoption of voice-enabled
devices [29, 64], we postulate:
H4: The educational intervention will have a negative
impact on VCVA usage intention.
4 Research methodology
Our four research questions were investigated by means
of an online survey. After a brief overview of our study
design, this section will provide detailed descriptions
of our survey instrument (Sect. 4.1), participant re-
cruitment process (Sect. 4.2), characteristics of our
sample (Sect. 4.3) and methods used for data analy-
sis (Sect. 4.4).
Participants’ awareness that personal information
can be inferred from voice recordings (RQ-1) was stud-
ied based on self-reported measures. While alternative
approaches exist (cf. Sect. 7), we asked about partici-
pants’ awareness after showing them a short educational
video² on the topic, as inspired by Crager et al. . To
identify potential at-risk populations with particularly
low levels of awareness (RQ-2), we included multiple
demographic items in the survey and then correlated
the results with participants’ reported levels of aware-
ness. To explore participants’ concerns about personal
information inference from voice recordings (RQ-3), we
queried their reactions to the educational video through
rating scales and one open text question.
To be able to test whether our video had an ef-
fect on participants’ interest in using voice-enabled de-
vices (RQ-4), we decided to use two slightly diﬀerent
² Video clip available here: https://youtu.be/Gr22YqS1_VA.
questionnaires in our study (Grp-A and Grp-B). In
Grp-A, participants were asked about their usage in-
tention twice – once before, and once after the educa-
tional video, thus allowing a within-subject compari-
son to examine the eﬀect of the intervention. However,
repeating one question within a questionnaire may in-
troduce a bias: Participants’ answers to the repeated
question could be inﬂuenced by their previous response
to the same question. Therefore, to create a control
group, participants in Grp-B were asked about their
usage intention only once (after the video). By com-
paring results from the post-intervention question in
Grp-A and Grp-B using a Kolmogorov–Smirnov test, we
checked whether repeating the question in Grp-A sub-
stantially aﬀected participants’ responses. Besides un-
derpinning the validity of the within-subject compari-
son among Grp-A participants, this approach allowed
us to conduct a between-subject comparison between
the pre-intervention question in Grp-A and the post-
intervention question in Grp-B, providing additional in-
sight into the impact of our educational video on par-
ticipants’ intention to use voice-enabled devices.
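As a minimal sketch of the bias check described above (with simulated Likert responses; only the group sizes match our sample), SciPy’s two-sample Kolmogorov–Smirnov test can compare the post-intervention answer distributions of the two groups:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Simulated 5-point Likert answers to the post-video usage-intention
# question (1 = not at all interested, 5 = very interested).
grp_a_post = rng.integers(1, 6, size=349)  # Grp-A saw the question twice
grp_b_post = rng.integers(1, 6, size=334)  # Grp-B saw it once (control)

# If repeating the question biased Grp-A's second answer, the two
# post-intervention distributions should differ noticeably.
stat, p = ks_2samp(grp_a_post, grp_b_post)
print(f"KS statistic = {stat:.3f}, p = {p:.3f}")
```

A non-significant result (p above the chosen alpha) would support the validity of the within-subject comparison in Grp-A.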
4.1 Survey instrument
Both questionnaires, Grp-A and Grp-B, consist of one
educational video² and 53 questions, including three at-
tention checks. The video clip and all questions are ex-
actly the same in Grp-A and Grp-B – only one ques-
tion is repeated in Grp-A, as will be detailed below.
The questionnaires were programmed with the software
SoSci Survey (version 3.2.19) . All responses captur-
ing the intensity of feelings or level of agreement were
measured on 5-point Likert-type scales. To allow for re-
producibility, a copy of all survey items can be found in
appendix A. It took participants a median of 8.5 min-
utes to complete our survey. The questionnaires con-
tained the following:
–9 privacy demographic items: Using validated
scales directly adapted from Xu et al. , partici-
pants were asked about their general privacy aware-
ness (PA), disposition to value privacy (DVP), and
previous privacy experience (PPE). Results from
these items were used in answering RQ-2.
–2 items on voice-controlled virtual assistants
(VCVA): To ensure a common level of understand-
ing among participants, this section was started
with a short textual deﬁnition of VCVA, includ-
ing examples of common features. Participants were
then asked (i) how often they use a VCVA in daily
life, and (ii) to what extent they are interested in
starting or continuing to use a VCVA. As explained
in the beginning of Sect. 4, the position of ques-
tion (ii) varied between Grp-A and Grp-B for the
purpose of inter-group comparison and bias control.
In questionnaire Grp-A, the question was posed be-
fore and after the educational video, in Grp-B it
was posed only once after the video. Items from this
block were used in answering RQ-4.
–1 educational video²: As preparation for ques-
tions on this topic, participants were presented with
a short informational video (1:44 minutes) about
audio-based inferences. The video explains that, for
certain functions, microphone-equipped devices typ-
ically transmit voice recordings to remote company
servers, where the audio data can be analyzed to ex-
tract various kinds of personal information. Based
on previous research (cf. Sect. 2.2), the video lists
categories of data that could be derived from a
speaker’s voice characteristics and manner of ex-
pression, namely geographical origin,gender and
age,mental health,physical health,level of intoxi-
cation,moods and emotions,sleepiness and fatigue,
and personality traits. To ensure that the audio
track is audible and the video is watched through to
the end, we included a preliminary sound check and
displayed a code at the end of the video which was
requested on the following page. If the code was not
entered correctly, the survey was terminated and the
corresponding participant was excluded from analysis.
–20 items on awareness and concerns regard-
ing audio-based inference attacks: After the
video, the participants were asked (i) whether they
had been aware that such inferences are possible,
(ii) how concerned they are about the possibility of
such inferences, (iii) how often they have consciously
thought about this issue before, (iv) how common
they think it is for companies to draw such infer-
ences from voice recordings, and (v) how concerned
they are about individual categories of inferred in-
formation. Items (i) and (iii) were used in answering
RQ-1, the other items were used in answering RQ-3.
–10 technology demographic items: Adapting a
9-item scale from Parasuraman and Colby , this
section measures the participants’ level of innova-
tiveness (INNO), i. e., the tendency to be a technol-
ogy pioneer. Then, the participants were asked to se-
lect from a list all types of microphone-equipped de-
vices they own. INNO was used in answering RQ-2,
the other question was used to provide descriptive statistics.
–10 items on basic demographics and profes-
sional experience: Participants were queried for
their age, gender, net income, and level of educa-
tion. Also, they were asked to specify their level
of professional experience in the areas of data pro-
tection law, computer science, data mining, and IT
security. These items were used in answering RQ-2.
Three attention checks were incorporated in the survey
to screen for random responders and potential bot sub-
missions (cf. questions 8, 16, 21 in appendix A). Be-
fore the actual online survey was conducted, we ad-
ministered a pretest to a total of 58 participants us-
ing the crowdsourcing platform Amazon Mechanical Turk
(https://www.mturk.com/). In this way, we were able to
test our attention checks and reﬁne the survey instru-
ments, including a clariﬁcation of potentially ambiguous
wording. Based on the pretest results, only minor adjustments were necessary.
Our survey instruments and research procedures
were approved by the Ethics Committee at Goethe Universität Frankfurt.
4.2 Participant recruitment
To access a sample of UK adults, we used the ser-
vices of the online market research ﬁrm respondi AG
(https://www.respondi.com/EN/) which was carefully
selected from a list of ten competing panel providers
and fulﬁls the quality management system standards
of ISO 20252 . Although crowdsourcing platforms,
such as MTurk and Proliﬁc, oﬀer several beneﬁts in
terms of cost eﬃciency, speed, and ﬂexibility, we fa-
vored the option of hiring a panel company for sev-
eral reasons. Above all, while recent studies have ob-
tained high-quality results from crowdsourced sam-
ples [58, 82], there are widespread concerns about gen-
eralizability [11, 74, 98]. Signiﬁcant diﬀerences between
MTurk workers and general population estimates were
found in family composition, political attitudes, and re-
ligiosity , level of education and health behavior ,
social engagement , and internet activity , to
name a few examples.
According to our requirements, the sample for our
study was designed to approximate the age and gender
distribution of adults (18-69 years) from the latest UK
census , which also resembles current population estimates from the UK Office for National Statistics .

Table 1. Participant demographics (Grp-A: n = 349; Grp-B: n = 334)
Age group | Grp-A male | Grp-A female | Grp-B male | Grp-B female
18-29 | 39 (11.2%) | 40 (11.5%) | 39 (11.7%) | 40 (12.0%)
30-39 | 35 (10.0%) | 37 (10.6%) | 32 (9.6%) | 33 (9.9%)
40-49 | 34 (9.7%) | 41 (11.7%) | 36 (10.8%) | 39 (11.7%)
50-59 | 33 (9.5%) | 30 (8.6%) | 31 (9.3%) | 30 (9.0%)
60-69 | 32 (9.2%) | 28 (8.0%) | 26 (7.8%) | 28 (8.4%)
Total | 173 (49.6%) | 176 (50.4%) | 164 (49.1%) | 170 (50.9%)
Given a desired power of 95% and an estimated eﬀect
size of 0.3, an a priori power analysis revealed a required
sample size of around 200 participants. Considering the
explorative nature of the study and our available re-
sources, we collected valid responses from n = 683 par-
ticipants. Survey completers received a small compen-
sation according to the terms of our panel provider. The
survey was conducted between June 4 and July 1, 2020.
4.3 Sample characteristics
In total, 1,277 participants signed up for the survey. 588
responses were excluded for being incomplete, either be-
cause the participant had closed the questionnaire be-
fore answering all survey questions (n=235), or because
the participant had failed to pass one of our attention
checks (n=353). Additionally, six responses were eliminated due to obviously poor data quality, as assessed by independent raters. Our analysis is based on
the remaining ﬁnal sample of 683 participants. The age
of participants ranges from 18 to 69 years (µ = 42.99, σ = 14.50), with 50.7% being female. A breakdown of
the age and gender distribution for both test groups is
provided in Table 1. 99% of participants report owning
at least one microphone-equipped device (95% smart-
phone, 79% laptop, 54% tablet, 36% smart speaker, 20%
voice-enabled remote control, 14% in-vehicle voice control interface, 13% smartwatch). All participants are UK residents.
4.4 Data analysis
Statistical analysis. While all scales in our question-
naire are treated as parametric, we expected – due to
the nature of the subject and based on existing lit-
erature – that results throughout the survey would
be highly skewed (e. g., because related work indicates
low awareness levels for sensor-based inference attacks,
cf. Sect. 2.3). After visually checking the histograms,
Shapiro-Wilk tests conﬁrmed that the survey results for
the used scales are not normally distributed (p <0.001).
Thus, we used non-parametric tests for comparative analyses.
The Friedman test [23, p. 686ﬀ] was used as a non-
parametric alternative to a repeated-measures ANOVA.
To test the diﬀerence between means of dependent
variables, post-hoc Wilcoxon signed-rank tests with
Bonferroni-corrected alpha [23, p. 914] were used as
a non-parametric alternative to paired t-tests. To test
the diﬀerence between means of independent variables,
a Wilcoxon rank-sum test³ [23, p. 655ff] was used as a
non-parametric alternative to a two-sample t-test. For
correlation analysis, since the commonly used Pearson
correlation coeﬃcient (Pearson’s r) requires normal dis-
tribution of the sample data when attempting to estab-
lish whether the correlation coeﬃcient is signiﬁcant [23,
p. 219], we used Spearman’s rank correlation coeﬃcient
(Spearman’s ρ) [23, p. 223ﬀ] instead. To obtain ordinal
variables suitable for analysis, the variable income was
clustered and the variable education was recoded (e.g.,
master’s degree above bachelor’s degree), as shown in
the published dataset .
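As an illustration of this test battery, the following sketch applies the same sequence of non-parametric tests to simulated awareness ratings using SciPy. The data, and hence the printed results, are hypothetical, and the study itself used R; this only shows how the tests fit together.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 683  # sample size as in the study; the ratings themselves are simulated

# Skewed 5-point awareness ratings for the three inference clusters
# (repeated measures: the same participants rate all three).
dem = np.clip(rng.poisson(2.0, n) + 1, 1, 5)
state = np.clip(rng.poisson(1.5, n) + 1, 1, 5)
trait = np.clip(rng.poisson(1.0, n) + 1, 1, 5)

# Shapiro-Wilk: a small p-value indicates non-normality,
# motivating the non-parametric tests below.
print("Shapiro-Wilk p:", stats.shapiro(dem).pvalue)

# Friedman test as the omnibus test across dependent measures.
print("Friedman:", stats.friedmanchisquare(dem, state, trait))

# Post-hoc pairwise Wilcoxon signed-rank tests, Bonferroni-corrected.
alpha = 0.05 / 3
for a, b, name in [(dem, state, "DEM vs STATE"),
                   (dem, trait, "DEM vs TRAIT"),
                   (state, trait, "STATE vs TRAIT")]:
    p = stats.wilcoxon(a, b).pvalue
    print(name, "significant" if p < alpha else "n.s.")

# Spearman's rho for correlations (no normality assumption).
age = rng.integers(18, 70, n)
rho, p_rho = stats.spearmanr(age, dem)
print(f"Spearman rho = {rho:.3f}")
```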
We further conducted regression analyses based on
the Akaike information criterion (AIC), using forward
selection and backward elimination procedures. The
software environment R (version 4.0.0) was used for
statistical data analysis.
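The AIC-based selection procedure can be sketched as follows (a plain-NumPy illustration with hypothetical predictors, not the authors' R code; forward selection greedily adds whichever candidate lowers the AIC most and stops when no candidate helps):

```python
import numpy as np

def aic(y, X):
    """Gaussian AIC of an OLS fit: n*log(RSS/n) + 2k, k = number of columns."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    n, k = X.shape
    return n * np.log(rss / n) + 2 * k

def forward_select(y, candidates):
    """Greedily add the predictor that lowers AIC most; stop when none does."""
    chosen, pool = [], dict(candidates)
    X = np.ones((len(y), 1))             # start from the intercept-only model
    best = aic(y, X)
    while pool:
        scores = {name: aic(y, np.column_stack([X, col]))
                  for name, col in pool.items()}
        name = min(scores, key=scores.get)
        if scores[name] >= best:
            break                        # no candidate improves the AIC
        best, X = scores[name], np.column_stack([X, pool.pop(name)])
        chosen.append(name)
    return chosen

# Hypothetical predictors and outcome (coefficients chosen arbitrarily)
rng = np.random.default_rng(1)
n = 300
exp_dm = rng.normal(size=n)
exp_cs = rng.normal(size=n)
irrelevant = rng.normal(size=n)
aw = 0.25 * exp_dm + 0.15 * exp_cs + rng.normal(scale=0.5, size=n)

selected = forward_select(aw, {"EXP_DM": exp_dm, "EXP_CS": exp_cs,
                               "irrelevant": irrelevant})
```

Backward elimination works analogously, starting from the full model and dropping the predictor whose removal lowers the AIC most.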
Qualitative thematic analysis. Responses to the
open text question (cf. appendix A, №11) were evaluated
using a thematic analysis as proposed by Braun and
Clarke, which is a method for identifying patterns of
meaning within qualitative data.
After familiarizing himself with the data, a ﬁrst
rater systematically assigned descriptive and interpre-
tative codes to all features in the data with potential
relevance to the question posed. The resulting codebook
was then used by a second rater to independently la-
bel and categorize the received responses, adding new
codes where deemed appropriate. We used Cohen’s
kappa coeﬃcient to measure inter-rater reliability. All
instances of discrepancy were discussed and resolved
jointly by the two raters. The assigned codes were
then used to identify frequent responses and overarching
themes (cf. Sect. 5.3).
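For reference, Cohen's kappa compares observed agreement with the agreement expected by chance; a small sketch on made-up codes (not the study's codebook):

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """kappa = (p_o - p_e) / (1 - p_e): observed vs. chance agreement."""
    assert len(rater1) == len(rater2)
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[label] * c2[label]
              for label in set(rater1) | set(rater2)) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Two raters labelling ten responses with codes from a shared codebook;
# they disagree on exactly one response
r1 = ["fear", "trust", "fear", "benefit", "fear",
      "trust", "benefit", "fear", "trust", "fear"]
r2 = ["fear", "trust", "fear", "benefit", "trust",
      "trust", "benefit", "fear", "trust", "fear"]
kappa = cohens_kappa(r1, r2)
```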
³ Also known as the Mann–Whitney U test.
5 Results
We collected n = 683 complete and valid survey re-
sponses (n = 349 for Grp-A, n = 334 for Grp-B). In
terms of age and gender distribution, Grp-A and Grp-B
are approximately identical, both being nationally rep-
resentative for the UK population between 18 and 69
years. In this section, we analyze the survey responses
with respect to the research questions introduced in
Sect. 3. We have released an annotated and sanitized
dataset containing our results for all participants.
5.1 RQ-1. How aware are people that
personal information can be inferred
from voice recordings?
Our results presented in Fig. 1 indicate widespread un-
awareness of inferences that can be drawn from voice
and speech parameters. Averaged over the eight types
of inferences covered in our questionnaire, 67.6% of par-
ticipants reported to be “not at all” or only “slightly”
aware. We observed, however, that the level of aware-
ness strongly diﬀers between the individual inference
categories. For example, while 48.2% of participants re-
ported to be “somewhat”, “quite” or “very” aware about
the possibility of inferring a speaker’s gender and age
based on a voice recording, this ﬁgure drops to 18.7%
for the inference of physical and mental health information.
For a statistical analysis of these diﬀerences, we
compared the three clusters of inferences deﬁned in
Sect. 3, namely inferences about demographics (DEM),
short- and medium-term states (STATE), and phys-
ical and psychological traits (TRAIT). A Friedman
test [23, p. 686ﬀ] yielded signiﬁcant diﬀerences in
awareness levels between DEM, STATE, and TRAIT
(χ2 = 425.53, p <0.001, Kendall’s W = 0.312).
Post-hoc Wilcoxon [23, p. 667ﬀ] signed-rank tests with
Bonferroni-corrected alpha [23, p. 914] revealed that all
three pair-wise comparisons are signiﬁcant (p <0.001).
In conﬁrmation of hypothesis H1, the test results
show that the level of awareness is higher for DEM
inferences than for STATE inferences, and lowest for
TRAIT inferences. We obtained a moderate eﬀect size
for the DEM-STATE comparison (0.342) and large ef-
fect sizes for the STATE-TRAIT (0.520) and DEM-
TRAIT (0.707) comparisons. Post-hoc power analysis
revealed that these tests had a very high power (> 99%).
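For reference, Kendall's W follows directly from the Friedman statistic, and the pairwise effect sizes r from the standardized Wilcoxon statistic; a short sketch of the arithmetic (the Z value below is illustrative, since only the resulting effect sizes are reported):

```python
import math

# Kendall's W (effect size of the Friedman test): W = chi2 / (N * (k - 1))
chi2, n_participants, k_conditions = 425.53, 683, 3
kendalls_w = chi2 / (n_participants * (k_conditions - 1))  # ~0.312, as reported

# Effect size r for a Wilcoxon signed-rank test: r = |Z| / sqrt(N)
def effect_size_r(z, n):
    return abs(z) / math.sqrt(n)

r_example = effect_size_r(-8.9, 683)  # illustrative Z, not a reported value
```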
[Fig. 1 chart: stacked bars per inferred-information category (gender & age; sleepiness & fatigue; level of intoxication; moods & emotions; further state variables (STATE) and trait variables (TRAIT)), each split into “not at all” / “slightly” / “somewhat” / “quite” / “very aware”.]
Fig. 1. Distribution of participants’ level of awareness dependent
on the inferred information
Before taking the questionnaire, a large portion of
participants has “never” (42.5%) or “rarely” (28.4%)
consciously thought about the possibility of personal
information being inferred from speech data. Only a
small minority reports to have pondered on this issue
“often” (7.0%) or “very often” (2.8%).
5.2 RQ-2. How does the level of
awareness diﬀer across demographic groups?
To explore statistical relationships between the aware-
ness levels from RQ-1 and participant demographics, we
ﬁrst calculated three awareness scores for each partic-
ipant: AW_DEM (avg. awareness about DEM infer-
ences), AW_STATE (avg. awareness about STATE
inferences), and AW_TRAIT (avg. awareness about TRAIT inferences).
We then tested for correlations between participant
demographic attributes and AW_DEM, AW_STATE,
and AW_TRAIT using Spearman’s rank correlation
coeﬃcient (Spearman’s ρ) [23, p. 223ﬀ]. An overview
of the correlation results, along with their Bonferroni-
corrected signiﬁcance levels, is provided in Table 6 in ap-
pendix B. Note that for all correlations for which we had
a directional hypothesis (cf. Sect. 3), the conﬁdence intervals
are one-sided, meaning they end at 1.00 or –1.00, respectively.
Supporting our hypotheses H2.1 and H2.2,
AW_DEM, AW_STATE, and AW_TRAIT are negatively
correlated with participant age and positively correlated
with general level of education, degree of innovativeness
(INNO), previous privacy experience (PPE), general
privacy awareness (PA), and with professional experience
in data protection law (EXP_DP), computer science
(EXP_CS), data mining (EXP_DM), and IT security
(EXP_IS). Across AW_DEM, AW_STATE, and
AW_TRAIT, the most notable correlations were found
with EXP_CS, EXP_DM, and EXP_IS. However, using
guidelines from Dancey and Reidy to interpret the
results, even these are weak correlations.

Table 2. Regression results for AW_DEM
Coeﬃcients Estimate Std. Error t-value p-value
(Intercept) 1.72 0.34 5.15 3.85e-07
age –0.01 0.00 –2.73 0.00653
gender –0.30 0.11 –2.67 0.00798
EXP_DM 0.23 0.09 2.70 0.00712
EXP_CS 0.17 0.07 2.54 0.01132
INNO 0.13 0.07 1.97 0.04914
PA 0.11 0.07 1.41 0.15857
Adjusted R2: 0.1532

Table 3. Regression results for AW_STATE
Coeﬃcients Estimate Std. Error t-value p-value
(Intercept) 0.62 0.26 2.52 0.01208
EXP_DM 0.17 0.08 2.11 0.03482
EXP_CS 0.16 0.06 2.79 0.00558
PA 0.28 0.08 3.60 0.00036
DVP –0.12 0.06 –1.90 0.05805
PPE 0.13 0.08 2.11 0.03546
Adjusted R2: 0.1499

Table 4. Regression results for AW_TRAIT
Coeﬃcients Estimate Std. Error t-value p-value
(Intercept) 0.58 0.27 2.19 0.02912
age –0.01 0.00 –1.98 0.04854
EXP_DM 0.25 0.07 3.62 0.00033
EXP_CS 0.14 0.05 2.87 0.00431
PA 0.15 0.06 2.56 0.01070
PPE 0.10 0.06 1.49 0.13707
Adjusted R2: 0.1954

AW_: average level of awareness about audio-based in-
ference of demographics (DEM) / short- and medium-
term states (STATE) / physical and psychological traits
(TRAIT); INNO: innovativeness; PA: privacy aware-
ness; DVP: disposition to value privacy; PPE: previ-
ous privacy experience; EXP_: professional experience
in data mining (DM) / computer science (CS)
The correlation test results further indicate that
men tend to have slightly higher AW_TRAIT and
AW_STATE (but not AW_DEM) than women. No
signiﬁcant correlation was found between participants’
awareness and their disposition to value privacy (DVP)
or level of income, contradicting hypothesis H2.3.
While the observed tendencies suggest that some
population groups are somewhat less informed than oth-
ers (and thus potentially more prone to unwittingly
reveal sensitive information about themselves through
speech data), there seems to be little awareness regard-
ing audio-based inferences throughout all population
segments. Even among those participants who reported
“much” or “very much” EXP_CS (n = 120), EXP_IS
(n = 82), or EXP_DM (n = 60), a large portion (49.4%,
45.9%, and 43.1%, respectively) stated to be “not at all”
or only “slightly” aware (averaged over the eight types
of inferences covered in our questionnaire).
For the continuous demographic variables (INNO,
PA, DVP, PPE, EXP_DP, EXP_DM, EXP_CS,
EXP_IS, gender, age), we additionally conducted re-
gression analyses based on the Akaike information cri-
terion (AIC) to test their predictive power on aware-
ness. After forward selection and backward elimina-
tion procedures, we obtained models for AW_DEM,
AW_STATE, and AW_TRAIT, with ﬁve to six predictors
each and adjusted R2 ranging from 15% to 19.5%.
The results are shown in Tables 2, 3 and 4. For in-
stance, an increase of EXP_DM by 1 point resulted
in increased AW_DEM by 0.23, AW_STATE by 0.17,
and AW_TRAIT by 0.25. The only predictors that were
consistently signiﬁcant across the three models were
EXP_CS and EXP_DM, the other predictors were less
relevant and varied between the models. The low R2
results indicate that only a small portion of the variance
in awareness can be explained by the demographic vari-
ables from our survey. This supports our above conclu-
sion that all demographic groups under investigation are
similarly vulnerable to audio-based inference attacks,
i.e. similarly prone to disclose more information about
themselves via speech data than they expect.
Regarding the internal consistency of constructs
adapted from the literature (cf. Sect. 4.1), we obtained
a good or excellent Cronbach’s alpha for INNO (0.93),
DVP (0.85), and PA (0.81), according to interpretation
guidelines provided by George and Mallery. For
PPE, we obtained a Cronbach’s alpha of 0.66, indicating
a questionable internal consistency. The lower
Cronbach’s alpha for PPE is mainly driven by the sec-
ond item of the construct (cf. appendix A, №3) which
shows relatively low consistency with the other items of
PPE. Thus, the second item should receive special at-
tention in future uses of the construct and may require
improvement and re-validation.
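Cronbach's alpha relates the sum of the item variances to the variance of the summed scale; a brief NumPy sketch on synthetic item scores (not the study's data) illustrates how one inconsistent item drags the coefficient down:

```python
import numpy as np

def cronbachs_alpha(items):
    """items: (n_respondents, k_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the sum score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(2)
latent = rng.normal(size=(500, 1))
# Four items driven by one latent trait -> high internal consistency
consistent = latent + 0.5 * rng.normal(size=(500, 4))
# Swap one item for pure noise -> alpha drops, as seen for PPE's second item
mixed = consistent.copy()
mixed[:, 1] = rng.normal(size=500)

alpha_consistent = cronbachs_alpha(consistent)
alpha_mixed = cronbachs_alpha(mixed)
```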
5.3 RQ-3. What concerns do people have
about the inference of personal
information from voice recordings?
After watching a short educational video on the privacy
implications of voice and speech analysis (cf. Sect. 4.1),
participants were asked how worried they are about the
possibility of personal information inference from speech
data. Responses to this question were mixed. While
38.7% of participants reported to be “not at all” or
“slightly” worried, a similar proportion (40.7%) stated
to be “quite” or “very” worried.
The average participant believes that it is common
rather than exceptional for companies to infer personal
information from voice recordings. When asked to es-
timate the prevalence of this practice, 12.6% selected
“somewhat” or “very” uncommon, 48.0% were “undecided”,
and 38.4% selected “somewhat” or “very” common.
In an open text question, we asked the participants
to provide a reasoning for their reported level of concern.
Except for a handful of cases, all participants oﬀered an
intelligible response. The responses were evaluated by
two raters using the thematic analysis method proposed
by Braun and Clarke, as described in Sect. 4.4. The
number of assigned codes per response varies between
1 and 6. The coding yielded a Cohen’s Kappa of 0.83,
indicating a high degree of inter-rater reliability. Af-
ter instances of discrepancy were discussed and resolved
jointly by the two raters, the assigned codes were used
to identify overarching themes. Along with the complete
dataset of our study, we have released the resulting
codebook, including all identiﬁed themes.
In the following, we will summarize our ﬁndings
regarding the most salient themes, namely partici-
pants’ emotional reactions (Sect. 5.3.1), feared data mis-
uses (Sect. 5.3.2), perceived beneﬁts and inevitability of
voice-based technology (Sect. 5.3.3) as well as participants’
knowledge gaps and misconceptions (Sect. 5.3.4).
Additionally, ﬁndings regarding participants’ concerns
towards speciﬁc types of inferences will be presented
in Sect. 5.3.5. When quoting responses, we will either
state the corresponding number of participants (Ps) or,
if one individual participant is quoted, state the respective
participant ID from the dataset (P1 to P683).
5.3.1 Emotional reactions
Approximately half of the open text responses illustrate
or emphasize negative feelings. For instance, the infer-
ential power of voice and speech analysis is perceived
as “alarming” (2 Ps), “frightening” (2 Ps), “unnerv-
ing” (2 Ps), “unsettling” (2 Ps), “shocking” (2 Ps), “dis-
turbing” (3 Ps), “uncomfortable” (4 Ps), “scary” (8 Ps),
“concerning” (8 Ps), “Big Brother[ish]” (12 Ps), “wor-
rying” (14 Ps), and “intrusive” (15 Ps). Partially, the
negative reactions are quite strong, showing that con-
fronting people with this issue can elicit “a great sense
of helplessness” (P361).
Participants are surprised at the variety of possible
inferences, stating that they “hadn’t considered” (P535),
“didn’t realise” (3 Ps), had “no idea” (2 Ps) that “all
this information could be revealed [. . . ] by a voice
recording” (P374). In some cases, participants even ex-
press amazement about the possibilities of modern voice
and speech analysis, e. g. describing them as “fascinat-
ing” (2 Ps) or “far beyond what I dreamed” (P411). Oth-
ers express confusion, e. g. by stating, “I don’t quite un-
derstand how they could possibly get this information
from voice alone” (P56). Participant P441 concludes with
the words: “The world is a lot cleverer than we realise.”
On the other hand, there are also participants who
state to be completely indiﬀerent about the privacy
implications of voice recordings. “If I like [a technol-
ogy], I don’t care about side eﬀects”, says P399, and P182
claims she “couldn’t care less what information people
have on me.”
5.3.2 Feared data misuses
Participants express concern that microphone-equipped
devices may collect more data than required for their
functionality, and that the collected data might be used
for “unrelated purposes” (P460) without the user’s consent
or awareness. While companies usually provide
privacy policies and terms of use, customers “rarely
read them carefully or understand their
implications fully” (P678). Feared types of data pro-
cessing and data misuse include targeted advertising
to shape “political views/consumption habits” (P524 ),
data-based discrimination by insurances and employers
(12 Ps) as well as “fraud” (2 Ps) and “identity theft”.
Further, there is concern that information extracted
from voice recordings could end up in the “wrong
hands” (6 Ps) by being passed on or leaked to third
parties, such as aﬃliate companies, hacker groups, or
governmental agencies. Opposition is not only directed
against criminal data use and governmental surveillance
but also explicitly against “using very personal informa-
tion for commercial purposes” (P39 ).
Additional doubts and worries are expressed over
the accuracy of inference algorithms. “A lot depends
on how this information is interpreted”, says P671 . In-
ferences are feared to be “inaccurate and presump-
tive” (P642) and “taken out of context” (P644), leading
to “assumptions being made that aren’t actually true
for the individual” (P26).
5.3.3 Perceived beneﬁts and inevitability of voice-based technology
Despite their concerns, some participants perceive the
disclosure of sensitive personal data as a necessary
trade-oﬀ for using modern technology. “Unfortunately,
I feel this is just the way the world is heading”, says
P559. Others agree: “companies have been collecting
data for years” (P476) and “there is little we can do
about it” (P664 ).
There are also responses speciﬁcally focusing on
the beneﬁcial uses of microphone-equipped technology,
e. g., for creating “convenient products” (P129), support-
ing “security and crime-ﬁghting services” (P137), target-
ing “adverts to sell me products/services that may as-
sist” (P303) or “alerting medical services if someone is in
danger due to physical or mental health issues” (P493).
Voice control is perceived by some as “an evolutionary
step in how we and our children will interact with de-
vices” (P529) which will “improve humanity” (P413 ) and
be used “for the greater good” (P502). While they also
see potential downsides, optimistic participants are con-
ﬁdent that “the beneﬁts far outweigh the negatives at
the moment” (P610) and that privacy loss is a “small
price to pay for more convenient products” (P129 ).
5.3.4 Knowledge gaps and misconceptions
It is striking that – although we speciﬁcally asked for
their reasoning – none of the unworried participants
provided a solid justiﬁcation for their reported lack
of privacy concern. Instead, their responses reveal po-
tentially dangerous, yet understandable, misconceptions
and false senses of security. For instance, unconvinced
by our educational video, some participants do not be-
lieve that the presented audio-based inferences are tech-
nically feasible at all. While the sources and arguments
compiled in Sect. 2.2 suggest otherwise, participants’
disbelief in a short educational video on a complex and
unfamiliar topic is of course an understandable reaction.
Other participants do not see how data extracted
from voice recordings could be used against their inter-
est. “I really don’t care as I have no idea how this in-
formation could be used to my detriment”, states P164.
And P447 asks: “Why would I be worried? I have nothing
to hide.” The nothing-to-hide argument, which was put
forth by many participants, has been criticized for its
narrow view on privacy and for ignoring various threats
that can arise from personal data being available to
malicious or negligent parties.
Some participants explain that they are not worried
because they do not own a voice-controlled device, such
as a smart speaker. As exempliﬁed in Sect. 1 and illus-
trated in our educational video, audio data (e. g., voice
messages, voice memos, voice calls) can be recorded,
analyzed and transmitted to remote servers by a wide
variety of devices – not only by voice-controlled de-
vices. Even living entirely without microphone-equipped
devices would not guarantee protection against audio-
based inference attacks, as a person’s voice can – inten-
tionally or unintentionally – be recorded by other peo-
ple’s devices (cf. Sect. 6.2). It should be noted, however,
that our video focuses on direct user-device interaction
and puts a slight emphasis on voice-controlled devices
to prepare participants for questions related to RQ-4,
which could be a source of misunderstanding.
Finally, a few participants base their sense of secu-
rity on the assumption that their data will always be
stored securely and only used sparingly and responsi-
bly. For example, they doubt that any information ex-
tracted from voice recordings “would be used to identify
me personally” (P396 ), trusting that such data “would
be in an anonymous format anyway” (P110 ), whereas in
reality this is often not the case. Others express conﬁ-
dence that their “privacy settings do the job” (P211) and
that companies would not use data “negatively against
me” (2 Ps), “for truly bad purposes” (P449 ) or “in any
negative ways” (P353 ). Participant P654 states: “Given
the [. . . ] general consensus of privacy violations being
bad for business, I don’t worry too much about inappro-
priate use.” We also received vague and confusing state-
ments along this line, such as “It doesn’t trace it back
to me personally as they will never meet me” (P499) or
“I assume the Internet has protection in place” (P427).
In reality, however, companies can clearly leak or
exploit personal data in harmful ways and commonly
share such data with a range of third parties (cf. Sect. 1).
In light of the above observations, unworried reactions
among participants appear to be largely explained by
knowledge gaps about possible data misuses.
5.3.5 Concerns towards speciﬁc types of inferences
We also asked the participants how concerned they
would be if a company used voice recordings to in-
fer speciﬁc types of information about them without
their awareness. The results are shown in Fig. 2. As
can be seen at ﬁrst glance, the reported level of concern
considerably varies between the information categories.
For instance, while 60.8% of participants reported to
be “quite” or “very” concerned about the disclosure of
mental health information, only 31.3% of participants
showed the same level of concern about inferences on
their gender and age.
For a statistical analysis of these diﬀerences, we
again compared the clusters deﬁned in Sect. 3, namely
inferences about demographics (DEM), short- and
medium-term states (STATE), and physical and psy-
chological traits (TRAIT). A Friedman test yielded sig-
niﬁcant diﬀerences in concern levels between these clus-
ters (χ2= 386.87, p <0.001, Kendall’s W = 0.283).
Post-hoc Wilcoxon signed-rank tests [23, p. 667ﬀ] with
Bonferroni-corrected alpha revealed that all three pair-
wise comparisons are signiﬁcant (p <0.001).
The test results conﬁrm hypothesis H3 by show-
ing that the level of concern is lowest for DEM infer-
ences, followed by STATE inferences, and highest for
TRAIT inferences. We obtained a moderate eﬀect size
for the DEM-STATE comparison (0.409) and large ef-
fect sizes for the DEM-TRAIT (0.643) and STATE-
TRAIT (0.513) comparisons. Post-hoc power analysis
revealed that these tests had a very high power (> 99%).
Even in their response to the open text question (cf.
Sect. 5.3), some participants have focused their concern
on speciﬁc data categories (e. g., P519 : “Health [data]
isn’t really something you want to be shared without
consent”), while other types of inferred data, such as
age, gender and level of intoxication, were rarely men-
tioned at all. Analogous to the knowledge gaps noted in
Sect. 5.3.4, the variation in concern levels between
diﬀerent types of inferences could indicate a lack of
awareness or understanding of how certain data categories
can be used against the data subject.
[Fig. 2 chart: stacked bars per inferred-information category (gender & age; sleepiness & fatigue; level of intoxication; moods & emotions; further state variables (STATE) and trait variables (TRAIT)), each split into “not at all” / “slightly” / “somewhat” / “quite” / “very concerned”.]
Fig. 2. Distribution of participants’ level of concern dependent on
the inferred information
5.4 RQ-4. How do people’s usage
intentions for voice-enabled devices
change when being informed on the privacy
implications of voice and speech analysis?
To examine whether our educational intervention had
an eﬀect on the intention to use a voice-controlled vir-
tual assistant (VCVA), we conducted a between- and a
within-subject comparison, as explained in Sect. 4.
First, we compared responses from Grp-A
(n = 349), where participants were asked about their
VCVA usage intention (VCVA_UI) before the inter-
vention, with responses from Grp-B (n = 334), where
the same question was asked after the intervention.
While a Wilcoxon rank-sum test [23, p.655ﬀ] showed
a signiﬁcant diﬀerence between VCVA_UI in Grp-A
and Grp-B in the direction predicted by hypothesis H4
(p <0.05), i.e. lower VCVA_UI after the intervention,
it only yielded a very small eﬀect size (0.086). Given the
small eﬀect size, post-hoc power analysis also revealed
low power (29%). The small size of the eﬀect is not too
surprising, as the expected eﬀect may have been masked
by those Grp-A participants who already reported no
interest in using a VCVA before the intervention and
thus left no room for a reduction in interest.
We then conducted a within-subject comparison in
Grp-A. Participants in this group were asked about their
VCVA_UI twice, once before and once after the inter-
vention. Of course, repeating a question within a ques-
tionnaire may introduce a bias, as noted in Sect. 4.
However, a Kolmogorov–Smirnov test revealed no sig-
niﬁcant diﬀerence (p = 0.52) between the distribu-
tion of responses to our post-intervention questions on
VCVA_UI in Grp-A and Grp-B (despite the large sample
sizes), which is consistent with both samples stemming
from the same distribution and suggests that repeating
the question in Grp-A did not substantially aﬀect the
distribution.
Thus, we proceed with the within-subject comparison.
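The distribution check described above can be sketched with SciPy's two-sample Kolmogorov–Smirnov test (the 5-point responses below are synthetic stand-ins for the two groups' post-intervention answers):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Both groups draw post-intervention answers from the same 5-point distribution
probs = [0.25, 0.25, 0.2, 0.2, 0.1]   # hypothetical answer frequencies
grp_a = rng.choice([1, 2, 3, 4, 5], size=349, p=probs)
grp_b = rng.choice([1, 2, 3, 4, 5], size=334, p=probs)

# A non-significant result means we cannot reject that both samples stem
# from the same distribution (cf. the p = 0.52 reported above)
ks_stat, p_value = stats.ks_2samp(grp_a, grp_b)
```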
By comparing results from the pre- and post-
intervention items on VCVA_UI within Grp-A, we an-
alyze whether the intervention had a signiﬁcant ef-
fect on VCVA_UI among Grp-A participants. To avoid
the masking eﬀect described above, we excluded those
participants from the analysis who reported to be
“not at all” or only “slightly” interested in using a
VCVA before watching the educational video, leaving
n = 151 participants for the analysis (Grp-A.1). Within
Grp-A.1, a Wilcoxon signed-rank test [23, p. 667ﬀ] not
only yielded a signiﬁcant decrease in VCVA_UI af-
ter the intervention (p <0.001) but also a large eﬀect
size (0.590). Here, post-hoc power analysis revealed very
high power (> 99%).
These results suggest that for people with a medium
or high interest in using such devices, information on
the privacy implications of voice and speech analysis
can have a strong negative eﬀect on usage intentions,
thus supporting hypothesis H4. Further research is re-
quired to investigate causes and motivational aspects
behind these observations, and to determine whether
these changes in reported usage intention sustain beyond
the short term and actually translate into shifts in
actual usage behavior.
Table 5 shows VCVA_UI before and after the in-
tervention, ∆VCVA_UI as well as awareness and con-
cern levels, averaged over Grp-A and Grp-A.1. It can
be seen that Grp-A and Grp-A.1 yield similar results
for concern and awareness, indicating that Grp-A.1 is
a representative subgroup of Grp-A. The similarity fur-
ther indicates that those participants who reported to
be “not at all” or only “slightly” interested in using a
VCVA before watching the educational video (and thus
excluded in Grp-A.1) showed little interest not because
they are more concerned than the rest of Grp-A but,
e. g., because they simply have no use for a VCVA.
In Grp-A.1, we additionally tested for correla-
tions between the observed change in VCVA_UI
(∆VCVA_UI) and the participants’ level of concern
and awareness about the possibility of audio-based in-
ference of demographics (DEM), short- and medium-
term states (STATE), and physical and psychologi-
cal traits (TRAIT). We found that concerns are posi-
tively correlated with ∆VCVA_UI (p <0.001; Spear-
man’s ρ= 0.38 for DEM, 0.42 for STATE, 0.40 for
Table 5. Mean values for Grp-A and Grp-A.1
Grp-A (n = 349) Grp-A.1 (n = 151)
VCVA_UI (pre) 2.47 3.97
VCVA_UI (post) 2.11 3.20
∆VCVA_UI –0.36 (–14.6%) –0.77 (–19.4%)
AW_DEM 2.47 2.50
AW_STATE 2.08 2.03
AW_TRAIT 1.60 1.68
CON_DEM 2.62 2.53
CON_STATE 3.18 3.06
CON_TRAIT 3.55 3.40
VCVA_UI: VCVA usage intention before (pre) and af-
ter (post) the intervention; ∆VCVA_UI: diﬀerence be-
tween VCVA_UI pre and post intervention; AW_: level
of awareness about audio-based inference of demographics
(DEM) / short- and medium-term states (STATE) / physical
and psychological traits (TRAIT); CON_: level of
concern about DEM / STATE / TRAIT inferences
TRAIT). The self-reported level of awareness about
audio-based inferences was negatively correlated with
∆VCVA_UI (DEM: p <0.01, Spearman’s ρ=−0.23;
STATE: p <0.01, Spearman’s ρ=−0.21; TRAIT:
p<0.05, Spearman’s ρ=−0.21). This means that
VCVA_UI of participants with higher levels of prior
knowledge about audio-based inferences tends to be less
aﬀected by our educational intervention. This is not
surprising, as the intervention should logically have a
higher educational impact on those participants who
knew less on the topic beforehand.
6 Discussion and implications
The results of our survey reveal that there is widespread
lack of understanding about the possibilities of voice and
speech analysis. For instance, 81.3% of participants are
not at all or only slightly aware that physical and men-
tal health information can be inferred from a recorded
speaker’s voice characteristics and manner of expres-
sion. Only 9.8% of participants have often or very often
consciously thought about the possibility of personal
information being inferred from voice and speech parameters.
Our results, oﬀering a novel contribution by specif-
ically focusing on voice recordings, are consistent with
previous ﬁndings indicating a lack of awareness about
inference attacks based on other types of sensor data [16,
61, 103]. While our analysis shows that awareness varies
signiﬁcantly depending on participants’ demographic
attributes, which was previously also shown to be the
case for motion sensor-based inference attacks, the
average level of awareness is low across all demographic
groups – even among participants with professional ex-
perience in ICT (cf. Sect. 5.2).
In our sample, the degree of worry regarding audio-
based inferences is quite evenly distributed between high
and low. Results from analyzing open text responses
suggest, however, that unconcerned reactions are largely
explained by knowledge gaps about the risks that can
arise from privacy intrusions (cf. Sect. 5.3.4). While
some participants express worry about unauthorized
data leakage to third parties, speciﬁc types of data mis-
use, and about being misrepresented by audio-based in-
ferences (cf. Sect. 5.3.2), our results conﬁrm previous
ﬁndings about people’s unwarranted trust in compa-
nies’ data practices [50, 52, 102, 103] and a widespread
nothing-to-hide mentality [50, 51, 91, 99, 103], poten-
tially resulting in a false sense of security.
At the same time, the reported level of concern
varies signiﬁcantly between diﬀerent categories of in-
ferred data (e. g., high concern about inferred health
information vs. low concern about inferred age – cf.
Sect. 5.3.5), which may be due to individual preferences
but perhaps also indicates a lack of understanding on
how certain data categories can be used against the data subject.
Our educational intervention on the privacy impli-
cations of voice and speech analysis had a signiﬁcant
negative impact on participants’ intention to use voice-
controlled virtual assistants. This result aligns with pre-
vious ﬁndings that users’ privacy concerns tend to in-
crease when they are presented with examples of per-
sonal data inference [16, 61, 80].
6.1 Consumer education and technical safeguards
While Internet-connected microphones have many bene-
ﬁcial applications (e. g., eﬃcient human-computer inter-
action, assistance for physically disabled people, smart
home convenience, driver safety), their increasing ubiq-
uity in modern life calls for a debate on potential social
ramiﬁcations. Besides the already omnipresent micro-
phones in smartphones, laptops and other mobile de-
vices, the number of installed smart speakers is forecast
to reach 640 million globally by 2024. Nothing is
fundamentally wrong with either microphone-equipped
devices or speech data mining, but there is clearly a
need for appropriate privacy safeguards.
Educating people on existing threats is an impor-
tant starting point – not only to support informed pur-
chase decisions but also to put critical pressure on the
societal actors responsible for protecting consumer pri-
vacy in sensing devices.
With regard to data collection transparency in
voice-controlled devices, there has been a focus on de-
vice recording modes, such as “speech-activated”, “man-
ually activated”, and “always on” [12, 28]. In the face of
recurring security breaches and privacy scandals, users
have not only been advised to use the mute feature of
their voice-controlled devices but also been encouraged
to disconnect power supply or even purposely obfuscate
audio signals to protect themselves against corporate
and governmental eavesdropping.
However, while there are good reasons to be con-
cerned about always-listening devices, it is important
to understand that the mentioned safeguards – even if
eﬀectively applied in practice – will not prevent audio-
based inference attacks (unless, of course, they perma-
nently block the microphone and prevent any record-
ing). As discussed in this paper, voice and speech char-
acteristics can unexpectedly carry sensitive personal in-
formation, which may later be extracted via advanced
data analytics (cf. Sect. 2.2). Thus, even if a voice assis-
tant is only consciously unmuted by a user to ask for the
weather forecast, for instance, this can already lead to
unwanted information leakage (e. g., based on sociolect,
accent, intonation, pitch, loudness, or a hoarseness in
the user’s voice).
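To make the leakage potential of such low-level cues concrete, two of the acoustic features mentioned above – loudness and pitch – can already be estimated from raw samples in a few lines of code. The following is an illustrative sketch only (not part of our study, and far simpler than production feature extractors):

```python
import math

def rms_loudness(samples):
    """Root-mean-square amplitude: a crude proxy for perceived loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def zero_crossing_pitch(samples, sample_rate):
    """Very rough fundamental-frequency estimate from zero crossings
    (a clean periodic signal crosses zero twice per period)."""
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a < 0) != (b < 0))
    duration_s = len(samples) / sample_rate
    return crossings / (2 * duration_s)

# One second of a synthetic 200 Hz tone as a stand-in for voiced speech
rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * t / rate) for t in range(rate)]
print(zero_crossing_pitch(tone, rate))  # ~200 (Hz)
print(rms_loudness(tone))               # ~0.354 (0.5 / sqrt(2))
```

Real inference systems combine dozens of such features with trained models, but even these two values would suffice to coarsely characterize a speaker across recordings.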
To minimize privacy risks, voice recordings should
preferably be encrypted before any upload or Internet
transfer, and the data processing should take place as
much as possible locally on the user’s device. In cases
where the disclosure of speech data to service providers
is unavoidable (e.g., because the data is necessary for
service functionality or due to resource constraints of
the end device), measures should be taken to prevent
the illegitimate inference of personal information.
Some technical approaches that could help to de-
fend against audio-based inference attacks are diﬀer-
ential private learning, hashing techniques for speech
data, fully homomorphic inference systems, and speaker
de-identiﬁcation by voice transformation [67, 68]. In re-
cent work, for example, Alouﬁ et al. [2, 3] have pro-
posed privacy-preserving intermediate layers to sanitize
user voice input before sharing it with cloud service
providers. These approaches, which are based on the au-
tomatic identiﬁcation and obfuscation of sensitive fea-
tures in speech data, have yielded promising evaluation
results for certain use cases, such as protection against
unwanted emotion [2, 3] and gender recognition, while maintaining the utility of the data for speech and speaker recognition.
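As a toy illustration of the voice-transformation idea (not the actual method of the cited systems): naive linear-interpolation resampling shifts the apparent pitch of a recording when it is played back at the original rate, masking the speaker's fundamental frequency. Real de-identification pipelines use far more sophisticated transformations that preserve duration and intelligibility.

```python
import math

def resample(samples, ratio):
    """Linear-interpolation resampling. Played back at the original
    sample rate, the output's pitch (and duration) is scaled by `ratio`
    -- a crude stand-in for speaker-de-identifying voice transformation."""
    out, pos = [], 0.0
    while pos < len(samples) - 1:
        i = int(pos)
        frac = pos - i
        out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
        pos += ratio
    return out

rate = 8000
voiced = [math.sin(2 * math.pi * 120 * t / rate) for t in range(rate)]
shifted = resample(voiced, 1.4)  # ~120 Hz shifted towards ~168 Hz
```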
Where possible without compromising the required
functionality, voice recordings should also be tran-
scribed to text in order to preserve task-relevant infor-
mation while removing speaking speed, rhythm, voice
characteristics, etc. and thus reduce the risk of inference
attacks. In their proposed Preech system for privacy-
preserving speech transcription, Ahmed et al. apply
voice transformation and the injection of noise to obfus-
cate users’ voice biometrics and thus prevent unautho-
rized identiﬁcation and impersonation.
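The noise-injection step can be sketched in a few lines. The noise level below is an arbitrary illustrative value; the cited Preech system additionally applies voice transformation and careful noise shaping so that transcription accuracy is preserved.

```python
import random

def inject_noise(samples, noise_std=0.05, seed=42):
    """Add white Gaussian noise to an audio signal to mask fine-grained
    voice-biometric detail before the recording leaves the device.
    `noise_std` trades speaker privacy against transcription accuracy
    (hypothetical parameter, for illustration only)."""
    rng = random.Random(seed)  # fixed seed only for reproducibility here
    return [s + rng.gauss(0.0, noise_std) for s in samples]

silence = [0.0] * 1000
noisy = inject_noise(silence)
```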
6.2 Regulatory implications
Considering that (i) existing technical solutions for pro-
tecting against sensor-based inference attacks have se-
vere limitations [68, 87] and are still seen as “embry-
onic research topics”, (ii) companies obviously need
strong incentives to apply privacy-enhancing technolo-
gies, (iii) many users are not willing to pay for privacy and their willingness to pay depends on their trust in the provider of the privacy-enhancing technology, and (iv) there is – as our study underscores –
a very low level of risk awareness among users, adjust-
ments in privacy regulation may be required as well.
To achieve a minimum level of transparency and
oversight, inferences should at least be recognized as
falling within the scope of data protection law. While
the newly introduced California Consumer Privacy Act
(CCPA), for example, speciﬁcally covers “inferences
drawn” as part of its deﬁnition of personal information,
most other data protection laws – including progres-
sive ones, such as EU’s General Data Protection Regu-
lation (GDPR) – do not suﬃciently protect individuals
against undesired inferences. In a detailed legal analysis, Wachter and Mittelstadt state that the GDPR
“focuses primarily on mechanisms to manage the input
side of processing. (. . . ) [T]he few mechanisms in Eu-
ropean data protection law that address the outputs of
processing, including inferred and derived data, proﬁles,
and decisions, are far weaker.”
Data protection law could, for example, make it
mandatory for companies to provide comprehensive in-
formation on all types of inferences that they (attempt
to) draw from collected personal data. Given that data
mining algorithms are becoming an increasingly accu-
rate and eﬃcient access path to personal information,
this could be a sensible measure.
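Such a transparency obligation could, for instance, take the form of a machine-readable disclosure published alongside a privacy policy. The schema, field names, and company below are purely hypothetical:

```python
import json

# Hypothetical machine-readable "inference disclosure": one entry per
# attribute the controller attempts to infer from collected voice data.
disclosure = {
    "controller": "ExampleVoice Ltd.",  # fictitious company
    "source_data": ["voice recordings"],
    "inferences_drawn": [
        {"attribute": "age range", "purpose": "ad targeting", "opt_out": True},
        {"attribute": "emotional state", "purpose": "service tuning", "opt_out": False},
    ],
}
print(json.dumps(disclosure, indent=2))
```

A standardized format of this kind would let regulators and auditing tools compare inference practices across companies automatically.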
For data subjects, being able to answer the question
“who knows what about me?” is a necessary precondi-
tion for exercising other data protection rights (e. g.,
data rectiﬁcation, erasure, restriction of processing) in
an informed manner. The widespread lack of under-
standing of how personal data can be collected, inferred,
and misused calls into question the notion of “informed
consent” and may warrant some form of paternalistic
government intervention. As we argue in other recent
work, people’s privacy choices are typically irrational,
involuntary, and/or easily circumventable.
Accordingly, various commentators have proposed
a legal shift from the individualistic paradigm of no-
tice and consent (“privacy self-management”) towards
an increased focus on the ethical and social impacts of
personal data use (e. g., [56, 71, 89, 97]). For instance,
a general legal prohibition of using certain categories of
personal data for ethically indefensible purposes based
on the resulting harm potential could be helpful to pro-
tect consumers from the consequences of their own unawareness.

Another argument against the self-management approach is that it ignores the various externalities that
individual privacy choices have on other people and so-
ciety at large. In today’s interconnected world, peo-
ple often share personal data of other people, giving rise
to the notion of “interdependent privacy”. Owners of
microphone-equipped devices can become amateur data
controllers without the data subject’s knowledge or con-
sent. For example, someone might record a phone con-
versation and share it with a third party. Furthermore,
a user’s device can record the voice, activities, etc. of
persons in the vicinity (e.g., relatives, friends, visitors,
bystanders), potentially scaling up the inference prob-
lem by a signiﬁcant factor.
6.3 Limitations

While surveys are widely used in related empirical studies [16, 61, 79, 80], this form of data collection is subject
to several potential limitations.
There is of course the risk of careless or random
responding. We incorporated multiple attention checks
into our survey to ﬁlter out low-quality responses. Only
those respondents who passed all quality and attention
checks were included in the analysis (cf. Sect. 4.3).
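The filtering step amounts to keeping only those respondents whose every attention check passed. A minimal sketch, with illustrative field names that are not the actual export format of our survey platform:

```python
# Illustrative response records; "attention_checks" holds the pass/fail
# outcome of each embedded check for one respondent.
responses = [
    {"id": 1, "attention_checks": [True, True], "answers": {"q1": 4}},
    {"id": 2, "attention_checks": [True, False], "answers": {"q1": 2}},
    {"id": 3, "attention_checks": [True, True], "answers": {"q1": 5}},
]

# Keep only respondents who passed every attention check
clean = [r for r in responses if all(r["attention_checks"])]
print([r["id"] for r in clean])  # [1, 3]
```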
Furthermore, a self-reported survey captures sub-
jective perceptions, which are prone to distortion. In
particular, following an approach proposed by Crager
et al., we asked participants for their awareness of
audio-based inferences after showing them a short edu-
cational video on the topic. It may have been diﬃcult for
some participants to accurately recall what their level
of knowledge was prior to watching the video.
A possible alternative would be to ask participants
about awareness before showing the video (e. g., by ask-
ing how likely they think diﬀerent types of inferences
are). Even this approach, however, may evoke thoughts
that participants would not have by themselves in ev-
eryday life. Moreover, participants with low levels of
knowledge and skills may have a tendency to overes-
timate their abilities (a cognitive bias referred to as
the Dunning-Kruger effect). Therefore, future work
could build upon this study by using approaches that
query the knowledge of participants more implicitly and
objectively, instead of using self-reported measures of awareness.

Additionally, learning from our study’s limitations,
follow-up studies should thoroughly test participants’
understanding of educational materials (e. g., in the
form of a quiz). It is possible that participants did not
understand everything in the video.
It is also possible that participants exaggerated cer-
tain responses in an attempt to present themselves in a
more positive light, e. g., by stating that they are more
familiar with technology than they actually are, by over-
stating their professional experience in some area, or by
falsely claiming to be less interested in using a virtual
assistant after our educational intervention. This eﬀect
may have been increased by asking participants about
privacy attitudes at the beginning of the survey, prim-
ing them to think about privacy. We cannot exclude
the possibility of a social-desirability bias but believe we have minimized the risk of its occurrence through the neutral framing of our educational video and by informing
participants in advance that the results of our online
survey would be completely anonymous. Asking about
privacy attitudes at the end would not have eliminated
the issue of priming because, in this case, the privacy
focus of the remaining survey may have inﬂuenced par-
ticipants’ responses to these questions.
It should also be noted that our ﬁndings are only
representative for the UK population, which is a typi-
cal WEIRD society (Western, Educated, Industrialized,
Rich, Democratic). Replication studies in other con-
texts, such as in Asian or African countries, are required
to establish cultural validity.
7 Conclusion

Microphones have become ubiquitous in modern life,
embedded into mobile, wearable, and all sorts of smart
home devices. While these devices provide useful func-
tions, the increasing availability of private voice record-
ings to service providers, device manufacturers, app
vendors, etc. has also become a major threat to con-
sumer privacy. In this study, focusing on an issue that
has received very little research attention to date, we
investigated people’s awareness and privacy concerns
about the wealth of personal information that can be
inferred by analyzing a recorded speaker’s voice charac-
teristics and manner of expression. Our results indicate
a widespread lack of awareness about the possibilities
of modern voice and speech analysis. Averaged over the
eight types of inferences covered in our questionnaire,
most participants reported being “not at all” (50.0%) or
only “slightly” (17.6%) aware. Even participants with
professional experience in the ICT ﬁeld scored low on
awareness. Furthermore, while our results for partici-
pants’ level of concern about audio-based inference at-
tacks do not show a clear tendency, many participants
– judging from their text responses – seem to lack the
background knowledge required to assess these threats
in an informed manner. Overall, the ﬁndings of this
study underscore that the complexities of modern data
processing are beyond the comprehension of ordinary
users – which calls into question the notion of “informed
consent,” a cornerstone of most modern data protection
laws, including EU’s GDPR. To prevent consent from
being used as a loophole to excessively reap data from
unwitting individuals, alternative and complementary
technical, organizational, and regulatory safeguards ur-
gently need to be developed. At the very least, inferred
information relating to an individual should be classi-
ﬁed as personal data by law, subject to corresponding
protections and transparency rights. Results from our
within- and between-subject comparisons suggest that
education on data analytics may have an impact on
smart device use, the mechanisms and implications of
which are an interesting avenue for future research.
Acknowledgments

We thank the anonymous reviewers for their constructive feedback. This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

References
 Shimaa Ahmed, Amrita Roy Chowdhury, Kassem Fawaz,
and Parmesh Ramanathan. 2020. Preech: A system for
privacy-preserving speech transcription. In 29th USENIX
Security Symposium. 2703–2720.
 Ranya Alouﬁ, Hamed Haddadi, and David Boyle. 2019.
Emotionless: privacy-preserving speech analysis for voice
assistants. preprint arXiv:1908.03632 (2019).
 Ranya Alouﬁ, Hamed Haddadi, and David Boyle. 2020.
Privacy-preserving Voice Analysis via Disentangled Repre-
sentations. In ACM SIGSAC Conference on Cloud Comput-
ing Security Workshop. 1–14.
 Gillinder Bedi et al. 2015. Automated analysis of free
speech predicts psychosis onset in high-risk youths. npj
Schizophrenia 1 (2015), 15030.
 Hamid Behravan, Ville Hautamäki, Sabato Marco Sinis-
calchi, Tomi Kinnunen, and Chin-Hui Lee. 2015. I-Vector
modeling of speech attributes for automatic foreign accent
recognition. IEEE/ACM Transactions on Audio, Speech,
and Language Processing 24, 1 (2015), 29–41.
 Pascal Belin, Shirley Fecteau, and Catherine Bedard. 2004.
Thinking the voice: neural correlates of voice perception.
Trends in cognitive sciences 8, 3 (2004), 129–135.
 Gergely Biczók and Pern Hui Chia. 2013. Interdependent
privacy: Let me share your data. In Int. Conf. on Financial
Cryptography and Data Security. Springer, 338–353.
 Jordan M Blanke. 2020. Protection for ‘Inferences Drawn’:
A Comparison Between the General Data Protection Reg-
ulation and the California Consumer Privacy Act. Global
Privacy Law Review 1, 2 (2020).
 Daniel Bone, Ming Li, Matthew P Black, and Shrikanth S
Narayanan. 2014. Intoxicated speech detection: A fusion
framework with speaker-normalized hierarchical functionals
and GMM supervectors. Computer Speech & Language 28,
2 (2014), 375–391.
 Virginia Braun and Victoria Clarke. 2012. Thematic Analy-
sis. In APA Handbook of Research Methods in Psychology.
American Psychological Association, Washington, DC.
 Jesse Chandler, Cheskie Rosenzweig, Aaron J Moss,
Jonathan Robinson, and Leib Litman. 2019. Online panels
in social science research: Expanding sampling methods be-
yond Mechanical Turk. Behavior research methods 51, 5
 Varun Chandrasekaran, Kassem Fawaz, Bilge Mutlu, and
Suman Banerjee. 2018. Characterizing Privacy Percep-
tions of Voice Assistants: A Technology Probe Study.
arXiv:1812.00263 [cs] (2018).
 Chola Chhetri and Vivian Genaro Motti. 2019. Eliciting
Privacy Concerns for Smart Home Devices from a User
Centered Perspective. In Information in Contemporary So-
ciety, Natalie Greene Taylor et al. (Eds.). Springer, Cham,
 Eun Kyoung Choe, Sunny Consolvo, Jaeyeon Jung, Beverly
Harrison, Shwetak N. Patel, and Julie A. Kientz. 2012.
Investigating Receptiveness to Sensing and Inference in the
Home Using Sensor Proxies. In UbiComp. ACM, 61–70.
 Wolﬁe Christl. 2017. How Companies Use Data Against
People. Cracked Labs, Vienna.
 Kirsten Crager, Anindya Maiti, Murtuza Jadliwala, and
Jibo He. 2017. Information Leakage through Mobile Mo-
tion Sensors: User Awareness and Concerns. In EuroUSEC.
Internet Society, Paris, France.
 Nicholas Cummins, Alice Baird, and Bjoern W Schuller.
2018. Speech analysis for health: Current state-of-the-art
and the increasing impact of deep learning. Methods 151
 Nicholas Cummins, Maximilian Schmitt, Shahin Amiripar-
ian, Jarek Krajewski, and Björn Schuller. 2017. “You sound
ill, take the day oﬀ”: Automatic recognition of speech af-
fected by upper respiratory tract infection. In Proceedings
of the IEEE EMBC Conference. 3806–3809.
 Christine P Dancey and John Reidy. 2007. Statistics with-
out Maths for Psychology. Pearson Education.
 Evan DeFilippis, Stephen Michael Impink, Madison Singell,
Jeﬀrey T Polzer, and Raﬀaella Sadun. 2020. Collaborat-
ing during Coronavirus: The impact of COVID-19 on the
nature of work. National Bureau of Economic Research,
 Serge Egelman, Raghudeep Kannavara, and Richard Chow.
2015. Is This Thing On? Crowdsourcing Privacy Indicators
for Ubiquitous Sensing Platforms. In Proceedings of the
CHI Conference. ACM, New York, 1669–1678.
 Federal Bureau of Investigation. 2020. 2019 Internet Crime Report.
 Andy Field, Jeremy Miles, and Zoë Field. 2012. Discovering
statistics using R. Sage Publishing, Newbury Park, CA.
 Oﬃce for National Statistics. 2013. The National
294277 (last accessed on 12 September 2021).
 Oﬃce for National Statistics. 2019. Population
estimates for the UK, England and Wales, Scot-
land and Northern Ireland: mid-2019. https://
 International Organization for Standardization. 2019. ISO
20252:ed-3:v1:en (last accessed on 12 September 2021).
 D George and P Mallery. 2003. Reliability analysis. SPSS
for Windows, step by step: a simple guide and reference.
Allyn & Bacon, Boston, MA.
 Stacey Gray. 2016. Always On: Privacy Implications of
Microphone-Enabled Devices. https://www.ftc.gov/
 Sangyeal Han and Heetae Yang. 2018. Understanding
adoption of intelligent personal assistants. Industrial Man-
agement & Data Systems 118, 3 (2018), 618–636.
 David Harborth, Maren Braun, Akos Grosz, Sebastian
Pape, and Kai Rannenberg. 2018. Anreize und Hemmnisse
für die Implementierung von Privacy-Enhancing Technolo-
gies im Unternehmenskontext. In Sicherheit 2018. 29–41.
 David Harborth, Xinyuan Cai, and Sebastian Pape. 2019.
Why Do People Pay for Privacy?. In ICT Systems Security
and Privacy Protection – 34th IFIP TC 11 International
 Abhinav Jain, Minali Upreti, and Preethi Jyothi. 2018.
Improved Accented Speech Recognition Using Accent Em-
beddings and Multi-task Learning. In Proc. Interspeech.
 Huafeng Jin and Shuo Wang. 2018. Voice-based determi-
nation of physical and emotional characteristics of users.
 Selen Hande Kabil, Hannah Muckenhirn, et al. 2018. On
Learning to Identify Genders from Raw Speech Signal Us-
ing CNNs. In Proc. Interspeech. 287–291.
 Predrag Klasnja, Sunny Consolvo, Tanzeem Choudhury,
Richard Beckwith, and Jeﬀrey Hightower. 2009. Exploring
Privacy Concerns about Personal Sensing. In International
Conference on Pervasive Computing. Springer, 176–183.
 Shashidhar G Koolagudi, Sudhamay Maity, Vuppala Anil
Kumar, Saswat Chakrabarti, and K Sreenivasa Rao. 2009.
IITKGP-SESC: speech database for emotion analysis. In
International Conference on Contemporary Computing.
 Jacob Kröger. 2019. Unexpected Inferences from Sensor
Data: A Hidden Privacy Threat in the Internet of Things.
In Internet of Things. Information Processing in an Increas-
ingly Connected World, Leon Strous and Vinton G. Cerf
(Eds.). Springer, Cham, 147–159.
 Jacob Leon Kröger, Leon Gellrich, Sebastian Pape,
Saba Rebecca Brause, and Stefan Ullrich. 2021. Response
data - Survey on privacy impacts of voice & speech analy-
 Jacob Leon Kröger, Jens Lindemann, and Dominik Her-
rmann. 2020. How do app vendors respond to subject
access requests? A longitudinal privacy study on iOS and
Android Apps. In International Conference on Availability,
Reliability and Security. 1–10.
 Jacob Leon Kröger, Otto Hans-Martin Lutz, and Florian
Müller. 2020. What Does Your Gaze Reveal About You?
On the Privacy Implications of Eye Tracking. In Privacy
and Identity Management. Data for Better Living: AI and
Privacy, Samuel Fricker, Michael Friedewald, Stephan
Krenn, Eva Lievens, and Melek Önen (Eds.). Springer,
 Jacob Leon Kröger, Otto Hans-Martin Lutz, and Philip
Raschke. 2020. Privacy Implications of Voice and Speech
Analysis – Information Disclosure by Inference. In Privacy
and Identity Management. Data for Better Living: AI and
Privacy, Samuel Fricker et al. (Eds.). Springer, Cham, 242–
 Jacob Leon Kröger, Otto Hans-Martin Lutz, and Stefan
Ullrich. 2021. The myth of individual control: Mapping the
limitations of privacy self-management. https://ssrn.com/
abstract=3881776. SSRN (2021).
 Jacob Leon Kröger and Philip Raschke. 2019. Is my phone
listening in? On the feasibility and detectability of mobile
eavesdropping. In IFIP Annual Conference on Data and
Applications Security and Privacy. Springer, 102–120.
 Jacob Leon Kröger, Philip Raschke, and Towhidur Rahman
Bhuiyan. 2019. Privacy Implications of Accelerometer
Data: A Review of Possible Inferences. In Proceedings of
the 3rd International Conference on Cryptography, Security
and Privacy (ICCSP). ACM, New York, NY, 81–87.
 Jacob Leon Kröger, Philip Raschke, Jessica Percy Camp-
bell, and Stefan Ullrich. 2021. Surveilling the Gamers:
Privacy Impacts of the Video Game Industry. https:
//ssrn.com/abstract=3881279. SSRN (2021).
 Justin Kruger and David Dunning. 1999. Unskilled and
unaware of it: How diﬃculties in recognizing one’s own
incompetence lead to inﬂated self-assessments. Journal of
Personality and Social Psychology 77, 6 (1999), 1121.
 John Krumm. 2007. Inference Attacks on Location Tracks.
In Pervasive Computing, Anthony LaMarca et al. (Eds.).
Springer, Cham, 127–143.
 Norman J Lass, Karen R Hughes, Melanie D Bowyer, Lu-
cille T Waters, and Victoria T Bourne. 1976. Speaker sex
identiﬁcation from voiced, whispered, and ﬁltered isolated
vowels. J. Acoust. Soc. Am. 59, 3 (1976), 675–678.
 Marianne Latinus and Pascal Belin. 2011. Human voice
perception. Current Biology 21, 4 (2011), R143–R145.
 Josephine Lau, Benjamin Zimmerman, and Florian Schaub.
2018. Alexa, Are You Listening?: Privacy Perceptions, Con-
cerns and Privacy-Seeking Behaviors with Smart Speakers.
Proceedings of the ACM on Human-Computer Interaction
2 (2018), 1–31.
 Josephine Lau, Benjamin Zimmerman, and Florian Schaub.
2018. “Alexa, Stop Recording”: Mismatches between Smart
Speaker Privacy Controls and User Needs. In Symposium
on Usable Privacy and Security.
 Yuting Liao, Jessica Vitak, Priya Kumar, Michael Zim-
mer, and Katherine Kritikos. 2019. Understanding the
Role of Privacy and Trust in Intelligent Personal Assistant
Adoption. In Information in Contemporary Society, Na-
talie Greene Taylor, Caitlin Christian-Lamb, Michelle H.
Martin, and Bonnie Nardi (Eds.). Springer, Cham, 102–
 Daniel J. Liebling and Sören Preibusch. 2014. Privacy
Considerations for a Pervasive Eye Tracking World. In Pro-
ceedings of the 2014 ACM International Joint Conference
on Pervasive and Ubiquitous Computing. 1169–1177.
to Admit Humans Can Listen to Skype Translator and
Cortana Audio. https://social.techcrunch.com/2019/
accessed on 12 September 2021).
 Lydia Manikonda, Aditya Deotale, and Subbarao Kamb-
hampati. 2018. What’s up with Privacy?: User Preferences
and Privacy Concerns in Intelligent Personal Assistants. In
AAAI/ACM Conference on AI, Ethics, and Society. 229–
 Alessandro Mantelero. 2018. AI and Big Data: A blueprint
for a human rights, social and ethical impact assessment.
Computer Law & Security Review 34, 4 (2018), 754–772.
 Rob Matheson. 2016. Watch your tone. https://news.mit.
0120. (last accessed on 12 September 2021).
 Morgan N McCredie and Leslie C Morey. 2019. Who are
the Turkers? A characterization of MTurk workers using the
personality assessment inventory. Assessment 26, 5 (2019),
 Mary L McHugh. 2012. Interrater Reliability: The Kappa
Statistic. Biochemia Medica 22, 3 (2012), 276–282.
 Emily McReynolds, Sarah Hubbard, Timothy Lau, Aditya
Saraf, Maya Cakmak, and Franziska Roesner. 2017. Toys
That Listen: A Study of Parents, Children, and Internet-
Connected Toys. In Proceedings of the CHI Conference.
 Maryam Mehrnezhad, Ehsan Toreini, Siamak F. Shahan-
dashti, and Feng Hao. 2018. Stealing PINs via Mobile
Sensors: Actual Risk versus User Perception. International
Journal of Information Security 17, 3 (2018), 291–313.
 Cristina Mihale-Wilson, Jan Zibuschka, and Oliver Hinz.
2017. About User Preferences and Willingness to Pay
for a Secure and Privacy Protective Ubiquitous Personal
Assistant. In Proceedings of the 25th European Conference
on Information Systems (ECIS).
 George R Milne, George Pettinico, Fatima M Hajjat, and
Ereni Markos. 2017. Information sensitivity typology: Map-
ping the degree and type of risk consumers perceive in
personal data sharing. Journal of Consumer Aﬀairs 51, 1
 Aarthi Easwara Moorthy and Kim-Phuong L. Vu. 2015.
Privacy Concerns for Use of Voice Activated Personal
Assistant in the Public Space. International Journal of
Human-Computer Interaction 31, 4 (2015), 307–335.
 Evelyne Moyse. 2014. Age estimation from faces and
voices: a review. Psychologica Belgica 54, 3 (2014), 255–
 Dibya Mukhopadhyay, Maliheh Shirvanian, and Nitesh Sax-
ena. 2015. All your voices are belong to us: Stealing voices
to fool humans and machines. In European Symposium on
Research in Computer Security. Springer, 599–621.
 Andreas Nautsch et al. 2019. Preserving privacy in speaker
and speech characterisation. Computer Speech & Language
58 (2019), 441–480. http://dx.doi.org/10.1016/j.csl.2019.
 Andreas Nautsch, Catherine Jasserand, Els Kindt, Mas-
similiano Todisco, Isabel Trancoso, and Nicholas Evans.
2019. The GDPR & Speech Data: Reﬂections of Legal and
Technology Communities, First Steps Towards a Common
Understanding. Proc. Interspeech (2019), 3695–3699.
 Andreas Nautsch, Jose Patino, Natalia Tomashenko, Ju-
nichi Yamagishi, Paul-Gauthier Noe, Jean-Francois Bonas-
tre, Massimiliano Todisco, and Nicholas Evans. 2020. The
Privacy ZEBRA: Zero Evidence Biometric Recognition As-
sessment. In Proc. Interspeech.
 Evan Niu. 2020. Smart-Speaker Volumes Expected to Jump Next Year. https://www.nasdaq.com/articles/smart-speaker-volumes-expected-to-jump-next-year-2020-10-23 (last accessed on 12 September 2021).
 Data Ethics Commission of the Federal Government. 2019.
Opinion of the Data Ethics Commission. German Federal
Ministry of Justice and Consumer Protection, Berlin.
 Tobias Olsson, Ulli Samuelsson, and Dino Viscovi. 2019.
At risk of exclusion? Degrees of ICT access and literacy
among senior citizens. Information, Communication &
Society 22, 1 (2019), 55–72.
 Kuan Ee Brian Ooi, Margaret Lech, and Nicholas B Allen.
2012. Multichannel weighted speech classiﬁcation system
for prediction of major depression in adolescents. IEEE
Transactions on Biomedical Engineering 60, 2 (2012), 497–
 Yaakov Ophir, Itay Sisso, Christa SC Asterhan, Refael
Tikochinski, and Roi Reichart. 2020. The Turker Blues:
Hidden Factors Behind Increased Depression Rates Among
Amazon’s Mechanical Turkers. Clinical Psychological Sci-
ence 8, 1 (2020), 65–83.
 Elleen Pan, Jingjing Ren, Martina Lindorfer, Christo Wil-
son, and David Chones. 2018. Panoptispy: Characterizing
Audio and Video Exﬁltration from Android Applications. In
Proceedings on Privacy Enhancing Technologies. 33–50.
 A Parasuraman and Charles L Colby. 2015. An Updated
and Streamlined Technology Readiness Index: TRI 2.0.
Journal of Service Research 18, 1 (2015), 59–74.
 Sarah Perez. 2019. Report: Voice assistants in use to triple to 8 billion by 2023. https://techcrunch.com/2019/02/12/report-voice-assistants-in-use-to-triple-to-8-billion-by-2023 (last accessed on 12 September 2021).
 Tim Polzehl. 2016. Personality in Speech: Assessment and
Automatic Classiﬁcation. Springer, Cham.
 Lesandro Ponciano, Pedro Barbosa, Francisco Brasileiro,
Andrey Brito, and Nazareno Andrade. 2017. Designing for
Pragmatists and Fundamentalists: Privacy Concerns and
Attitudes on the Internet of Things. In Brazilian Sympo-
sium on Human Factors in Computing Systems.
 Andrew Raij, Animikh Ghosh, Santosh Kumar, and Mani
Srivastava. 2011. Privacy Risks Emerging from the Adop-
tion of Innocuous Wearable Sensors in the Mobile Environ-
ment. In Proceedings of the CHI Conference. ACM.
 Kari Rea. 2013. Glenn Greenwald: Low-Level NSA
Analysts Have ’Powerful and Invasive’ Search Tool.
 Elissa M Redmiles, Sean Kross, and Michelle L Mazurek.
2019. How well do my results generalize? Comparing se-
curity and privacy survey results from MTurk, web, and
telephone samples. In 2019 IEEE Symposium on Security
and Privacy. 1326–1343.
 Seyed Omid Sadjadi, Sriram Ganapathy, and Jason W Pele-
canos. 2016. Speaker age estimation on conversational
telephone speech using senone posterior based i-vectors. In
2016 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). 5040–5044.
 Florian Schiel. 2011. Perception of alcoholic intoxication in
speech. In Proc. Interspeech. 3281–3284.
 Björn Schuller, Stefan Steidl, Anton Batliner, Elmar Nöth,
Alessandro Vinciarelli, Felix Burkhardt, Rob van Son, Felix
Weninger, Florian Eyben, Tobias Bocklet, Gelareh Mo-
hammadi, and Benjamin Weiss. 2015. A Survey on Per-
ceived Speaker Traits: Personality, Likability, Pathology,
and the First Challenge. Computer Speech & Language 29,
1 (2015), 100–131.
 Björn Schuller, Stefan Steidl, Anton Batliner, Florian
Schiel, Jarek Krajewski, Felix Weninger, and Florian Ey-
ben. 2014. Medium-term speaker states — A review on
intoxication, sleepiness and the ﬁrst challenge. Computer
Speech & Language 28, 2 (2014), 346–374.
 Amit Kumar Sikder, Giuseppe Petracca, Hidayet Aksu,
Trent Jaeger, and A. Selcuk Uluagac. 2018. A Survey on
Sensor-Based Threats to Internet-of-Things (IoT) Devices
and Applications. arXiv:1802.02041 [cs] (2018).
 Dave Smith. 2019. MicrophoneGate: The World’s Biggest
Tech Companies Were Caught Sending Sensitive Audio
from Customers to Human Contractors. Here’s Where
They Stand Now. https://www.businessinsider.com/
 Daniel J Solove. 2013. Privacy self-management and the
consent dilemma. Harvard Law Review 126 (2013), 1880.
 Raphael Spreitzer, Veelasha Moonsamy, Thomas Korak,
and Stefan Mangard. 2018. Systematic Classiﬁcation of
Side-Channel Attacks: A Case Study for Mobile Devices.
IEEE Communications Surveys & Tutorials 20, 1 (2018),
 Anna Ståhlbröst, Annika Sällström, and Danilo Hollosi.
2014. Audio Monitoring in Smart Cities – an Information
Privacy Perspective. In 12th International Conference e-
 Dan Stowell, Dimitrios Giannoulis, Emmanouil Benetos,
Mathieu Lagrange, and Mark D Plumbley. 2015. Detection
and classiﬁcation of acoustic scenes and events. IEEE
Transactions on Multimedia 17, 10 (2015), 1733–1746.
 SoSci Survey. 2020. The Solution for Professional Online
 Dan Svantesson and Roger Clarke. 2010. Privacy and
consumer risks in cloud computing. Computer Law &
Security Review 26, 4 (2010), 391–397.
 Monorama Swain, Aurobinda Routray, and P. Kabisatpa-
thy. 2018. Databases, Features and Classiﬁers for Speech
Emotion Recognition: A Review. International Journal of
Speech Technology 21, 1 (2018), 93–120.
 VoiceSense. 2020. Speech Analysis as a Talent Recruit-
ing and Retention Tool. https://www.voicesense.com/
 Sandra Wachter and Brent Mittelstadt. 2019. A right to
reasonable inferences: Re-thinking data protection law in
the age of big data and AI. Colum. Bus. L. Rev. (2019),
 Kelly Walters, Dimitri A Christakis, and Davene R Wright.
2018. Are Mechanical Turk worker samples representative
of health status and health behaviors in the US? PlOS One
13, 6 (2018), e0198835.
 Wikipedia. [n.d.]. Nothing to hide argument. https:
accessed on 12 September 2021).
 Wikipedia. [n.d.]. Telephone tapping. https://en.
wikipedia.org/wiki/Telephone_tapping (last accessed on
12 September 2021).
 Heng Xu, Tamara Dinev, Jeﬀ Smith, and Paul Hart. 2011.
Information privacy concerns: Linking individual perceptions
with institutional privacy assurances. Journal of the Associ-
ation of Information Systems 12, 12 (2011), 798–824.
 Eric Zeng, Shrirang Mare, and Franziska Roesner. 2017.
End user security and privacy concerns with smart homes.
In Symposium on Usable Privacy and Security (Santa
Clara, CA, USA). 65–80.
 Serena Zheng, Noah Apthorpe, Marshini Chetty, and Nick
Feamster. 2018. User Perceptions of Smart Home IoT
Privacy. Proceedings of the ACM on Human-Computer
Interaction 2 (2018), 1–20.
A Survey questionnaire
1. What is your gender? (Female/Male/Other)
2. How old are you?
3. Privacy awareness4
– I am aware of the privacy issues and practices
in our society.5
– I follow the news and developments about the
privacy issues and privacy violations.5
– I keep myself updated about privacy issues and
the solutions that companies and the govern-
ment employ to ensure our privacy.5
4. Disposition to value privacy4
– Compared to others, I am more sensitive about
the way companies handle my personal information.5
– To me, it is the most important thing to keep
my information privacy.5
– Compared to others, I tend to be more con-
cerned about threats to my information privacy.5
5. Previous privacy experience4
– How often have you been a victim of what you
felt was an improper invasion of privacy?6
– How much have you heard or read during the
past year about the use and potential misuse of
the information collected from the Internet?6
– How often have you experienced incidents where
your personal information was used by a com-
pany without your authorization?6
6. Voice-controlled virtual assistant7
A virtual assistant is a software agent that can
perform tasks based on voice commands, without
the requirement for keyboard input. Some exam-
ples of commercially available virtual assistants are
Apple’s Siri, Amazon Alexa, Microsoft’s Cortana,
and Google Assistant. Among other capabilities,
virtual assistants can set reminders, manage con-
4The constructs Privacy Awareness (PA), Disposition to Value
Privacy (DVP), and Previous Privacy Experience (PPE) are
adapted from Xu et al. 
5Item was measured on a 5-point Likert scale: Strongly dis-
agree, Somewhat disagree, Neutral, Somewhat agree, Strongly
agree
6Item was measured on a 5-point Likert scale: Never, Rarely,
Sometimes, Often, Very often
7The order of items shown here reﬂects questionnaire Grp-A.
In Grp-B, this component (№6, including the two questions) is
positioned after the educational video, replacing question №15.
See Sect. 4.1 for an explanation.
tacts, play music, take purchase orders, send mes-
sages and calls, provide weather reports, and man-
age smart home devices.
– How often do you use a voice-controlled virtual
assistant in your daily life?6
– Are you interested in starting or continuing to
use a voice-controlled virtual assistant?8
7. Please attentively watch this video (1:44 minutes)
about the privacy implications of voice data.9
8. Please enter the code displayed in the video. (not
9. Before you watched the video, were you aware that
these types of information can be inferred from voice
recordings?
– Geographical origin10
– Gender and age10
– Mental health information10
– Physical health information10
– Level of intoxication10
– Moods and emotions10
– Sleepiness and fatigue10
– Personality traits10
10. How worried are you about these possible inferences?11
11. Why do you feel this way? Please explain your rea-
soning in two or more sentences.
12. Prior to this questionnaire, how often have you
consciously thought about this issue when using a
voice-enabled device?6
13. How common do you think it is for companies
to draw such inferences from voice recordings?12
14. Please rate how concerned you would be if a com-
pany used voice recordings to infer personal infor-
mation about you without your awareness.
– Geographical origin13
– Gender and age13
8Item was measured on a 5-point Likert scale: Not at all in-
terested, Slightly interested, Somewhat interested, Quite inter-
ested, Very interested
9Video clip available here: https://youtu.be/Gr22YqS1_VA.
10 Item was measured on a 5-point Likert scale: Not at all,
Slightly, Somewhat, Quite well, Very well
11 Item was measured on a 5-point Likert scale: Not at all wor-
ried, Slightly worried, Somewhat worried, Quite worried, Very
worried
12 Item was measured on a 5-point Likert scale: Very uncom-
mon, Somewhat uncommon, Undecided, Somewhat common,
Very common
13 Item was measured on a 5-point Likert scale: Unconcerned,
Slightly concerned, Somewhat concerned, Quite concerned,
Very concerned
– Mental health information13
– Physical health information13
– Level of intoxication13
– Moods and emotions13
– Sleepiness and fatigue13
– Personality traits13
15. Previously in this survey, you were asked about your
interest in voice-controlled virtual assistants, such
as Apple’s Siri and Amazon Alexa. You are now
asked about this a second time. Please answer based
on your current thoughts and feelings, independent
from your previous response.
– Are you interested in starting or continuing to
use a voice-controlled virtual assistant?8
16. Please indicate the extent to which you agree or
disagree with each statement.14
– Other people come to me for advice on new
technologies5
– In general, I am among the ﬁrst in my circle of
friends to acquire new technology when it appears5
– I can usually ﬁgure out new high-tech products
and services without help from others5
– I keep up with the latest technological develop-
ments in my areas of interest5
– I enjoy the challenge of ﬁguring out high-tech
gadgets5
– I ﬁnd I have fewer problems than other people
in making technology work for me5
– I prefer to use the most advanced technology
available5
– Show that you are paying attention by skipping
this row without making a tick5
– I ﬁnd new technologies to be mentally stimulating5
– Learning about technology can be as rewarding
as the technology itself5
17. Do you own any devices that have a microphone?
Select all that apply.
– Smart speaker
– Car with voice control interface
– Voice-enabled remote control
14 This construct – Innovativeness (INNO) – is adapted from
Parasuraman and Colby.
– Other (please specify)
18. What is the highest level of education you have obtained?
– Finished school with no qualiﬁcations
– Still in secondary school
– GCSE Level education (e. g., GCSE, O-Levels, CSE)
– A-Level education (e. g., A, AS, S-Levels, Highers)
– Some undergraduate education (i. e., university
examinations but not completed degree)
– Degree or Graduate education (e. g., BSc, BA)
– Post-graduate education (e. g., MSc, MA)
– Doctorate degree
– Vocational education (e. g., NVQ, HNC, HND)
– Other degree or qualiﬁcation (please specify)
19. Do you have professional experience in the following
ﬁelds?
– Data protection law15
– Computer science15
– Data mining15
– Information technology security (IT security)15
20. What is your monthly net income? (Net income is
deﬁned as your total income after tax and social
security contributions.)
15 Item was measured on a 5-point Likert scale: No experience,
Little experience, Some experience, Much experience, Very much
experience
– I do not have a personal income
– Less than £250
– £250 up to £500
– £500 up to £1000
– £1000 up to £1500
– £1500 up to £2000
– £2000 up to £3000
– £3000 up to £4000
– £4000 up to £5000
– £5000 or more
– Decline to answer
21. To check whether we have expressed ourselves clearly
enough, please tick the description that best reﬂects
the topic of this study.
– Health eﬀects of urban air pollution
– Privacy concerns related to voice recordings
– Telecommunications in India
– Professional music production
– Health concerns about wireless device radiation
– Landlord and tenant privacy rights
B Correlation table
Spearman’s rank correlations between participant de-
mographics and participants’ awareness of audio-based
inferences are shown in Table 6, along with Bonferroni-
corrected signiﬁcance levels.
Table 6. Spearman’s rank correlations between participant demographics and awareness for audio-based inferences
AW_DEM 95% CI AW_STATE 95% CI AW_TRAIT 95% CI
Age –0.24*** (–1.00 to –0.19) –0.18*** (–1.00 to –0.12) –0.19*** (–1.00 to –0.14)
Gender 0.05 (–0.03 to 0.12) 0.11* (0.03 to 0.19) 0.11* (0.04 to 0.18)
Income 0.08 (0.00 to 0.17) 0.06 (–0.03 to 0.14) 0.04 (–0.04 to 0.12)
Education 0.29*** (0.23 to 1.00) 0.17*** (0.10 to 1.00) 0.14** (0.07 to 1.00)
INNO 0.24*** (0.18 to 1.00) 0.18*** (0.12 to 1.00) 0.21*** (0.15 to 1.00)
PA 0.18*** (0.12 to 1.00) 0.20*** (0.14 to 1.00) 0.19*** (0.12 to 1.00)
DVP 0.08 (0.01 to 0.17) 0.05 (–0.02 to 0.13) 0.09 (0.03 to 0.17)
PPE 0.19*** (0.12 to 1.00) 0.18*** (0.11 to 1.00) 0.18*** (0.12 to 1.00)
EXP_DP 0.20*** (0.14 to 1.00) 0.16*** (0.10 to 1.00) 0.14** (0.07 to 1.00)
EXP_DM 0.28*** (0.22 to 1.00) 0.26*** (0.20 to 1.00) 0.28*** (0.22 to 1.00)
EXP_CS 0.27*** (0.21 to 1.00) 0.22*** (0.16 to 1.00) 0.26*** (0.20 to 1.00)
EXP_IS 0.25*** (0.18 to 1.00) 0.22*** (0.15 to 1.00) 0.22*** (0.14 to 1.00)
*p < 0.05, ** p < 0.01, *** p < 0.001
AW_: average level of awareness about audio-based inference of demographics (DEM) / short- and medium-
term states (STATE) / physical and psychological traits (TRAIT); INNO: innovativeness; PA: privacy aware-
ness; DVP: disposition to value privacy; PPE: previous privacy experience; EXP_: professional experience
in data protection law (DP) / data mining (DM) / computer science (CS) / IT security (IS); CI: conﬁdence
interval
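The statistics reported in Table 6 can be sketched in code. The following is a minimal, self-contained illustration (not the authors' analysis script) of how Spearman's rank correlation and a Bonferroni-corrected significance threshold are computed; function and variable names are illustrative.

```python
from statistics import mean

def ranks(xs):
    """Assign average ranks (1-based), splitting ties evenly."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        # extend j over the run of tied values
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg_rank
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the rank-transformed data."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

def bonferroni_threshold(alpha, m):
    """Per-test significance threshold after Bonferroni correction for m tests."""
    return alpha / m
```

If all 12 predictors are tested against all three awareness scores, that is 36 tests, so a nominal α = 0.05 corresponds to a per-test threshold of roughly 0.0014 under this correction.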