TILES Audio Recorder: An unobtrusive wearable solution to
track audio activity
Tiantian Feng
Signal Analysis and Interpretation
Lab, University of Southern California
Los Angeles, California
Amrutha Nadarajan
Signal Analysis and Interpretation
Lab, University of Southern California
Los Angeles, California
Colin Vaz
Signal Analysis and Interpretation
Lab, University of Southern California
Los Angeles, California
Brandon Booth
Signal Analysis and Interpretation
Lab, University of Southern California
Los Angeles, California
Shrikanth Narayanan
Signal Analysis and Interpretation
Lab, University of Southern California
Los Angeles, California
ABSTRACT
Most existing speech activity trackers used in human subject studies are bulky, record raw audio content which invades participant privacy, have complicated hardware and non-customizable software, and are too expensive for large-scale deployment. The present effort seeks to overcome these challenges by proposing the TILES Audio Recorder (TAR): an unobtrusive and scalable solution to track audio activity using an affordable miniature mobile device with an open-source app. For this recorder, we make use of the Jelly Pro Mobile, a pocket-sized Android smartphone, and employ two open-source toolkits: openSMILE and Tarsos-DSP. Tarsos-DSP provides a Voice Activity Detection capability that triggers openSMILE to extract and save audio features only when the subject is speaking. Experiments show that performing feature extraction only during speech segments greatly increases battery life, enabling the subject to wear the recorder up to 10 hours at a time. Furthermore, recording experiments with ground-truth clean speech show minimal distortion of the recorded features, as measured by root-mean-square error and cosine distance. The TAR app further provides subjects with a simple user interface that allows them to both pause feature extraction at any time and easily upload data to a remote server.
CCS CONCEPTS
• Human-centered computing → Ubiquitous and mobile computing design and evaluation methods;

KEYWORDS
Audio, wearable sensing, audio processing, openSMILE, privacy, human subjects study, audio feature recorder
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
WearSys ’18, June 10, 2018, Munich, Germany
©2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5842-2/18/06. . . $15.00
ACM Reference Format:
Tiantian Feng, Amrutha Nadarajan, Colin Vaz, Brandon Booth, and Shrikanth
Narayanan. 2018. TILES Audio Recorder: An unobtrusive wearable solution
to track audio activity. In WearSys ’18: 4th ACM Workshop on Wearable
Systems and Applications, June 10, 2018, Munich, Germany. ACM, New York,
NY, USA, Article 4, 6 pages.
1 INTRODUCTION
Rapid advances in signal acquisition and sensing technology have produced miniaturized and low-power sensors capable of monitoring people's activities, physiological functions, and their environment [ ]. When coupled with recent large-scale data analysis techniques, they show great promise for many applications, such as health monitoring, rehabilitation, fitness, and well-being [ ]. These advances have been primarily enabled by the ubiquitous mobile phone, which facilitates sensor setup, data transfer, and user feedback [ ]. These non-invasive devices have made the design and implementation of medical, psychological, and behavioral research studies easier by reducing participant burden. We focus on applications of these technologies to audio recording of human speech in the wild.
Audio recording devices have come a long way, from bulky phonographs to hand-held microphones and lapel mics. The legacy systems are cumbersome, obtrusive, expensive, and not scalable. They can also potentially invade privacy by recording and storing raw audio. Mobile phones, laptops, and wearables are becoming a part of our everyday life. By using the miniature mics on these devices, a digital recorder [ ] can overcome some of the drawbacks of the legacy systems, like obtrusiveness and poor scalability. These modern devices are also capable of capturing day-long audio samples in naturalistic environments. Most digital recording solutions, like the EAR [15], afford privacy by allowing the user to delete recordings retrospectively. This process demands additional effort from the user to ensure their privacy is not violated, and it potentially jeopardizes the privacy of the people the user interacts with.
A recently-developed audio recording solution for human subjects studies is the EAR [15]. EAR is an app that runs on a personal smartphone and records small fractions of audio through the day at regular intervals. The recording scheme senses audio only 5% of the day and does not have an intelligent recording trigger to initiate the recording. This could potentially lead to recordings containing no audio activity or failing to record salient audio activity. Other notable efforts in the development of wearable audio recorders include devices like the Sociometer [7]. This device contains a microphone, motion sensor, and proximity sensor embedded in a shoulder pad that can be strapped on. However, this device is not commercially available, which makes large-scale deployment prohibitive. SoundSense [ ] is another audio wearable device that classifies different auditory scenes (music, speech, ambient sound) and events (gender, music, speech analysis) surrounding the user. It runs on mobile devices, so it could scale to large study populations, but the platform is not publicly available. Unfortunately, recording solutions using phones usually suffer from poor recording quality because people commonly carry phones in their pockets or handbags. A study by VanDam [19] has shown that denser material composition and indirect microphone placement introduce amplitude loss at frequencies higher than 1 kHz.
In this paper, we introduce a new digital recording system called the TILES Audio Recorder (TAR), which enables passive sensing of audio activity in sensitive environments where recording raw audio is not a viable option. The recorder comes with an audio activity detector and a customizable recording scheme to initiate audio feature extraction as required by the demands of the investigator. The recorded features are not audible, but a large number of studies have successfully predicted emotion or stress from such speech features [5].
In the following sections, we give an overview of the TAR system, provide experimental results on its expected battery life and the robustness of its audio activity detection, and analyze the audio feature degradation at various distances from an audio source. Our results show that the system works as intended and introduces negligible feature degradation.
2 TAR SYSTEM DESCRIPTION
In this section, we provide a detailed description of the hardware and software comprising the TAR system. TAR is a combination of a lightweight Android mobile device and an Android app to record audio features.
2.1 Motivation
As we described in Section 1, user privacy and unobtrusiveness are critical factors in designing a wearable audio recorder. In this study, we propose TAR, which is especially useful for deployment in environments like a hospital where privacy protections restrict raw audio recording. Moreover, TAR is designed to be unobtrusive and comfortable when worn. In addition, investigators can easily reproduce the setup, given that TAR runs on consumer hardware and the software is freely available on GitHub. TAR introduces various features to address many study design challenges when using wearable audio recorders:
• affordable cost allows scaling to a large number of participants

1TILES: Tracking IndividuaL pErformance using Sensors
2TAR github:
Figure 1: A recommended setup for TAR
• ease of placing the sensor close to the wearer's mouth provides consistently high recording quality
• extracting and storing audio features and immediately discarding raw audio protects user privacy
• an intuitive user interface enables wireless data transfer and allows data capture to be paused and resumed at any time
2.2 Hardware
TAR is designed to run on the Android platform because it is open source and runs on a variety of devices, including small, lightweight, budget-friendly portable smartphones like the Jelly Pro [2]. The Jelly Pro smartphone, used in this work, has the following specifications:
• Dimensions and weight: 3.07 x 4.92 x 1.97 in, 60.4 g
• Processor: Quad-core, 1.1 GHz
• Battery capacity: 950 mAh
• Memory: 2 GB RAM / 16 GB ROM
• Sensors: microphone, camera, compass, gyroscope
A complete hardware setup consists of a Jelly Pro phone, a phone case, and a metal badge clip with PVC straps. The phone case is commercially available, and the Jelly Pro fits inside it. A metal badge clip, fastened to the case, is clipped to the user's clothing near the collar. With this setup, the distance from the wearer's mouth to the phone's integrated microphone typically ranges from 15 cm to 30 cm. A recommended setup is shown in Fig. 1.
2.3 Software
In this section, we describe the TAR Android application and the various service components and open-source tools that enable recording of audio information.
App services. The TAR Android application starts when the mobile phone switches on. The TAR app includes the following services running in the background:
• a Voice Activity Detection (VAD) service that runs periodically to listen for speech and, if speech is detected, triggers the audio feature extraction service
• an audio feature extraction service that extracts audio features from the raw audio stream; the raw audio is deleted immediately after the features are extracted, and only the features are saved
• a battery service that records the battery percentage every 5 minutes, which helps monitor and optimize battery performance
Figure 2: Recording control logic output with a simulated data stream
• a data upload service to securely transfer data to a remote server over WiFi
Open-source libraries. Most current mobile phone speech processing libraries rely on a frontend-backend architecture. The frontend running on the phone typically records and transmits audio snippets to the backend over an Internet connection. The backend, which commonly runs on a remote server, retrieves the audio recordings and performs feature extraction. Such a solution reduces the computational burden on the phone and decouples data acquisition from processing. However, in our experience, fewer participants are willing to wear devices that attempt to transmit the raw audio. To allow TAR to be used where security and privacy are a concern, the TAR app runs real-time speech processing engines on the phone and avoids transmitting raw audio.
In our TAR application, we incorporate two real-time speech processing libraries: openSMILE [8] and TarsosDSP [17]. TarsosDSP is a Java library for audio processing that outputs real-time audio energy and pitch; it serves as a straightforward implementation of a VAD in TAR. openSMILE is a tool for extracting a wide range of features from audio signals, and the extracted features have been used for classifying emotional states and speaker properties [5]. In our application, the feature extraction runs entirely on the mobile device, and an Internet connection is not required. Another benefit TAR gains from openSMILE is that it is highly configurable, allowing investigators to extract various combinations of audio features to suit different purposes. We also evaluated other potential DSP libraries, including Funf [4]. However, Funf only extracts FFT coefficients and MFCCs from the audio stream, constraining researchers from acquiring other valuable features from the audio.
3 VAD-TRIGGERED RECORDING SCHEME
In this section, we describe the VAD-triggered recording scheme used by TAR. The scheme attempts to record audio features only when audio activity is detected, in order to reduce storage requirements and increase battery life. We use TarsosDSP [17] to implement the VAD: TarsosDSP outputs the audio energy every 10 ms, which TAR then uses to determine the presence of audio activity.
Figure 3: TAR battery life vs. To
3.1 Recording Parameters
TAR initiates feature recording when the VAD service is active and the recording trigger is on. One parameter specifies the duration for which the VAD service runs, and another, To, controls the length of the idle state between two consecutive VAD runs; a larger To means a longer idle state during which no VAD service takes place. In addition, we monitor the audio energy output from Tarsos-DSP to ensure that feature extraction happens only when there is some audio activity. Two additional parameters control when the recording trigger is set to on: h, a threshold on the speech energy, and a minimum-duration parameter that sets how long the speech energy must stay above h to trigger feature extraction. Once triggered, TAR runs the audio feature extraction service for l seconds. TAR also offers the flexibility to record audio features periodically even when the VAD service has not detected speech, in order to sample some environment information.
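The trigger condition described above (energy staying above h for a minimum duration, sampled every 10 ms) can be sketched as follows. The function name, parameter names, and default values here are illustrative placeholders, not TAR's actual implementation.

```python
# Illustrative sketch of TAR's VAD trigger condition.
# h and min_frames are hypothetical defaults, not TAR's real settings.

def should_trigger(energies_db, h=-56.0, min_frames=10):
    """Return True once the frame energy (dB) stays above threshold h
    for min_frames consecutive 10 ms frames (here, 100 ms total)."""
    run = 0
    for e in energies_db:
        run = run + 1 if e > h else 0  # count consecutive above-threshold frames
        if run >= min_frames:
            return True
    return False
```

With these defaults, a 100 ms burst above the threshold fires the trigger, while shorter energy spikes (e.g., a door slam spanning a few frames) do not.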
The recording parameters in TAR are tunable and should be adjusted to provide the highest resolution of audio samples for the duration required by the researcher. Figure 3 presents the average battery operating time of TAR for different To values (in seconds). We tested and measured the battery operating time of 20 different Jelly
Figure 4: Schematic of the experimental setup for examining VAD accuracy and feature degradation, where d1 = 15, d2 = 20, d3 = 25, and d4 = 30 cm
Pro devices for each value of To, with h set low enough that the VAD service triggered audio feature recording every time.
4 EXPERIMENTS
As described in Section 3, TAR uses a VAD to extract audio features only during regions of audio activity. As the device is intended for use in the wild, we tested the VAD accuracy in noisy scenarios as a function of the energy threshold h. We also examine whether the device introduces any quality degradation in the extracted features. The following subsections describe the experimental setup and discuss the results of the VAD accuracy and feature degradation experiments.
4.1 Experimental Setup
To test VAD accuracy and feature distortion, we proposed a record-
ing setup to allow TAR to record speech amplied through a speaker.
We synthetically created two sets of audio, one for testing VAD
accuracy and the other for testing feature degradation. For the VAD
experiment, we randomly chose 250 gender-balanced utterances
from the TIMIT database [
]. We concatenated all utterances and
introduced silence of random duration (between 2to 12 seconds) be-
tween utterances to simulate non-continuous speech. Additionally,
we added noise from the DEMAND database [
] to the concate-
nated utterances at 10 dB, 5dB, and 0dB SNR levels to test the VAD
performance in noisy speech. For the feature degradation experi-
ment, we randomly sampled 1000 gender-balanced utterances from
the TIMIT database and concatenated them into one le.
The audio files were then played through a loudspeaker. Multiple TARs, set at distances of 15 cm, 20 cm, 25 cm, and 30 cm from the loudspeaker, recorded the audio playback simultaneously. We chose these distances to mimic the typical range of distances between the TAR and a user's mouth during use. Figure 4 shows a diagram of the experimental setup. For the VAD accuracy experiment, we stored the outputs from Tarsos-DSP and later applied different values of the threshold h to the stored output. For the feature quality experiment, we modified the operation of TAR to record continuously. The features recorded by the devices were compared to the features extracted from the raw audio file.
4.2 VAD Accuracy Experiment
We set TAR at a distance of 20 cm from the speaker in the VAD
accuracy experiment as it is a typical distance from a user’s mouth.
In order to evaluate our Tarsos-DSP VAD scheme, we rst output
a VAD result from the clean audio source le using the
function provided in the VOICEBOX speech processing toolbox
]. The VAD result is a binary decision on a frame-by-frame basis,
Figure 5: VAD prediction accuracy as a function of the energy threshold h, using the Tarsos-DSP library
Figure 6: Confusion matrix of VAD output (%) using the Tarsos-DSP library against the baseline output, where h = −56 dB
where each frame is 10 ms in length. Following this, we manually correct the VOICEBOX output in the regions that give false predictions, producing a baseline VAD output. Meanwhile, we use the Tarsos-DSP output to generate the VAD decisions on the recorded audio in the clean and noisy conditions. As stated in Section 3, the VAD output is especially sensitive to h. To choose the best h, we swept h from −65 dB to −50 dB in 1 dB steps, with the minimum-duration parameter fixed at 100 ms. Finally, we compare the baseline VAD output and the Tarsos-DSP VAD outputs at each h.
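The frame-level comparison between the baseline and Tarsos-DSP decisions can be sketched as below. This is a hypothetical helper (names are our own), assuming both outputs are aligned sequences of 10 ms binary decisions.

```python
from collections import Counter

def vad_accuracy(baseline, predicted):
    """Frame-level accuracy and confusion counts between two aligned
    binary VAD decision sequences (one 0/1 value per 10 ms frame)."""
    assert len(baseline) == len(predicted)
    counts = Counter(zip(baseline, predicted))  # (truth, pred) -> frame count
    correct = counts[(0, 0)] + counts[(1, 1)]
    return correct / len(baseline), counts
```

The returned counts give the four confusion-matrix cells, from which precision and negative predictive rate, as reported in Figure 6, can be derived.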
Figure 5 presents the VAD prediction accuracy, in percent, with respect to h. We observe that the VAD accuracy initially increases and then decreases significantly. The difference in prediction accuracy among the four SNR conditions is less significant when h is below −64 dB or above −58 dB. The prediction accuracy peaks at h = −56 dB for clean speech, −56 dB at SNR = 10 dB, −55 dB at SNR = 5 dB, and −54 dB at SNR = 0 dB. Figure 6 presents the confusion matrix of VAD outputs between Tarsos-DSP and the baseline at h = −56 dB. We observe that in the clean speech and high-SNR (10 dB, 5 dB) conditions, the negative predictive rate is above 75%, while in the poor-SNR (0 dB) condition it drops to 68.98%. In addition, we note that the precision rates vary only slightly (about 5%) across SNR conditions. We advise users to set up similar recording experiments to calibrate h for their own usage.
4.3 Feature Quality Experiment
In this experiment, we apply the emobase cong setting [
] to ex-
tract the openSMILE low-level descriptors (LLDs) from the ground
TILES Audio Recorder: An unobtrusive wearable solution to track audio activity WearSys ’18, June 10, 2018, Munich, Germany
Table 1: Deviation of LLDs in utterance-region
RMSE (root-mean-squared error)
LLD 15cm 20cm 25cm 30cm
F0(Hz) 66.99 55.86 66.33 65.65
F0Env(Hz) 45.87 41.10 51.46 45.13
Loudness 0.863 0.507 0.352 0.352
MFCC[1-14] 45.52 43.77 41.45 42.81
LSP[1-8] 0.551 0.492 0.444 0.479
Cosine Distance
LLD 15cm 20cm 25cm 30cm
MFCC[1-14] 0.199 0.172 0.175 0.184
LSP[1-8] 0.00195 0.00153 0.00139 0.00150
truth concatenated TIMIT le and recordings. The feature set con-
tains 26 acoustic Low-Level Descriptors(LLDs), such as voicing re-
lated features(Pitch, voicing probability), energies (Intensity, Loud-
ness), zero-crossing rate(ZCR) and cepstral information(MFCCs
1-14). The proposed LLDs are extracted every 10 ms using a 25 ms
Hamming window
We compared the LLDs by measuring the root-mean-squared error (RMSE) and the cosine distance between the features extracted from the source audio and those recorded by TAR. Some features, however, are sensitive to energy levels and often show higher degradation in silence regions. Thus, we decided to compare the feature values only in utterance regions. First, we perform a sanity check on the raw signals. Using the baseline VAD output to remove the signals in the silence regions, we obtain RMSEs between the source audio signal and the recorded signals in the utterance regions of 0.1297 (15 cm), 0.0790 (20 cm), and 0.0598 (30 cm), with the smallest RMSE at 25 cm.
We then use the baseline VAD output to remove the LLDs in the silence regions. The RMSE and cosine distance between the ground-truth LLDs and the TAR-recorded LLDs for all test cases are listed in Table 1. Loudness, mel-frequency cepstral coefficients (MFCC), and line spectral pairs (LSP) give the largest RMSE for TAR#1 (15 cm) and the lowest RMSE for TAR#3 (25 cm). Loudness is extremely sensitive to recording distance: the maximum and minimum RMSE values are 0.863 and 0.352, respectively. The RMSE of most LLDs initially decreases with recording distance but starts to increase when the recording distance exceeds 25 cm. Meanwhile, we observed that MFCC and LSP show relatively small variation in cosine distance across recording distances. The deviations in LLDs agree with our sanity-check results, where TAR#3 (25 cm) shows minimum degradation in both the waveform signals and the LLDs.
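The two deviation measures can be sketched as follows. This is an illustrative helper (the function name is our own), not the paper's evaluation code; it assumes silence frames have already been removed and that the two feature streams are frame-aligned.

```python
import math

def lld_deviation(ref, rec):
    """RMSE and mean per-frame cosine distance between reference and
    recorded LLD streams (each a list of equal-length feature vectors,
    e.g., the 14 MFCC values for one 10 ms frame)."""
    sq_err, cos_dists, n_vals = 0.0, [], 0
    for a, b in zip(ref, rec):
        sq_err += sum((x - y) ** 2 for x, y in zip(a, b))
        n_vals += len(a)
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        cos_dists.append(1.0 - dot / norm)  # 0 for identical directions
    return math.sqrt(sq_err / n_vals), sum(cos_dists) / len(cos_dists)
```

Identical streams yield (0, 0); orthogonal frame vectors yield a cosine distance of 1, matching the ranges reported in Table 1.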
We plot the histograms of RMSE for F0 and loudness, and the histogram of cosine distance for MFCC, in Fig. 7. We perform the Kruskal-Wallis test [12] to investigate the consistency of RMSE among the four recording distances. We find that the difference is significant for loudness (p = 0.0018), but RMSE is fairly consistent for F0 (p = 0.3607) and F0Env (p = 0.2598). MFCC exhibits the most consistent measurements in both RMSE (p = 0.8550) and cosine distance (p = 0.6005). These results confirm that energy-related LLDs are sensitive to recording distance, while pitch and spectral LLDs yield consistent patterns across recording distances. We also observe that the majority of F0 errors lie under 10 Hz, which confirms the robustness of the F0 feature recorded by TAR.
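The Kruskal-Wallis comparison across the four distances can be sketched as below: a minimal H-statistic implementation following Kruskal and Wallis [12], with a hypothetical function name and without the tie-variance correction that full statistics packages apply.

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic for k independent samples
    (ties get averaged ranks; no tie correction of the variance)."""
    pooled = sorted(v for g in groups for v in g)
    rank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2  # mean of 1-based ranks i+1..j
        i = j
    n = len(pooled)
    # H = 12/(n(n+1)) * sum(R_i^2 / n_i) - 3(n+1)
    s = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n * (n + 1)) * s - 3 * (n + 1)
```

The p-values reported above then follow from comparing H against a chi-squared distribution with k − 1 degrees of freedom.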
(a) F0-RMSE histogram
(b) Loudness-RMSE histogram
(c) MFCC-cosine-distance histogram
Figure 7: (a) and (b) present the F0-RMSE and loudness-RMSE distributions; (c) displays the MFCC-cosine-distance distribution
5 USER INTERFACE
Figure 8 shows the GUI of the TAR application. We designed the GUI with two buttons for simplicity of use. The top button allows users to quickly disable the audio recording service for privacy reasons. The other button activates the upload of collected audio features to the back-end server.
We tested TAR with 20 participants to assess user interface satisfaction. The participants were asked to wear and use the TAR for a minimum of two hours. The Questionnaire for User Interface Satisfaction (QUIS), an instrument based on [6], was used to rate the TAR interface in terms of overall satisfaction with the software, the screens, ease of learning to use the interface, and system capabilities. TAR performed satisfactorily in all these aspects. Regarding the overall reaction to the software, average ratings of 7.8, 8.6, and 7.9 were reported on 10-point Likert scales of Terrible/Wonderful, Difficult/Easy, and Frustrating/Satisfying, respectively. Ease of learning to use the system was rated 9, as was the straightforwardness of performing tasks. These numbers show that the intended test users felt the UI was easy, simple, and straightforward to use.
6 CONCLUSIONS AND FUTURE WORK
In this paper, we presented TAR, an unobtrusive wearable solution to track audio activity in the wild. We described the hardware and software comprising the TAR system and explained the VAD-triggered recording scheme in the TAR Android app. We designed two experiments to examine the accuracy of the proposed VAD scheme as well as the quality of the extracted audio features. The results
(a) (b)
Figure 8: TAR Android App GUI. (a) UI display when TAR is
idle and not recording. (b) UI display when TAR recording
is turned on.
conrm the reliability of the VAD scheme. We observe minimal
distortion of the recorded features except energy LLDs as measured
by root-mean-square error and cosine distance. We also provided
TAR to 20 participants, and we received high user satisfaction
with the simple user interface. We believe the unobtrusiveness and
exibility of TAR, coupled with privacy-focused data acquisition,
makes it suitable for a wide range of people-centric audio sensing
Specically, within next four months, we plan to deploy current
TAR system over 150 volunteers who work at the USC Hospital.
We intend to collect only the audio features from individual vol-
unteer during working shift. We plan to use the collected audio
features to investigate the intercommunication behavior, individual
performance, and stress level of participants.
In future work, we plan to add functionality to automatically learn the recording parameters instead of hand-tuning them. We also aim to bring online analysis of the collected features, to classify contextual information or variations in the wearer's emotion, to future updates. Finally, we plan to publish the TAR app on the Google Play store, making it publicly accessible.
ACKNOWLEDGMENTS
The research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA Contract No. 2017-17042800005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation thereon.
REFERENCES
[1] [n. d.]. emobase.
[2] [n. d.]. JELLY.
[3] [n. d.]. VOICEBOX: Speech Processing Toolbox for MATLAB.
[4] Nadav Aharony, Wei Pan, Cory Ip, Inas Khayal, and Alex Pentland. 2011. Social fMRI: Investigating and shaping social mechanisms in the real world. Pervasive and Mobile Computing 7 (2011), 643–659.
[5] Carlos Busso, Zhigang Deng, Serdar Yildirim, Murtaza Bulut, Chul Min Lee, Abe Kazemzadeh, Sungbok Lee, Ulrich Neumann, and Shrikanth Narayanan. 2004. Analysis of Emotion Recognition Using Facial Expressions, Speech and Multimodal Information. In Proceedings of the 6th International Conference on Multimodal Interfaces (ICMI '04). ACM, New York, NY, USA, 205–211.
[6] John P. Chin, Virginia A. Diehl, and Kent L. Norman. 1988. Development of an instrument measuring user satisfaction of the human-computer interface. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 213–218.
[7] Tanzeem Choudhury and Alex Pentland. 2002. The Sociometer: A Wearable Device for Understanding Human Networks (CSCW '02 Workshop).
[8] Florian Eyben, Martin Wöllmer, and Björn Schuller. 2010. openSMILE: The Munich Versatile and Fast Open-source Audio Feature Extractor. In Proceedings of the 18th ACM International Conference on Multimedia. ACM, New York, NY, USA, 1459–1462.
[9] Hillary Ganek and Alice Eriks-Brophy. 2018. Language ENvironment Analysis (LENA) system investigation of day long recordings in children: A literature review. Journal of Communication Disorders 72 (2018), 77–85.
[10] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett. 1993. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N 93 (Feb. 1993).
[11] T. Klingeberg and M. Schilling. 2012. Mobile wearable device for long term monitoring of vital signs. Computer Methods and Programs in Biomedicine 106, 2 (2012), 89–96.
[12] William H. Kruskal and W. Allen Wallis. 1952. Use of Ranks in One-Criterion Variance Analysis. J. Amer. Statist. Assoc. 47, 260 (1952), 583–621.
[13] Yuhao Liu, James J. S. Norton, Raza Qazi, Zhanan Zou, Kaitlyn R. Ammann, Hank Liu, Lingqing Yan, Phat L. Tran, Kyung-In Jang, Jung Woo Lee, Douglas Zhang, Kristopher A. Kilian, Sung Hee Jung, Timothy Bretl, Jianliang Xiao, Marvin J. Slepian, Yonggang Huang, Jae-Woong Jeong, and John A. Rogers. 2016. Epidermal mechano-acoustic sensing electronics for cardiovascular diagnostics and human-machine interfaces. Science Advances 2, 11 (2016).
[14] T. Martin, E. Jovanov, and D. Raskovic. 2000. Issues in wearable computing for medical monitoring applications: a case study of a wearable ECG monitoring device. In Digest of Papers. Fourth International Symposium on Wearable Computers.
[15] Matthias R. Mehl, James W. Pennebaker, D. Michael Crow, James Dabbs, and John H. Price. 2001. The Electronically Activated Recorder (EAR): A device for sampling naturalistic daily activities and conversations. Behavior Research Methods, Instruments, & Computers 33, 4 (Nov. 2001), 517–523.
[16] Shyamal Patel, Hyung Park, Paolo Bonato, Leighton Chan, and Mary Rodgers. 2012. A review of wearable sensors and systems with application in rehabilitation. Journal of NeuroEngineering and Rehabilitation 9, 1 (2012), 21.
[17] Joren Six, Olmo Cornelis, and Marc Leman. 2014. TarsosDSP, a Real-Time Audio Processing Framework in Java. In Proceedings of the 53rd AES Conference.
[18] Joachim Thiemann, Nobutaka Ito, and Emmanuel Vincent. 2013. The diverse environments multi-channel acoustic noise database: A database of multichannel environmental noise recordings. Journal of the Acoustical Society of America 133, 5 (2013), 3591–3591.
[19] Mark VanDam. 2014. Acoustic characteristics of the clothes used for a wearable recording device. The Journal of the Acoustical Society of America 136, 4 (2014).
... The Atom phone ran custom software and served as a social-sensing, environment-sensing badge, and audio collection device using Bluetooth and its built-in microphone. The audio collection capability was handled by the TILES Audio Recorder (TAR), which immediately processed recorded audio samples by storing anonymized features and discarding the recording 35 . Participants charged their phones when they were not at work. ...
... We presented an analysis of the audio recorder in 35 and followed the same procedure as 16 . TAR primarily extracted the audio features using openSMILE 40 . ...
... OpenSMILE is a widely used tool for extracting a wide range of features from audio signals. To test the feature distortion from the recording device, a recording setup was proposed in 35 to allow TAR to record speech amplified through a speaker. In this feature degradation experiment, 1000 gender-balanced utterances from the TIMIT 44 database were randomly sampled and concatenated into one file. ...
Full-text available
Measurement(s) Stress • Burnout • Affect • Depression • Sleep • Physical Activity Measurement • Alcohol Use History • Frequency Any Tobacco Use • Personality • Social Support • Intragroup Conflict • Challenge and Hindrance Stressors • Demographics • Context and Atypical Events • Daily Stressors • Most Stressful Event • Work Context • Job Performance • Job Satisfaction • Stressors at Work • Charting at Home • Coworker Trust • Social Networks at Work • Socialization Outside of Work • Use of Wellness Resources • Heart Rate • Step Count • Acoustic Features • Team Interactions • Proximity to Key Objects • Cell Phone Use • Hospital Contextual Data • Coping with Stress • Productivity at Work • Pride at Work • Teamwork • Support System Technology Type(s) Perceived Stress Scale - 14 Questionnaire • Survey • Patient Health Questionnaire - 9 Item • Pittsburgh Sleep Quality Index • FitBit • International Physical Activity Questionnaire (August 2002) Short Last 7 Days Self-Administered Format • Unihertz Atom Phone • Minew E8- TILES Interaction Sensors • Minew E8- Eddystone Beach • Rescuetime • Evaluations • Patient Census • Interview Sample Characteristic - Organism Homo sapiens Sample Characteristic - Location Los Angeles County and University of Southern California Medical Center
... In order to quantitatively study speech activity among nursing professionals, the present study uses real-life audio collected from a longitudinal, large-scale study undertaken in a hospital setting. Speech activity and participant proximity to different locations are captured by a novel wearable audio sensor [23], alongside other multimodal measurements of physiology and activity [24]. In summary, we aim to quantitatively answer the questions below: ...
... Participants were asked to wear a wearable audio sensor called TAR (TILES Audio Recorder) [23] during each work shift throughout the 10-week period. TAR runs on a small, lightweight, budget-friendly portable Android platform called the Jelly Pro [26]. ...
Interpersonal spoken communication is central to human interaction and the exchange of information. Such interactive processes involve not only speech and spoken language but also non-verbal cues such as hand gestures, facial expressions, and non-verbal vocalizations that are used to express feelings and provide feedback. These multimodal communication signals carry a variety of information about people: traits like gender and age as well as physical and psychological states and behavior. This work uses wearable multimodal sensors to investigate interpersonal communication behaviors, focusing on the speaking patterns of healthcare providers, particularly nurses. We analyze longitudinal data collected from 99 nurses in a large hospital setting over ten weeks. The results indicate that speaking patterns differ across shift schedules and working units. Moreover, results show that speaking patterns combined with physiological measures can be used to predict affect measures and life satisfaction scores. The implementation of this work can be accessed at
... There are no search results that directly answer the question of how to deploy cleaning crews during periods of minimal track usage. The search results are about various topics such as hole cleaning during drilling operations (Ashok et al., 2021), social density monitoring for selective cleaning (Vu Le et al., 2021), audio recording (Feng et al., 2018) and noise removal (Goyal et al., 2021), deepwater drilling operations (Johnson et al., 2021), and robotic SPECT imaging. It is possible that the question is too specific or there is not enough information to provide a relevant answer. ...
This article explores the application of artificial intelligence (AI) techniques in the cleaning of railway tracks and its impact on operational efficiency and safety. By leveraging AI-enabled robots, predictive analytics, real-time monitoring, and automated track inspection systems, railways can achieve enhanced cleanliness, optimized cleaning schedules, and proactive maintenance measures. AI-powered robots equipped with advanced sensors and machine learning algorithms autonomously identify and remove debris, minimizing the risk of accidents and ensuring uninterrupted train operations. Predictive analytics models analyze historical data and passenger flow information to optimize cleaning schedules, reducing disruptions and improving efficiency. Real-time monitoring systems detect potential maintenance issues in advance, allowing for timely preventive measures. Automated track inspection systems powered by AI proactively detect anomalies, ensuring a higher level of quality assurance and facilitating prompt repairs. As AI technology advances, the railway industry can anticipate further innovations, revolutionizing track maintenance and contributing to a more reliable and safe transportation system.
... In this study, researchers instructed participants to wear a Fitbit Charge 2 [25], an OMsignal garment-based sensor [33], and a customized audio badge [18], which collectively track heart rate, physical activity, speech characteristics, and many other human-centric signals. Participants were asked to wear the OMsignal garment and audio badge only during their work shifts due to the battery limitations of these devices. ...
Continuously-worn wearable sensors enable researchers to collect copious amounts of rich bio-behavioral time series recordings of real-life activities of daily living, offering unprecedented opportunities to infer novel human behavior patterns during daily routines. Existing approaches to routine discovery through bio-behavioral data rely either on pre-defined notions of activities or use additional non-behavioral measurements as contexts, such as GPS location or localization within the home, presenting risks to user privacy. In this work, we propose a novel wearable time-series mining framework, Hawkes point process On Time series clusters for ROutine Discovery (HOT-ROD), for uncovering behavioral routines from completely unlabeled wearable recordings. We utilize a covariance-based method to generate time-series clusters and discover routines via the Hawkes point process learning algorithm. We empirically validate our approach for extracting routine behaviors using a completely unlabeled time-series collected continuously from over 100 individuals both in and outside of the workplace during a period of ten weeks. Furthermore, we demonstrate this approach intuitively captures daily transitional relationships between physical activity states without using prior knowledge. We also show that the learned behavioral patterns can assist in illuminating an individual's personality and affect.
... Participants engaged in their typical daily activities while being equipped with ambulatory wearable devices and sensors to collect vocal acoustic and physiological signals throughout the period of data collection. A Fitbit Charge 2 was used to measure sleep activity and exercise, an OMsignal garment collected heart rate and breathing rate, and the Unihertz Jelly Pro smartphone, a small and lightweight phone worn on the lapel, was programmed to obtain vocal acoustic features from statistically sampled egocentric audio recordings (91). ...
Introduction: Intelligent ambulatory tracking can assist in the automatic detection of psychological and emotional states relevant to the mental health changes of professionals with high-stakes job responsibilities, such as healthcare workers. However, well-known differences in the variability of ambulatory data across individuals challenge many existing automated approaches seeking to learn a generalizable means of well-being estimation. This paper proposes a novel metric learning technique that improves the accuracy and generalizability of automated well-being estimation by reducing inter-individual variability while preserving the variability pertaining to the behavioral construct. Methods: The metric learning technique implemented in this paper entails learning a transformed multimodal feature space from pairwise similarity information between (dis)similar samples per participant via a Siamese neural network. Improved accuracy via personalization is further achieved by considering the trait characteristics of each individual as additional input to the metric learning models, as well as by using individual trait-based cluster criteria to group participants, followed by training a metric learning model for each group. Results: The outcomes of the proposed models demonstrate significant improvement over the other inter-individual variability reduction and deep neural baseline methods for stress, anxiety, positive affect, and negative affect. Discussion: This study lays the foundation for accurate estimation of psychological and emotional states in realistic and ambulatory environments leading to early diagnosis of mental health changes and enabling just-in-time adaptive interventions.
... These two sensing strategies promised to provide useful audio features that were secure from reconstruction attacks. Additionally, Feng et al. (2018) introduced a wearable audio solution that enhances privacy by sampling low-level acoustic characteristics instead of raw audio samples, used to study workplace stress (Mundnich et al. (2020); Yau et al. (2022)). On the other hand, there has been growing interest in recent years in using a trusted execution environment (TEE) in speech-centric applications. ...
Speech-centric machine learning systems have revolutionized many leading domains ranging from transportation and healthcare to education and defense, profoundly changing how people live, work, and interact with each other. However, recent studies have demonstrated that many speech-centric ML systems may need to be considered more trustworthy for broader deployment. Specifically, concerns over privacy breaches, discriminating performance, and vulnerability to adversarial attacks have all been discovered in ML research fields. In order to address the above challenges and risks, a significant number of efforts have been made to ensure these ML systems are trustworthy, especially private, safe, and fair. In this paper, we conduct the first comprehensive survey on speech-centric trustworthy ML topics related to privacy, safety, and fairness. In addition to serving as a summary report for the research community, we point out several promising future research directions to inspire the researchers who wish to explore further in this area.
... The model achieved 82.6% accuracy, 80.2% SHR, and 14.9% FAR. For online evaluation, we played 15 minutes of audio collected from a naturalistic context through a loudspeaker while the VADLite app performed real-time classification of the audio, just as was done by Feng et al. [25]. VADLite had an SHR and FAR of 91.6% and 5.5%, respectively. ...
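Assuming SHR is the fraction of reference speech frames correctly detected and FAR is the fraction of non-speech frames incorrectly flagged as speech (the excerpt does not spell out the framing), a minimal frame-level scorer can be sketched as below; `vad_rates` is a hypothetical name, not VADLite's API:

```python
def vad_rates(ref, hyp):
    """Frame-level speech hit rate (SHR) and false alarm rate (FAR).

    ref, hyp: equal-length sequences of 0/1 frame labels, where 1
    marks a speech frame in the reference and hypothesis respectively.
    """
    assert len(ref) == len(hyp), "label streams must be aligned"
    speech = sum(1 for r in ref if r == 1)
    nonspeech = len(ref) - speech
    hits = sum(1 for r, h in zip(ref, hyp) if r == 1 and h == 1)
    false_alarms = sum(1 for r, h in zip(ref, hyp) if r == 0 and h == 1)
    shr = hits / speech if speech else 0.0
    far = false_alarms / nonspeech if nonspeech else 0.0
    return shr, far
```

A high SHR with a low FAR is what makes VAD-gated feature extraction viable: openSMILE runs only on frames the detector flags, so missed speech loses data while false alarms waste battery.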
Dyadic interactions of couples are of interest as they provide insight into relationship quality and chronic disease management. Currently, ambulatory assessment of couples' interactions entails collecting data at random or scheduled times, which could miss significant couples' interaction/conversation moments. In this work, we developed, deployed and evaluated DyMand, a novel open-source smartwatch and smartphone system for collecting self-report and sensor data from couples based on partners' interaction moments. Our smartwatch-based algorithm uses the Bluetooth signal strength between two smartwatches, each worn by one partner, and a voice activity detection machine-learning algorithm to infer that the partners are interacting, and then to trigger data collection. We deployed the DyMand system in a 7-day field study and collected data about social support, emotional well-being, and health behavior from 13 Swiss-based heterosexual couples (N=26) in which one partner was managing type 2 diabetes mellitus. Our system triggered 99.1% of the expected number of sensor and self-report data when the app was running, and 77.6% of algorithm-triggered recordings contained partners' conversation moments compared to 43.8% for scheduled triggers. The usability evaluation showed that DyMand was easy to use. DyMand can be used by social, clinical, or health psychology researchers to understand the social dynamics of couples in everyday life, and for developing and delivering behavioral interventions for couples who are managing chronic diseases.
... A typical centralized SER system has three parts: data acquisition, data transfer, and emotion classification [5]. Under this framework, the client typically shares the raw speech samples or the acoustic features derived from the speech samples (to obfuscate the actual content of the conversation) to the remote cloud servers for emotion recognition [6]. However, the same speech signal carries rich information about individual traits (e.g., age, gender) and states (e.g., health status), many of which can be deemed sensitive from an application point of view. ...
Speech emotion recognition (SER) processes speech signals to detect and characterize expressed perceived emotions. Many SER application systems often acquire and transmit speech data collected at the client-side to remote cloud platforms for inference and decision making. However, speech data carry rich information not only about emotions conveyed in vocal expressions, but also other sensitive demographic traits such as gender, age and language background. Consequently, it is desirable for SER systems to have the ability to classify emotion constructs while preventing unintended/improper inferences of sensitive and demographic information. Federated learning (FL) is a distributed machine learning paradigm that coordinates clients to train a model collaboratively without sharing their local data. This training approach appears secure and can improve privacy for SER. However, recent works have demonstrated that FL approaches are still vulnerable to various privacy attacks like reconstruction attacks and membership inference attacks. Although most of these have focused on computer vision applications, such information leakages exist in the SER systems trained using the FL technique. To assess the information leakage of SER systems trained using FL, we propose an attribute inference attack framework that infers sensitive attribute information of the clients from shared gradients or model parameters, corresponding to the FedSGD and the FedAvg training algorithms, respectively. As a use case, we empirically evaluate our approach for predicting the client's gender information using three SER benchmark datasets: IEMOCAP, CREMA-D, and MSP-Improv. We show that the attribute inference attack is achievable for SER systems trained using FL. We further identify that most information leakage possibly comes from the first layer in the SER model.
Social networks are the persons surrounding a patient who provide support, circulate information, and influence health behaviors. For patients seen by neurologists, social networks are one of the most proximate social determinants of health that are actually accessible to clinicians, compared with wider social forces such as structural inequalities. We can measure social networks and related phenomena of social connection using a growing set of scalable and quantitative tools increasing familiarity with social network effects and mechanisms. This scientific approach is built on decades of neurobiological and psychological research highlighting the impact of the social environment on physical and mental well-being, nervous system structure, and neuro-recovery. Here, we review the biology and psychology of social networks, assessment methods including novel social sensors, and the design of network interventions and social therapeutics.
This study is a part of a research effort to develop the Questionnaire for User Interface Satisfaction (QUIS). Participants, 150 PC user group members, rated familiar software products. Two pairs of software categories were compared: 1) software that was liked and disliked, and 2) a standard command line system (CLS) and a menu driven application (MDA). The reliability of the questionnaire was high, Cronbach’s alpha=.94. The overall reaction ratings yielded significantly higher ratings for liked software and MDA over disliked software and a CLS, respectively. Frequent and sophisticated PC users rated MDA more satisfying, powerful and flexible than CLS. Future applications of the QUIS on computers are discussed.
Physiological mechano-acoustic signals, often with frequencies and intensities that are beyond those associated with the audible range, provide information of great clinical utility. Stethoscopes and digital accelerometers in conventional packages can capture some relevant data, but neither is suitable for use in a continuous, wearable mode, and both have shortcomings associated with mechanical transduction of signals through the skin. We report a soft, conformal class of device configured specifically for mechano-acoustic recording from the skin, capable of being used on nearly any part of the body, in forms that maximize detectable signals and allow for multimodal operation, such as electrophysiological recording. Experimental and computational studies highlight the key roles of low effective modulus and low areal mass density for effective operation in this type of measurement mode on the skin. Demonstrations involving seismocardiography and heart murmur detection in a series of cardiac patients illustrate utility in advanced clinical diagnostics. Monitoring of pump thrombosis in ventricular assist devices provides an example in characterization of mechanical implants. Speech recognition and human-machine interfaces represent additional demonstrated applications. These and other possibilities suggest broad-ranging uses for soft, skin-integrated digital technologies that can capture human body acoustics.
There has been increasing attention in the literature to wearable acoustic recording devices, particularly to examine naturalistic speech in disordered and child populations. Recordings are typically analyzed using automatic procedures that critically depend on the reliability of the collected signal. This work describes the acoustic amplitude response characteristics and the possibility of acoustic transmission loss using several shirts designed for wearable recorders. No difference was observed between the response characteristics of different shirt types or between shirts and the bare-microphone condition. Results are relevant for research, clinical, educational, and home applications in both practical and theoretical terms.
Multi-microphone arrays allow for the use of spatial filtering techniques that can greatly improve noise reduction and source separation. However, for speech and audio data, work on noise reduction or separation has focused primarily on one- or two-channel systems. Because of this, databases of multichannel environmental noise are not widely available. DEMAND (Diverse Environments Multi-channel Acoustic Noise Database) addresses this problem by providing a set of 16-channel noise files recorded in a variety of indoor and outdoor settings. The data was recorded using a planar microphone array consisting of four staggered rows, with the smallest distance between microphones being 5 cm and the largest being 21.8 cm. DEMAND is freely available under a Creative Commons license to encourage research into algorithms beyond the stereo setup.
The Texas Instruments/Massachusetts Institute of Technology (TIMIT) corpus of read speech has been designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. TIMIT contains speech from 630 speakers representing 8 major dialect divisions of American English, each speaking 10 phonetically-rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic, and word transcriptions, as well as speech waveform data for each spoken sentence. The release of TIMIT contains several improvements over the Prototype CD-ROM released in December, 1988: (1) full 630-speaker corpus, (2) checked and corrected transcriptions, (3) word-alignment transcriptions, (4) NIST SPHERE-headered waveform files and header manipulation software, (5) phonemic dictionary, (6) new test and training subsets balanced for dialectal and phonetic coverage, and (7) more extensive documentation.
A recording device called the Electronically Activated Recorder (EAR) is described. The EAR tape-records for 30 sec once every 12 min for 2–4 days. It is lightweight and portable, and it can be worn comfortably by participants in their natural environment. The acoustic data samples provide a nonobtrusive record of the language used and settings entered by the participant. Preliminary psychometric findings suggest that the EAR data accurately reflect individuals’ natural social, linguistic, and psychological lives. The data presented in this article were collected with a first-generation EAR system based on analog tape recording technology, but a second generation digital EAR is now available.
The Language ENvironment Analysis (LENA) System is a relatively new recording technology that can be used to investigate typical child language acquisition and populations with language disorders. The purpose of this paper is to familiarize language acquisition researchers and speech-language pathologists with how the LENA System is currently being used in research. The authors outline issues in peer-reviewed research based on the device. Considerations when using the LENA System are discussed.
This paper presents TarsosDSP, a framework for real-time audio analysis and processing. Most libraries and frameworks offer either audio analysis and feature extraction or audio synthesis and processing. TarsosDSP is one of only a few frameworks that offer analysis, processing, and feature extraction in real time, a unique feature in the Java ecosystem. The framework contains practical audio processing algorithms, can be extended easily, and has no external dependencies. Each algorithm is implemented as simply as possible thanks to a straightforward processing pipeline. TarsosDSP's features include a resampling algorithm, onset detectors, a number of pitch estimation algorithms, a time-stretch algorithm, a pitch-shifting algorithm, and an algorithm to calculate the Constant-Q transform. The framework also allows simple audio synthesis, some audio effects, and several filters. The open-source framework is a valuable contribution to the MIR community and an ideal fit for interactive MIR applications on Android.
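As a toy illustration of the kind of pitch estimator TarsosDSP bundles, a bare autocorrelation peak-picker can be sketched in a few lines; this is not TarsosDSP's algorithm (it ships YIN and several other estimators) nor its Java API, just the underlying idea:

```python
import math

def autocorr_pitch(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) of one audio frame by
    picking the lag with the largest autocorrelation within the
    plausible pitch range [fmin, fmax]. Returns 0.0 if no lag fits."""
    lag_min = int(sample_rate / fmax)   # shortest period considered
    lag_max = int(sample_rate / fmin)   # longest period considered
    best_lag, best_r = 0, 0.0
    for lag in range(lag_min, min(lag_max, len(frame) - 1)):
        # Unnormalized autocorrelation at this lag.
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if r > best_r:
            best_r, best_lag = r, lag
    return sample_rate / best_lag if best_lag else 0.0
```

For a pure 200 Hz sine sampled at 8 kHz, the autocorrelation peaks at a lag of 40 samples, giving the correct estimate; real estimators like YIN add normalization and interpolation to survive noise and octave errors.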
Given C samples, with ni observations in the ith sample, a test of the hypothesis that the samples are from the same population may be made by ranking the observations from 1 to Σni (giving each observation in a group of ties the mean of the ranks tied for), finding the C sums of ranks, and computing a statistic H. Under the stated hypothesis, H is distributed approximately as χ² with C − 1 degrees of freedom, unless the samples are too small, in which case special approximations or exact tables are provided. One of the most important applications of the test is in detecting differences among the population means. (Based in part on research supported by the Office of Naval Research at the Statistical Research Center, University of Chicago.)
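The H statistic described above is straightforward to compute directly. This sketch assigns mid-ranks to ties as the abstract describes, though it omits the usual tie-correction factor applied when ties are numerous:

```python
from itertools import chain

def kruskal_wallis_h(*samples):
    """Kruskal-Wallis H statistic for C samples of numeric observations.
    Under the null hypothesis, H ~ chi-square with C - 1 degrees of
    freedom (approximately, for samples that are not too small)."""
    pooled = sorted(chain.from_iterable(samples))
    n = len(pooled)
    # Mid-rank for each distinct value: the mean of the 1-based ranks
    # that a group of tied observations would occupy.
    ranks = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = ((i + 1) + j) / 2.0
        i = j
    # H = 12 / (n(n+1)) * sum(R_i^2 / n_i) - 3(n+1)
    total = sum(sum(ranks[x] for x in s) ** 2 / len(s) for s in samples)
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)
```

For three fully separated samples such as [1,2,3], [4,5,6], [7,8,9], the statistic evaluates to 7.2, exceeding the χ²(2) critical value of 5.99 at the 5% level.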