TILES Audio Recorder: An unobtrusive wearable solution to
track audio activity
Tiantian Feng
Signal Analysis and Interpretation
Lab, University of Southern California
Los Angeles, California
Amrutha Nadarajan
Signal Analysis and Interpretation
Lab, University of Southern California
Los Angeles, California
Colin Vaz
Signal Analysis and Interpretation
Lab, University of Southern California
Los Angeles, California
Brandon Booth
Signal Analysis and Interpretation
Lab, University of Southern California
Los Angeles, California
Shrikanth Narayanan
Signal Analysis and Interpretation
Lab, University of Southern California
Los Angeles, California
ABSTRACT
Most existing speech activity trackers used in human subject studies are bulky, record raw audio content which invades participant privacy, have complicated hardware and non-customizable software, and are too expensive for large-scale deployment. The present effort seeks to overcome these challenges by proposing the TILES Audio Recorder (TAR): an unobtrusive and scalable solution to track audio activity using an affordable miniature mobile device with an open-source app. For this recorder, we make use of the Jelly Pro Mobile, a pocket-sized Android smartphone, and employ two open-source toolkits: openSMILE and Tarsos-DSP. Tarsos-DSP provides a Voice Activity Detection capability that triggers openSMILE to extract and save audio features only when the subject is speaking. Experiments show that performing feature extraction only during speech segments greatly increases battery life, enabling the subject to wear the recorder up to 10 hours at a time. Furthermore, recording experiments with ground-truth clean speech show minimal distortion of the recorded features, as measured by root-mean-square error and cosine distance. The TAR app further provides subjects with a simple user interface that allows them to both pause feature extraction at any time and easily upload data to a remote server.
CCS CONCEPTS
• Human-centered computing → Ubiquitous and mobile computing design and evaluation methods;

KEYWORDS
Audio, wearable sensing, audio processing, openSMILE, privacy, human subjects study, audio feature recorder
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
WearSys ’18, June 10, 2018, Munich, Germany
©2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5842-2/18/06. . . $15.00
ACM Reference Format:
Tiantian Feng, Amrutha Nadarajan, Colin Vaz, Brandon Booth, and Shrikanth
Narayanan. 2018. TILES Audio Recorder: An unobtrusive wearable solution
to track audio activity. In WearSys ’18: 4th ACM Workshop on Wearable
Systems and Applications, June 10, 2018, Munich, Germany. ACM, New York,
NY, USA, Article 4, 6 pages.
1 INTRODUCTION
Rapid advances in signal acquisition and sensing technology have produced miniaturized and low-power sensors capable of monitoring people's activities, physiological functions, and their environment [ ]. When coupled with recent large-scale data analysis techniques, they show great promise for many applications, such as health monitoring, rehabilitation, fitness, and well-being [ ]. These advances have been primarily enabled by the ubiquitous mobile phone, which facilitates sensor setup, data transfer, and user feedback [ ]. These non-invasive devices have made the design and implementation of medical, psychological, and behavioral research studies easier by reducing participant burden. We focus on applications of these technologies to audio recording of human speech in the wild.
Audio recording devices have come a long way, from bulky phonographs to hand-held microphones and lapel mics. The legacy systems are cumbersome, obtrusive, expensive, and not scalable. They can also potentially invade privacy by recording and storing raw audio. Mobile phones, laptops, and wearables are becoming a part of our everyday life. By using the miniature mics on these devices, a digital recorder [ ] can overcome some of the drawbacks of the legacy systems, like obtrusiveness and poor scalability. These modern devices are also capable of capturing day-long audio samples in naturalistic environments. Most digital recording solutions, like the EAR [15], afford privacy by allowing the user to delete recordings retrospectively. This process demands additional effort from the user to ensure their privacy is not violated, and it potentially jeopardizes the privacy of the people the user interacts with.
A recently-developed audio recording solution for human subjects studies is the EAR [15]. EAR is an app that runs on a personal smartphone and records small fractions of audio through the day at regular intervals. The recording scheme senses audio only 5% of the day and does not have an intelligent recording trigger to initiate the recording. This could potentially lead to recordings containing no audio activity or failing to record salient audio activity. Other notable efforts in the development of wearable audio recorders include devices like the Sociometer [7]. This device contains a microphone, motion sensor, and proximity sensor embedded in a shoulder pad that can be strapped on. However, this device is not commercially available, which makes large-scale deployment prohibitive. SoundSense [ ] is another audio wearable device that classifies different auditory scenes (music, speech, ambient sound) and events (gender, music, speech analysis) surrounding the user. It runs on mobile devices, so it could scale to large study populations, but the platform is not publicly available. Unfortunately, recording solutions using phones usually suffer from poor recording quality because people commonly carry phones in their pockets or handbags. A study by VanDam [19] has shown that denser material composition and indirect microphone placement introduce amplitude loss at frequencies higher than 1 kHz.
In this paper, we introduce a new digital recording system called the TILES Audio Recorder (TAR), which enables passive sensing of audio activity in sensitive environments where recording raw audio is not a viable option. The recorder comes with an audio activity detector and a customizable recording scheme to initiate audio feature extraction as required by the demands of the investigator. The recorded features are not audible, but a large number of studies have successfully predicted emotion or stress from such speech features [5].
In the following sections, we give an overview of the TAR system, provide experimental results on its expected battery life and the robustness of its audio activity detection, and analyze the audio feature degradation at various distances from an audio source. Our results show that the system works as intended and introduces negligible feature degradation.
2 TAR SYSTEM DESCRIPTION
In this section, we provide a detailed description of the hardware and software comprising the TAR system. TAR is a combination of a lightweight Android mobile device and an Android app to record audio features.
2.1 Motivation
As we described in Section 1, user privacy and unobtrusiveness are critical factors in designing a wearable audio recorder. In this study, we propose TAR, which is especially useful for deployment in environments like a hospital where privacy protections restrict raw audio recording. Moreover, TAR is designed to be unobtrusive and comfortable when worn. In addition, investigators can easily reproduce the setup, given that TAR runs on consumer hardware and the software is freely available on GitHub. TAR introduces various features to address many study design challenges when using wearable audio recorders:
• affordable cost allows scaling to a large number of participants

1TILES: Tracking IndividuaL pErformance using Sensors
2TAR github:
Figure 1: A recommended setup for TAR
• ease of placing the sensor close to the wearer's mouth provides consistently high recording quality
• extracting and storing audio features and immediately discarding raw audio protects user privacy
• an intuitive user interface enables wireless data transfer and allows data capture to be paused and resumed at any time
2.2 Hardware
TAR is designed to run on the Android platform because it is open source and runs on a variety of devices, including small, lightweight, budget-friendly portable smartphones like the Jelly Pro [2]. The Jelly Pro smartphone, used in this work, has the following specifications:
• Dimensions and weight: 3.07 x 4.92 x 1.97 in, 60.4 g
• Processor: Quad-core, 1.1 GHz
• Battery capacity: 950 mAh
• Memory: 2 GB RAM / 16 GB ROM
• Sensors: microphone, camera, compass, gyroscope
A complete hardware setup consists of a Jelly Pro phone, a phone case, and a metal badge clip with PVC straps. The phone case is commercially available, and the Jelly Pro fits inside it. A metal badge clip, fastened to the case, is clipped to the user's clothing near the collar. With this setup, the distance from the wearer's mouth to the phone's integrated microphone typically ranges from 15 cm to 30 cm. A recommended setup is shown in Fig. 1.
2.3 Software
In this section, we describe the TAR Android application and the various service components and open-source tools that enable recording of audio information.
App services. The TAR Android application starts when the mobile phone switches on. The TAR app includes the following services running in the background:
• a Voice Activity Detection (VAD) service that runs periodically to listen for speech and, if speech is detected, triggers the audio feature extraction service
• an audio feature extraction service that extracts audio features from the raw audio stream; the raw audio is deleted immediately after the features are extracted, and only the features are saved
• a battery service that records the battery percentage every 5 minutes, which helps monitor and optimize battery performance
Figure 2: Recording control logic output with a simulated data stream
• a data upload service to securely transfer data to a remote server over WiFi
Open-source libraries. Most current mobile phone speech processing libraries rely on a frontend-backend architecture. The frontend running on the phone typically records and transmits audio snippets to the backend over an Internet connection. The backend, which commonly runs on a remote server, retrieves the audio recordings and performs feature extraction. Such a solution reduces the computational burden on the phone and decouples data acquisition from processing. However, in our experience, fewer participants are willing to wear devices that attempt to transmit the raw audio. To allow TAR to be used where security and privacy are a concern, the TAR app runs real-time speech processing engines on the phone and avoids transmitting raw audio.
In our TAR application, we incorporate two real-time speech processing libraries: openSMILE [8] and TarsosDSP [17]. TarsosDSP is a Java library for audio processing that outputs real-time audio energy and pitch; it serves as a straightforward implementation of a VAD in TAR. openSMILE is a tool for extracting a wide range of features from audio signals, and the extracted features have been used for classifying emotional states and speaker properties [5]. In our application, the feature extraction runs entirely on the mobile device, and an Internet connection is not required. Another benefit TAR gains from openSMILE is that it is highly configurable, allowing investigators to extract various combinations of audio features to suit different purposes. We also evaluated other potential DSP libraries, including Funf [4]. However, Funf only extracts FFT coefficients and MFCCs from the audio stream, constraining researchers from acquiring other valuable features from the audio.
3 VAD-TRIGGERED RECORDING SCHEME
In this section, we describe the VAD-triggered recording scheme used by TAR. The scheme attempts to record audio features only when audio activity is detected, in order to reduce storage requirements and increase battery life. We use TarsosDSP [17] to implement the VAD: TarsosDSP outputs the audio energy every 10 ms, which TAR then uses to determine the presence of audio activity.
Figure 3: TAR battery life vs. To
3.1 Recording Parameters
TAR initiates feature recording when the VAD service is active and the recording trigger is on. One parameter specifies the duration for which the VAD service runs, and another, To, controls the length of the idle state between two consecutive VAD runs; a larger To means a longer idle state during which no VAD service takes place. In addition, we monitor the audio energy output from Tarsos-DSP to ensure that feature extraction happens only when there is some audio activity. Two additional parameters control when the recording trigger is set to on: h, a threshold on the speech energy, and a minimum-duration parameter that sets how long the speech energy must stay above h to trigger feature extraction. Once triggered, TAR runs the audio feature extraction service for l seconds. TAR also offers the flexibility to record audio features periodically even when the VAD service has not detected speech, in order to sample some environment information.
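The trigger condition described above (energy staying above h for a minimum duration, sampled every 10 ms) can be sketched as follows. The function name, parameter names, and default values here are illustrative placeholders, not TAR's actual implementation.

```python
# Illustrative sketch of TAR's VAD trigger condition.
# h and min_frames are hypothetical defaults, not TAR's real settings.

def should_trigger(energies_db, h=-56.0, min_frames=10):
    """Return True once the frame energy (dB) stays above threshold h
    for min_frames consecutive 10 ms frames (here, 100 ms total)."""
    run = 0
    for e in energies_db:
        run = run + 1 if e > h else 0  # count consecutive above-threshold frames
        if run >= min_frames:
            return True
    return False
```

With these defaults, a 100 ms burst above the threshold fires the trigger, while shorter energy spikes (e.g., a door slam spanning a few frames) do not.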
The recording parameters in TAR are tunable and should be adjusted to provide the highest resolution of audio samples for the duration required by the researcher. Figure 3 presents the average battery operating time of TAR for different To values (in seconds). We tested and measured the battery operating time of 20 different Jelly
Figure 4: Schematic of the experimental setup for examining VAD accuracy and feature degradation, where d1 = 15, d2 = 20, d3 = 25, and d4 = 30 cm
Pro devices for each value of To, with h set low enough that the VAD service triggered audio feature recording every time.
4 EXPERIMENTS
As described in Section 3, TAR uses a VAD to extract audio features only during regions of audio activity. As the device is intended for use in the wild, we tested the VAD accuracy in noisy scenarios as a function of the energy threshold h. We also examine whether the device introduces any quality degradation in the extracted features. The following subsections describe the experimental setup and discuss the results of the VAD accuracy and feature degradation experiments.
4.1 Experimental Setup
To test VAD accuracy and feature distortion, we proposed a record-
ing setup to allow TAR to record speech amplied through a speaker.
We synthetically created two sets of audio, one for testing VAD
accuracy and the other for testing feature degradation. For the VAD
experiment, we randomly chose 250 gender-balanced utterances
from the TIMIT database [
]. We concatenated all utterances and
introduced silence of random duration (between 2to 12 seconds) be-
tween utterances to simulate non-continuous speech. Additionally,
we added noise from the DEMAND database [
] to the concate-
nated utterances at 10 dB, 5dB, and 0dB SNR levels to test the VAD
performance in noisy speech. For the feature degradation experi-
ment, we randomly sampled 1000 gender-balanced utterances from
the TIMIT database and concatenated them into one le.
The audio files were then played through a loudspeaker. Multiple TARs, set at distances of 15 cm, 20 cm, 25 cm, and 30 cm from the loudspeaker, recorded the audio playback simultaneously. We chose these distances to mimic the typical range of distances between the TAR and a user's mouth during use. Figure 4 shows a diagram of the experimental setup. For the VAD accuracy experiment, we stored the outputs from Tarsos-DSP and later applied different values of the threshold h to the stored output. For the feature quality experiment, we modified the operation of TAR to record continuously. The features recorded by the devices were compared to the features extracted from the raw audio file.
4.2 VAD Accuracy Experiment
We set TAR at a distance of 20 cm from the speaker in the VAD
accuracy experiment as it is a typical distance from a user’s mouth.
In order to evaluate our Tarsos-DSP VAD scheme, we rst output
a VAD result from the clean audio source le using the
function provided in the VOICEBOX speech processing toolbox
]. The VAD result is a binary decision on a frame-by-frame basis,
Figure 5: VAD prediction accuracy as a function of the energy threshold h, using the Tarsos-DSP library
Figure 6: Confusion matrix of VAD output (%) using the Tarsos-DSP library against the baseline output, where h = −56 dB
where each frame is 10 ms in length. Following this, we manually correct the VOICEBOX output in the regions that give false predictions, producing a baseline VAD output. Meanwhile, we use the Tarsos-DSP output to generate the VAD decisions on the recorded audio in the clean and noisy conditions. As stated in Section 3, the VAD output is especially sensitive to h. To choose the best h, we swept h from −65 dB to −50 dB in 1 dB steps, with the minimum-duration parameter fixed at 100 ms. Finally, we compare the baseline VAD output and the Tarsos-DSP VAD outputs at each h.
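The frame-level comparison between the baseline and Tarsos-DSP decisions can be sketched as below. This is a hypothetical helper (names are our own), assuming both outputs are aligned sequences of 10 ms binary decisions.

```python
from collections import Counter

def vad_accuracy(baseline, predicted):
    """Frame-level accuracy and confusion counts between two aligned
    binary VAD decision sequences (one 0/1 value per 10 ms frame)."""
    assert len(baseline) == len(predicted)
    counts = Counter(zip(baseline, predicted))  # (truth, pred) -> frame count
    correct = counts[(0, 0)] + counts[(1, 1)]
    return correct / len(baseline), counts
```

The returned counts give the four confusion-matrix cells, from which precision and negative predictive rate, as reported in Figure 6, can be derived.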
Figure 5 presents the VAD prediction accuracy, in percent, with respect to h. We observe that the VAD accuracy initially increases and then decreases significantly. The difference in prediction accuracy among the four SNR conditions is less significant when h is below −64 dB or above −58 dB. The prediction accuracy peaks at h = −56 dB for clean speech, −56 dB at SNR = 10 dB, −55 dB at SNR = 5 dB, and −54 dB at SNR = 0 dB. Figure 6 presents the confusion matrix of VAD outputs between Tarsos-DSP and the baseline at h = −56 dB. We observe that in the clean speech and high-SNR (10 dB, 5 dB) conditions, the negative predictive rate is above 75%, while in the poor-SNR (0 dB) condition it drops to 68.98%. In addition, we note that the precision rates vary only slightly (about 5%) across SNR conditions. We advise users to set up similar recording experiments to calibrate h for their own usage.
4.3 Feature Quality Experiment
In this experiment, we apply the emobase cong setting [
] to ex-
tract the openSMILE low-level descriptors (LLDs) from the ground
TILES Audio Recorder: An unobtrusive wearable solution to track audio activity WearSys ’18, June 10, 2018, Munich, Germany
Table 1: Deviation of LLDs in utterance-region
RMSE (root-mean-squared error)
LLD 15cm 20cm 25cm 30cm
F0(Hz) 66.99 55.86 66.33 65.65
F0Env(Hz) 45.87 41.10 51.46 45.13
Loudness 0.863 0.507 0.352 0.352
MFCC[1-14] 45.52 43.77 41.45 42.81
LSP[1-8] 0.551 0.492 0.444 0.479
Cosine Distance
LLD 15cm 20cm 25cm 30cm
MFCC[1-14] 0.199 0.172 0.175 0.184
LSP[1-8] 0.00195 0.00153 0.00139 0.00150
truth concatenated TIMIT le and recordings. The feature set con-
tains 26 acoustic Low-Level Descriptors(LLDs), such as voicing re-
lated features(Pitch, voicing probability), energies (Intensity, Loud-
ness), zero-crossing rate(ZCR) and cepstral information(MFCCs
1-14). The proposed LLDs are extracted every 10 ms using a 25 ms
Hamming window
We compared the LLDs by measuring the root-mean-squared error (RMSE) and the cosine distance between the features extracted from the source audio and those recorded by TAR. Some features, however, are sensitive to energy levels and often show higher degradation in silence regions. Thus, we decided to compare the feature values only in utterance regions. First, we perform a sanity check on the raw signals. Using the baseline VAD output to remove the signals in the silence regions, we obtain RMSEs between the source audio signal and the recorded signals in the utterance regions of 0.1297 (15 cm), 0.0790 (20 cm), and 0.0598 (30 cm), with the smallest RMSE at 25 cm.
We then use the baseline VAD output to remove the LLDs in the silence regions. The RMSE and cosine distance between the ground-truth LLDs and the TAR-recorded LLDs for all test cases are listed in Table 1. Loudness, mel-frequency cepstral coefficients (MFCC), and line spectral pairs (LSP) give the largest RMSE for TAR#1 (15 cm) and the lowest RMSE for TAR#3 (25 cm). Loudness is extremely sensitive to recording distance: the maximum and minimum RMSE values are 0.863 and 0.352, respectively. The RMSE of most LLDs initially decreases with recording distance but starts to increase when the recording distance exceeds 25 cm. Meanwhile, we observed that MFCC and LSP show relatively small variation in cosine distance across recording distances. The deviations in LLDs agree with our sanity-check results, where TAR#3 (25 cm) shows minimum degradation in both the waveform signals and the LLDs.
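The two deviation measures can be sketched as follows. This is an illustrative helper (the function name is our own), not the paper's evaluation code; it assumes silence frames have already been removed and that the two feature streams are frame-aligned.

```python
import math

def lld_deviation(ref, rec):
    """RMSE and mean per-frame cosine distance between reference and
    recorded LLD streams (each a list of equal-length feature vectors,
    e.g., the 14 MFCC values for one 10 ms frame)."""
    sq_err, cos_dists, n_vals = 0.0, [], 0
    for a, b in zip(ref, rec):
        sq_err += sum((x - y) ** 2 for x, y in zip(a, b))
        n_vals += len(a)
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        cos_dists.append(1.0 - dot / norm)  # 0 for identical directions
    return math.sqrt(sq_err / n_vals), sum(cos_dists) / len(cos_dists)
```

Identical streams yield (0, 0); orthogonal frame vectors yield a cosine distance of 1, matching the ranges reported in Table 1.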
We plot the histograms of RMSE for F0 and loudness, and the histogram of cosine distance for MFCC, in Fig. 7. We perform the Kruskal-Wallis test [12] to investigate the consistency of RMSE among the four recording distances. We find that the difference is significant for loudness (p = 0.0018), but RMSE is fairly consistent for F0 (p = 0.3607) and F0Env (p = 0.2598). MFCC exhibits the most consistent measurements in both RMSE (p = 0.8550) and cosine distance (p = 0.6005). These results confirm that energy-related LLDs are sensitive to recording distance, while pitch and spectral LLDs yield consistent patterns across recording distances. We also observe that the majority of F0 errors lie under 10 Hz, which confirms the robustness of the F0 feature recorded by TAR.
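The Kruskal-Wallis comparison across the four distances can be sketched as below: a minimal H-statistic implementation following Kruskal and Wallis [12], with a hypothetical function name and without the tie-variance correction that full statistics packages apply.

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic for k independent samples
    (ties get averaged ranks; no tie correction of the variance)."""
    pooled = sorted(v for g in groups for v in g)
    rank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2  # mean of 1-based ranks i+1..j
        i = j
    n = len(pooled)
    # H = 12/(n(n+1)) * sum(R_i^2 / n_i) - 3(n+1)
    s = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n * (n + 1)) * s - 3 * (n + 1)
```

The p-values reported above then follow from comparing H against a chi-squared distribution with k − 1 degrees of freedom.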
(a) F0-RMSE histogram
(b) Loudness-RMSE histogram
(c) MFCC-cosine-distance histogram
Figure 7: (a) and (b) present the F0-RMSE and loudness-RMSE distributions; (c) displays the MFCC-cosine-distance distribution
5 USER INTERFACE
Figure 8 shows the GUI of the TAR application. We designed the GUI with two buttons for simplicity of use. The top button allows users to quickly disable the audio recording service for privacy reasons. The other button activates the upload of collected audio features to the back-end server.
We tested TAR with 20 participants to assess user interface satisfaction. The participants were asked to wear and use the TAR for a minimum of two hours. The Questionnaire for User Interface Satisfaction (QUIS), an instrument based on [6], was used to rate the TAR interface in terms of overall satisfaction with the software, the screens, ease of learning to use the interface, and system capabilities. TAR performed satisfactorily in all these aspects. Regarding the overall reaction to the software, average ratings of 7.8, 8.6, and 7.9 were reported on 10-point Likert scales of Terrible/Wonderful, Difficult/Easy, and Frustrating/Satisfying, respectively. Ease of learning to use the system was rated 9, as was the straightforwardness of performing tasks. These numbers show that the intended test users felt the UI was easy, simple, and straightforward to use.
6 CONCLUSIONS AND FUTURE WORK
In this paper, we presented TAR, an unobtrusive wearable solution to track audio activity in the wild. We described the hardware and software comprising the TAR system and explained the VAD-triggered recording scheme in the TAR Android app. We designed two experiments to examine the accuracy of the proposed VAD scheme as well as the quality of the extracted audio features. The results
(a) (b)
Figure 8: TAR Android App GUI. (a) UI display when TAR is
idle and not recording. (b) UI display when TAR recording
is turned on.
conrm the reliability of the VAD scheme. We observe minimal
distortion of the recorded features except energy LLDs as measured
by root-mean-square error and cosine distance. We also provided
TAR to 20 participants, and we received high user satisfaction
with the simple user interface. We believe the unobtrusiveness and
exibility of TAR, coupled with privacy-focused data acquisition,
makes it suitable for a wide range of people-centric audio sensing
Specically, within next four months, we plan to deploy current
TAR system over 150 volunteers who work at the USC Hospital.
We intend to collect only the audio features from individual vol-
unteer during working shift. We plan to use the collected audio
features to investigate the intercommunication behavior, individual
performance, and stress level of participants.
In future work, we plan to add functionality to automatically learn the recording parameters instead of hand-tuning them. We also aim to bring online analysis of the collected features, to classify contextual information or variations in the wearer's emotion, to future updates. Finally, we plan to publish the TAR app on the Google Play store, making it publicly accessible.
ACKNOWLEDGMENTS
The research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA Contract No. 2017-17042800005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation thereon.
REFERENCES
[1] [n. d.]. emobase.
[2] [n. d.]. JELLY.
[3] [n. d.]. VOICEBOX: Speech Processing Toolbox for MATLAB.
[4] Nadav Aharony, Wei Pan, Cory Ip, Inas Khayal, and Alex Pentland. 2011. Social fMRI: Investigating and shaping social mechanisms in the real world. Pervasive and Mobile Computing 7 (2011), 643–659.
[5] Carlos Busso, Zhigang Deng, Serdar Yildirim, Murtaza Bulut, Chul Min Lee, Abe Kazemzadeh, Sungbok Lee, Ulrich Neumann, and Shrikanth Narayanan. 2004. Analysis of Emotion Recognition Using Facial Expressions, Speech and Multimodal Information. In Proceedings of the 6th International Conference on Multimodal Interfaces (ICMI '04). ACM, New York, NY, USA, 205–211.
[6] John P. Chin, Virginia A. Diehl, and Kent L. Norman. 1988. Development of an instrument measuring user satisfaction of the human-computer interface. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 213–218.
[7] Tanzeem Choudhury and Alex Pentland. 2002. The Sociometer: A Wearable Device for Understanding Human Networks (CSCW '02 Workshop).
[8] Florian Eyben, Martin Wöllmer, and Björn Schuller. 2010. openSMILE: The Munich Versatile and Fast Open-source Audio Feature Extractor. In Proceedings of the 18th ACM International Conference on Multimedia. ACM, New York, NY, USA, 1459–1462.
[9] Hillary Ganek and Alice Eriks-Brophy. 2018. Language ENvironment Analysis (LENA) system investigation of day long recordings in children: A literature review. Journal of Communication Disorders 72 (2018), 77–85.
[10] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett. 1993. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N 93 (Feb. 1993).
[11] T. Klingeberg and M. Schilling. 2012. Mobile wearable device for long term monitoring of vital signs. Computer Methods and Programs in Biomedicine 106, 2 (2012), 89–96.
[12] William H. Kruskal and W. Allen Wallis. 1952. Use of Ranks in One-Criterion Variance Analysis. J. Amer. Statist. Assoc. 47, 260 (1952), 583–621.
[13] Yuhao Liu, James J. S. Norton, Raza Qazi, Zhanan Zou, Kaitlyn R. Ammann, Hank Liu, Lingqing Yan, Phat L. Tran, Kyung-In Jang, Jung Woo Lee, Douglas Zhang, Kristopher A. Kilian, Sung Hee Jung, Timothy Bretl, Jianliang Xiao, Marvin J. Slepian, Yonggang Huang, Jae-Woong Jeong, and John A. Rogers. 2016. Epidermal mechano-acoustic sensing electronics for cardiovascular diagnostics and human-machine interfaces. Science Advances 2, 11 (2016).
[14] T. Martin, E. Jovanov, and D. Raskovic. 2000. Issues in wearable computing for medical monitoring applications: a case study of a wearable ECG monitoring device. In Digest of Papers. Fourth International Symposium on Wearable Computers.
[15] Matthias R. Mehl, James W. Pennebaker, D. Michael Crow, James Dabbs, and John H. Price. 2001. The Electronically Activated Recorder (EAR): A device for sampling naturalistic daily activities and conversations. Behavior Research Methods, Instruments, & Computers 33, 4 (Nov. 2001), 517–523.
[16] Shyamal Patel, Hyung Park, Paolo Bonato, Leighton Chan, and Mary Rodgers. 2012. A review of wearable sensors and systems with application in rehabilitation. Journal of NeuroEngineering and Rehabilitation 9, 1 (2012), 21.
[17] Joren Six, Olmo Cornelis, and Marc Leman. 2014. TarsosDSP, a Real-Time Audio Processing Framework in Java. In Proceedings of the 53rd AES Conference.
[18] Joachim Thiemann, Nobutaka Ito, and Emmanuel Vincent. 2013. The diverse environments multi-channel acoustic noise database: A database of multichannel environmental noise recordings. Journal of the Acoustical Society of America 133, 5 (2013), 3591–3591.
[19] Mark VanDam. 2014. Acoustic characteristics of the clothes used for a wearable recording device. The Journal of the Acoustical Society of America 136, 4 (2014).
... The Atom phone ran custom software and served as a social-sensing, environment-sensing badge, and audio collection device using Bluetooth and its built-in microphone. The audio collection capability was handled by the TILES Audio Recorder (TAR), which immediately processed recorded audio samples by storing anonymized features and discarding the recording 35 . Participants charged their phones when they were not at work. ...
... We presented an analysis of the audio recorder in 35 and followed the same procedure as 16 . TAR primarily extracted the audio features using openSMILE 40 . ...
... OpenSMILE is a widely used tool for extracting a wide range of features from audio signals. To test the feature distortion from the recording device, a recording setup was proposed in 35 to allow TAR to record speech amplified through a speaker. In this feature degradation experiment, 1000 gender-balanced utterances from the TIMIT 44 database were randomly sampled and concatenated into one file. ...
Full-text available
Measurement(s) Stress • Burnout • Affect • Depression • Sleep • Physical Activity Measurement • Alcohol Use History • Frequency Any Tobacco Use • Personality • Social Support • Intragroup Conflict • Challenge and Hindrance Stressors • Demographics • Context and Atypical Events • Daily Stressors • Most Stressful Event • Work Context • Job Performance • Job Satisfaction • Stressors at Work • Charting at Home • Coworker Trust • Social Networks at Work • Socialization Outside of Work • Use of Wellness Resources • Heart Rate • Step Count • Acoustic Features • Team Interactions • Proximity to Key Objects • Cell Phone Use • Hospital Contextual Data • Coping with Stress • Productivity at Work • Pride at Work • Teamwork • Support System Technology Type(s) Perceived Stress Scale - 14 Questionnaire • Survey • Patient Health Questionnaire - 9 Item • Pittsburgh Sleep Quality Index • FitBit • International Physical Activity Questionnaire (August 2002) Short Last 7 Days Self-Administered Format • Unihertz Atom Phone • Minew E8- TILES Interaction Sensors • Minew E8- Eddystone Beach • Rescuetime • Evaluations • Patient Census • Interview Sample Characteristic - Organism Homo sapiens Sample Characteristic - Location Los Angeles County and University of Southern California Medical Center
... In order to quantitatively study speech activity among nursing professionals, the present study uses real-life audio collected from a longitudinal, large-scale study undertaken in a hospital setting. Speech activity and participant proximity to different locations are captured by a novel wearable audio sensor [23], alongside other multimodal measurements of physiology and activity [24]. In summary, we aim to quantitatively answer the questions below: ...
... Participants were asked to wear a wearable audio sensor called TAR (TILES Audio Recorder) [23] during each work shift throughout the 10-week period. TAR runs on a small, lightweight, budget-friendly portable Android platform called the Jelly Pro [26]. ...
Interpersonal spoken communication is central to human interaction and the exchange of information. Such interactive processes involve not only speech and spoken language but also non-verbal cues such as hand gestures, facial expressions, and non-verbal vocalizations that are used to express feelings and provide feedback. These multimodal communication signals carry a variety of information about people: traits like gender and age as well as physical and psychological states and behavior. This work uses wearable multimodal sensors to investigate interpersonal communication behaviors, focusing on the speaking patterns of healthcare providers, particularly nurses. We analyze longitudinal data collected from 99 nurses in a large hospital setting over ten weeks. The results indicate that speaking patterns differ across shift schedules and working units. Moreover, results show that speaking patterns combined with physiological measures can be used to predict affect measures and life satisfaction scores. The implementation of this work can be accessed at
... There are no search results that directly answer the question of how to deploy cleaning crews during periods of minimal track usage. The search results are about various topics such as hole cleaning during drilling operations (Ashok et al., 2021), social density monitoring for selective cleaning (Vu Le et al., 2021), audio recording (Feng et al., 2018) and noise removal (Goyal et al., 2021), deepwater drilling operations (Johnson et al., 2021), and robotic SPECT imaging. It is possible that the question is too specific or there is not enough information to provide a relevant answer. ...
This article explores the application of artificial intelligence (AI) techniques in the cleaning of railway tracks and its impact on operational efficiency and safety. By leveraging AI-enabled robots, predictive analytics, real-time monitoring, and automated track inspection systems, railways can achieve enhanced cleanliness, optimized cleaning schedules, and proactive maintenance measures. AI-powered robots equipped with advanced sensors and machine learning algorithms autonomously identify and remove debris, minimizing the risk of accidents and ensuring uninterrupted train operations. Predictive analytics models analyze historical data and passenger flow information to optimize cleaning schedules, reducing disruptions and improving efficiency. Real-time monitoring systems detect potential maintenance issues in advance, allowing for timely preventive measures. Automated track inspection systems powered by AI proactively detect anomalies, ensuring a higher level of quality assurance and facilitating prompt repairs. As AI technology advances, the railway industry can anticipate further innovations, revolutionizing track maintenance and contributing to a more reliable and safe transportation system.
... In this study, researchers instructed participants to wear a Fitbit Charge 2 [25], an OMsignal garment-based sensor [33], and a customized audio badge [18], which collectively track heart rate, physical activity, speech characteristics, and many other human-centric signals. Participants were asked to wear the OMsignal garment and audio badge only during their work shifts due to the battery limitations of these devices. ...
Continuously-worn wearable sensors enable researchers to collect copious amounts of rich bio-behavioral time series recordings of real-life activities of daily living, offering unprecedented opportunities to infer novel human behavior patterns during daily routines. Existing approaches to routine discovery through bio-behavioral data rely either on pre-defined notions of activities or use additional non-behavioral measurements as contexts, such as GPS location or localization within the home, presenting risks to user privacy. In this work, we propose a novel wearable time-series mining framework, Hawkes point process On Time series clusters for ROutine Discovery (HOT-ROD), for uncovering behavioral routines from completely unlabeled wearable recordings. We utilize a covariance-based method to generate time-series clusters and discover routines via the Hawkes point process learning algorithm. We empirically validate our approach for extracting routine behaviors using a completely unlabeled time-series collected continuously from over 100 individuals both in and outside of the workplace during a period of ten weeks. Furthermore, we demonstrate this approach intuitively captures daily transitional relationships between physical activity states without using prior knowledge. We also show that the learned behavioral patterns can assist in illuminating an individual's personality and affect.
... Participants engaged in their typical daily activities while being equipped with ambulatory wearable devices and sensors to collect vocal acoustic and physiological signals throughout the period of data collection. A Fitbit Charge 2 was used to measure sleep activity and exercise, an OMsignal garment collected heart rate and breathing rate, and the Unihertz Jelly Pro smartphone, a small and lightweight phone worn on the lapel, was programmed to obtain vocal acoustic features from statistically sampled egocentric audio recordings (91). ...
Introduction: Intelligent ambulatory tracking can assist in the automatic detection of psychological and emotional states relevant to the mental health changes of professionals with high-stakes job responsibilities, such as healthcare workers. However, well-known differences in the variability of ambulatory data across individuals challenge many existing automated approaches seeking to learn a generalizable means of well-being estimation. This paper proposes a novel metric learning technique that improves the accuracy and generalizability of automated well-being estimation by reducing inter-individual variability while preserving the variability pertaining to the behavioral construct. Methods: The metric learning technique implemented in this paper entails learning a transformed multimodal feature space from pairwise similarity information between (dis)similar samples per participant via a Siamese neural network. Improved accuracy via personalization is further achieved by considering the trait characteristics of each individual as additional input to the metric learning models, as well as by using individual trait-based cluster criteria to group participants, followed by training a metric learning model for each group. Results: The outcomes of the proposed models demonstrate significant improvement over the other inter-individual variability reduction and deep neural baseline methods for stress, anxiety, positive affect, and negative affect. Discussion: This study lays the foundation for accurate estimation of psychological and emotional states in realistic and ambulatory environments leading to early diagnosis of mental health changes and enabling just-in-time adaptive interventions.
... These two sensing strategies promised to provide useful audio features that were secure from reconstruction attacks. Additionally, Feng et al. (2018) introduced a wearable audio solution that enhances privacy by sampling low-level acoustic characteristics instead of raw audio samples, used to study workplace stress (Mundnich et al. (2020); Yau et al. (2022)). On the other hand, there has been growing interest in recent years in using a trusted execution environment (TEE) in speech-centric applications. ...
Speech-centric machine learning systems have revolutionized many leading domains ranging from transportation and healthcare to education and defense, profoundly changing how people live, work, and interact with each other. However, recent studies have demonstrated that many speech-centric ML systems may need to be considered more trustworthy for broader deployment. Specifically, concerns over privacy breaches, discriminating performance, and vulnerability to adversarial attacks have all been discovered in ML research fields. In order to address the above challenges and risks, a significant number of efforts have been made to ensure these ML systems are trustworthy, especially private, safe, and fair. In this paper, we conduct the first comprehensive survey on speech-centric trustworthy ML topics related to privacy, safety, and fairness. In addition to serving as a summary report for the research community, we point out several promising future research directions to inspire the researchers who wish to explore further in this area.
... The model achieved 82.6% accuracy, 80.2% SHR, and 14.9% FAR. For online evaluation, we played 15 minutes of audio collected from a naturalistic context through a loudspeaker while the VADLite app performed real-time classification of the audio, just as was done by Feng et al. [25]. VADLite had an SHR and FAR of 91.6% and 5.5%, respectively. ...
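Assuming SHR is the fraction of reference speech frames correctly detected and FAR is the fraction of non-speech frames incorrectly flagged as speech (the excerpt does not spell out the framing), a minimal frame-level scorer can be sketched as below; `vad_rates` is a hypothetical name, not VADLite's API:

```python
def vad_rates(ref, hyp):
    """Frame-level speech hit rate (SHR) and false alarm rate (FAR).

    ref, hyp: equal-length sequences of 0/1 frame labels, where 1
    marks a speech frame in the reference and hypothesis respectively.
    """
    assert len(ref) == len(hyp), "label streams must be aligned"
    speech = sum(1 for r in ref if r == 1)
    nonspeech = len(ref) - speech
    hits = sum(1 for r, h in zip(ref, hyp) if r == 1 and h == 1)
    false_alarms = sum(1 for r, h in zip(ref, hyp) if r == 0 and h == 1)
    shr = hits / speech if speech else 0.0
    far = false_alarms / nonspeech if nonspeech else 0.0
    return shr, far
```

A high SHR with a low FAR is what makes VAD-gated feature extraction viable: openSMILE runs only on frames the detector flags, so missed speech loses data while false alarms waste battery.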
Dyadic interactions of couples are of interest as they provide insight into relationship quality and chronic disease management. Currently, ambulatory assessment of couples' interactions entails collecting data at random or scheduled times, which could miss significant couples' interaction/conversation moments. In this work, we developed, deployed and evaluated DyMand, a novel open-source smartwatch and smartphone system for collecting self-report and sensor data from couples based on partners' interaction moments. Our smartwatch-based algorithm uses the Bluetooth signal strength between two smartwatches, each worn by one partner, and a voice activity detection machine-learning algorithm to infer that the partners are interacting, and then to trigger data collection. We deployed the DyMand system in a 7-day field study and collected data about social support, emotional well-being, and health behavior from 13 Swiss-based heterosexual couples (N=26) in which one partner was managing type 2 diabetes mellitus. Our system triggered 99.1% of the expected number of sensor and self-report data when the app was running, and 77.6% of algorithm-triggered recordings contained partners' conversation moments compared to 43.8% for scheduled triggers. The usability evaluation showed that DyMand was easy to use. DyMand can be used by social, clinical, or health psychology researchers to understand the social dynamics of couples in everyday life, and for developing and delivering behavioral interventions for couples who are managing chronic diseases.
... A typical centralized SER system has three parts: data acquisition, data transfer, and emotion classification [5]. Under this framework, the client typically shares the raw speech samples or the acoustic features derived from the speech samples (to obfuscate the actual content of the conversation) to the remote cloud servers for emotion recognition [6]. However, the same speech signal carries rich information about individual traits (e.g., age, gender) and states (e.g., health status), many of which can be deemed sensitive from an application point of view. ...
Speech emotion recognition (SER) processes speech signals to detect and characterize expressed perceived emotions. Many SER application systems often acquire and transmit speech data collected at the client-side to remote cloud platforms for inference and decision making. However, speech data carry rich information not only about emotions conveyed in vocal expressions, but also other sensitive demographic traits such as gender, age and language background. Consequently, it is desirable for SER systems to have the ability to classify emotion constructs while preventing unintended/improper inferences of sensitive and demographic information. Federated learning (FL) is a distributed machine learning paradigm that coordinates clients to train a model collaboratively without sharing their local data. This training approach appears secure and can improve privacy for SER. However, recent works have demonstrated that FL approaches are still vulnerable to various privacy attacks like reconstruction attacks and membership inference attacks. Although most of these have focused on computer vision applications, such information leakages exist in the SER systems trained using the FL technique. To assess the information leakage of SER systems trained using FL, we propose an attribute inference attack framework that infers sensitive attribute information of the clients from shared gradients or model parameters, corresponding to the FedSGD and the FedAvg training algorithms, respectively. As a use case, we empirically evaluate our approach for predicting the client's gender information using three SER benchmark datasets: IEMOCAP, CREMA-D, and MSP-Improv. We show that the attribute inference attack is achievable for SER systems trained using FL. We further identify that most information leakage possibly comes from the first layer in the SER model.
Social networks are the persons surrounding a patient who provide support, circulate information, and influence health behaviors. For patients seen by neurologists, social networks are one of the most proximate social determinants of health that are actually accessible to clinicians, compared with wider social forces such as structural inequalities. We can measure social networks and related phenomena of social connection using a growing set of scalable and quantitative tools increasing familiarity with social network effects and mechanisms. This scientific approach is built on decades of neurobiological and psychological research highlighting the impact of the social environment on physical and mental well-being, nervous system structure, and neuro-recovery. Here, we review the biology and psychology of social networks, assessment methods including novel social sensors, and the design of network interventions and social therapeutics.
This study is a part of a research effort to develop the Questionnaire for User Interface Satisfaction (QUIS). Participants, 150 PC user group members, rated familiar software products. Two pairs of software categories were compared: 1) software that was liked and disliked, and 2) a standard command line system (CLS) and a menu driven application (MDA). The reliability of the questionnaire was high, Cronbach’s alpha=.94. The overall reaction ratings yielded significantly higher ratings for liked software and MDA over disliked software and a CLS, respectively. Frequent and sophisticated PC users rated MDA more satisfying, powerful and flexible than CLS. Future applications of the QUIS on computers are discussed.
Physiological mechano-acoustic signals, often with frequencies and intensities that are beyond those associated with the audible range, provide information of great clinical utility. Stethoscopes and digital accelerometers in conventional packages can capture some relevant data, but neither is suitable for use in a continuous, wearable mode, and both have shortcomings associated with mechanical transduction of signals through the skin. We report a soft, conformal class of device configured specifically for mechano-acoustic recording from the skin, capable of being used on nearly any part of the body, in forms that maximize detectable signals and allow for multimodal operation, such as electrophysiological recording. Experimental and computational studies highlight the key roles of low effective modulus and low areal mass density for effective operation in this type of measurement mode on the skin. Demonstrations involving seismocardiography and heart murmur detection in a series of cardiac patients illustrate utility in advanced clinical diagnostics. Monitoring of pump thrombosis in ventricular assist devices provides an example in characterization of mechanical implants. Speech recognition and human-machine interfaces represent additional demonstrated applications. These and other possibilities suggest broad-ranging uses for soft, skin-integrated digital technologies that can capture human body acoustics.
There has been increasing attention in the literature to wearable acoustic recording devices, particularly to examine naturalistic speech in disordered and child populations. Recordings are typically analyzed using automatic procedures that critically depend on the reliability of the collected signal. This work describes the acoustic amplitude response characteristics and the possibility of acoustic transmission loss using several shirts designed for wearable recorders. No difference was observed between the response characteristics of different shirt types or between shirts and the bare-microphone condition. Results are relevant for research, clinical, educational, and home applications in both practical and theoretical terms.
Multi-microphone arrays allow for the use of spatial filtering techniques that can greatly improve noise reduction and source separation. However, for speech and audio data, work on noise reduction or separation has focused primarily on one- or two-channel systems. Because of this, databases of multichannel environmental noise are not widely available. DEMAND (Diverse Environments Multi-channel Acoustic Noise Database) addresses this problem by providing a set of 16-channel noise files recorded in a variety of indoor and outdoor settings. The data was recorded using a planar microphone array consisting of four staggered rows, with the smallest distance between microphones being 5 cm and the largest being 21.8 cm. DEMAND is freely available under a Creative Commons license to encourage research into algorithms beyond the stereo setup.
The Texas Instruments/Massachusetts Institute of Technology (TIMIT) corpus of read speech has been designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. TIMIT contains speech from 630 speakers representing 8 major dialect divisions of American English, each speaking 10 phonetically-rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic, and word transcriptions, as well as speech waveform data for each spoken sentence. The release of TIMIT contains several improvements over the Prototype CD-ROM released in December, 1988: (1) full 630-speaker corpus, (2) checked and corrected transcriptions, (3) word-alignment transcriptions, (4) NIST SPHERE-headered waveform files and header manipulation software, (5) phonemic dictionary, (6) new test and training subsets balanced for dialectal and phonetic coverage, and (7) more extensive documentation.
A recording device called the Electronically Activated Recorder (EAR) is described. The EAR tape-records for 30 sec once every 12 min for 2–4 days. It is lightweight and portable, and it can be worn comfortably by participants in their natural environment. The acoustic data samples provide a nonobtrusive record of the language used and settings entered by the participant. Preliminary psychometric findings suggest that the EAR data accurately reflect individuals’ natural social, linguistic, and psychological lives. The data presented in this article were collected with a first-generation EAR system based on analog tape recording technology, but a second generation digital EAR is now available.
The Language ENvironment Analysis (LENA) System is a relatively new recording technology that can be used to investigate typical child language acquisition and populations with language disorders. The purpose of this paper is to familiarize language acquisition researchers and speech-language pathologists with how the LENA System is currently being used in research. The authors outline issues in peer-reviewed research based on the device. Considerations when using the LENA System are discussed.
This paper presents TarsosDSP, a framework for real-time audio analysis and processing. Most libraries and frameworks offer either audio analysis and feature extraction or audio synthesis and processing. TarsosDSP is one of only a few frameworks that offer analysis, processing, and feature extraction in real time, a unique feature in the Java ecosystem. The framework contains practical audio processing algorithms, can be extended easily, and has no external dependencies. Each algorithm is implemented as simply as possible thanks to a straightforward processing pipeline. TarsosDSP's features include a resampling algorithm, onset detectors, a number of pitch estimation algorithms, a time-stretch algorithm, a pitch-shifting algorithm, and an algorithm to calculate the Constant-Q transform. The framework also allows simple audio synthesis, some audio effects, and several filters. The open-source framework is a valuable contribution to the MIR community and an ideal fit for interactive MIR applications on Android.
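As a toy illustration of the kind of pitch estimator TarsosDSP bundles, a bare autocorrelation peak-picker can be sketched in a few lines; this is not TarsosDSP's algorithm (it ships YIN and several other estimators) nor its Java API, just the underlying idea:

```python
import math

def autocorr_pitch(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) of one audio frame by
    picking the lag with the largest autocorrelation within the
    plausible pitch range [fmin, fmax]. Returns 0.0 if no lag fits."""
    lag_min = int(sample_rate / fmax)   # shortest period considered
    lag_max = int(sample_rate / fmin)   # longest period considered
    best_lag, best_r = 0, 0.0
    for lag in range(lag_min, min(lag_max, len(frame) - 1)):
        # Unnormalized autocorrelation at this lag.
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if r > best_r:
            best_r, best_lag = r, lag
    return sample_rate / best_lag if best_lag else 0.0
```

For a pure 200 Hz sine sampled at 8 kHz, the autocorrelation peaks at a lag of 40 samples, giving the correct estimate; real estimators like YIN add normalization and interpolation to survive noise and octave errors.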
Given C samples, with ni observations in the ith sample, a test of the hypothesis that the samples are from the same population may be made by ranking the observations from 1 to Σni (giving each observation in a group of ties the mean of the ranks tied for), finding the C sums of ranks, and computing a statistic H. Under the stated hypothesis, H is distributed approximately as χ² with C − 1 degrees of freedom, unless the samples are too small, in which case special approximations or exact tables are provided. One of the most important applications of the test is in detecting differences among the population means. (Based in part on research supported by the Office of Naval Research at the Statistical Research Center, University of Chicago.)
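The H statistic described above is straightforward to compute directly. This sketch assigns mid-ranks to ties as the abstract describes, though it omits the usual tie-correction factor applied when ties are numerous:

```python
from itertools import chain

def kruskal_wallis_h(*samples):
    """Kruskal-Wallis H statistic for C samples of numeric observations.
    Under the null hypothesis, H ~ chi-square with C - 1 degrees of
    freedom (approximately, for samples that are not too small)."""
    pooled = sorted(chain.from_iterable(samples))
    n = len(pooled)
    # Mid-rank for each distinct value: the mean of the 1-based ranks
    # that a group of tied observations would occupy.
    ranks = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = ((i + 1) + j) / 2.0
        i = j
    # H = 12 / (n(n+1)) * sum(R_i^2 / n_i) - 3(n+1)
    total = sum(sum(ranks[x] for x in s) ** 2 / len(s) for s in samples)
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)
```

For three fully separated samples such as [1,2,3], [4,5,6], [7,8,9], the statistic evaluates to 7.2, exceeding the χ²(2) critical value of 5.99 at the 5% level.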