Conference Paper

TILES audio recorder: an unobtrusive wearable solution to track audio activity


Abstract

Most existing speech activity trackers used in human subject studies are bulky, record raw audio content that invades participant privacy, have complicated hardware and non-customizable software, and are too expensive for large-scale deployment. The present effort seeks to overcome these challenges by proposing the TILES Audio Recorder (TAR): an unobtrusive and scalable solution to track audio activity using an affordable miniature mobile device with an open-source app. For this recorder, we use the Unihertz Jelly Pro, a pocket-sized Android smartphone, and employ two open-source toolkits: openSMILE and TarsosDSP. TarsosDSP provides a voice activity detection (VAD) capability that triggers openSMILE to extract and save audio features only when the subject is speaking. Experiments show that performing feature extraction only during speech segments greatly increases battery life, enabling the subject to wear the recorder for up to 10 hours at a time. Furthermore, recording experiments with ground-truth clean speech show minimal distortion of the recorded features, as measured by root-mean-square error and cosine distance. The TAR app also provides subjects with a simple user interface that allows them to pause feature extraction at any time and to easily upload data to a remote server.
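The gating architecture is simple to sketch. Below is a minimal, hypothetical Python rendition of the idea: a VAD decision gates feature extraction, and the paper's two distortion metrics (root-mean-square error and cosine distance) are included for comparing recorded features against ground truth. The energy threshold and feature stand-ins are illustrative assumptions; TAR itself uses TarsosDSP's detector and openSMILE's descriptors on Android.

```python
# A minimal sketch of VAD-gated feature extraction, with assumed helpers.
import numpy as np
from scipy.spatial.distance import cosine

ENERGY_THRESHOLD = 1e-3  # crude stand-in for TarsosDSP's VAD decision

def is_speech(frame: np.ndarray) -> bool:
    """Toy energy-based VAD; TAR uses TarsosDSP's detector instead."""
    return float(np.mean(frame ** 2)) > ENERGY_THRESHOLD

def extract_features(frame: np.ndarray) -> np.ndarray:
    """Stand-in for the openSMILE low-level descriptors TAR saves."""
    return np.array([np.mean(frame ** 2), np.abs(np.fft.rfft(frame)).argmax()])

def process_stream(frames):
    """The battery-saving idea: run feature extraction only on speech frames."""
    saved = [extract_features(f) for f in frames if is_speech(f)]
    return np.array(saved)

# Distortion metrics of the kind used in the paper's validation:
def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

def cos_dist(a, b):
    return float(cosine(a, b))

# demo on synthetic frames: quiet noise, then a loud sine burst
rng = np.random.default_rng(0)
frames = [rng.normal(scale=1e-3, size=512) for _ in range(5)]
frames += [np.sin(2 * np.pi * 220 * np.arange(512) / 16000) for _ in range(5)]
print(process_stream(frames).shape)   # features saved only for voiced frames
```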


... Furthermore, at work, they were asked to wear an OMsignal smart garment [34] (a T-shirt for men and a sports bra for women) and a Unihertz Jelly Pro smartphone [35] (Jelly phone, for short) as a lapel microphone ("audio badge"). The Jelly phone was programmed to obtain audio features from the raw audio (which was discarded) [36]. In parallel, these Jelly phones also sent Bluetooth packets every second over 15 s windows every minute, to estimate their locations within the building/workplace. ...
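The advertising schedule described in this excerpt (one packet per second during a 15 s window each minute) amounts to a simple duty cycle; the helper below is a hedged illustration only, not the deployed firmware.

```python
# Illustrative sketch of the stated Bluetooth duty cycle:
# one packet per second during the first 15 s of every minute.
def advertise_schedule(total_seconds: int):
    """Yield the second offsets at which a packet would be sent (assumed)."""
    for t in range(total_seconds):
        if t % 60 < 15:       # 15 s window at the start of each minute
            yield t           # one packet per second within the window

print(list(advertise_schedule(120)))  # offsets 0..14 and 60..74
```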
... These were either clipped to participants' clothing near the neckline or placed in a shirt pocket. The cases of the Jelly phones were modified, such that the microphone pointed upwards, as described in [36], to better capture the speech data from the wearer. Participants were asked to charge the Jelly phone prior to each work shift, unlock the Jelly phone, check that the TILES Audio app [36] was running, and upload the audio data at the end of each work shift by pressing the UPLOAD DATA button in the TILES Audio app. ...
... The cases of the Jelly phones were modified, such that the microphone pointed upwards, as described in [36], to better capture the speech data from the wearer. Participants were asked to charge the Jelly phone prior to each work shift, unlock the Jelly phone, check that the TILES Audio app [36] was running, and upload the audio data at the end of each work shift by pressing the UPLOAD DATA button in the TILES Audio app. Each Jelly phone was linked to the TILES app on each participant's mobile device by scanning a QR code in the TILES app. ...
Preprint
We present a novel longitudinal multimodal corpus of physiological and behavioral data collected from direct clinical providers in a hospital workplace. We designed the study to investigate the use of off-the-shelf wearable and environmental sensors to understand individual-specific constructs such as job performance, interpersonal interaction, and well-being of hospital workers over time in their natural day-to-day job settings. We collected behavioral and physiological data from n = 212 participants through Internet-of-Things Bluetooth data hubs, wearable sensors (including a wristband, a biometrics-tracking garment, a smartphone, and an audio-feature recorder), together with a battery of surveys to assess personality traits, behavioral states, job performance, and well-being over time. Besides the default use of the data set, we envision several novel research opportunities and potential applications, including multi-modal and multi-task behavioral modeling, authentication through biometrics, and privacy-aware and privacy-preserving machine learning.
... In this paper, we address the task of identifying and characterizing dynamically varying acoustic scenes in a workplace setting from egocentric audio recordings obtained through audio recorders worn as badges by individuals [13]. There are three fundamental differences between the task at hand and standard ASC tasks [1,6]. ...
... There are three fundamental differences between the task at hand and standard ASC tasks [1,6]. First, to get an egocentric view, we employ wearable microphones for audio feature collection [13]. The employees wear their audio recorders throughout multiple work shifts. ...
... We collected multi-modal sensory data (audio, physiology, continuous location, etc.) from 350 nurses and other direct clinical providers in a critical care hospital. The data was collected through audio badges [13] developed in-house, which the participants wore during their work shifts. Each participant went through the data collection procedure in multiple work shifts, each typically lasting from 8 to 12 hours. ...
Preprint
Devices capable of detecting and categorizing acoustic scenes have numerous applications, such as providing context-aware user experiences. In this paper, we address the task of characterizing acoustic scenes in a workplace setting from audio recordings collected with wearable microphones. The acoustic scenes, tracked with Bluetooth transceivers, vary dynamically with time from the egocentric perspective of a mobile user. Our dataset contains experience-sampled long audio recordings collected from clinical providers in a hospital, who wore the audio badges during multiple work shifts. To handle the long egocentric recordings, we propose a Time Delay Neural Network (TDNN)-based segment-level modeling approach. The experiments show that the TDNN outperforms other models in the acoustic scene classification task. We investigate the effect of the primary speaker's speech in determining acoustic scenes from audio badges, and provide a comparison between the performance of different models. Moreover, we explore the relationship between the sequence of acoustic scenes experienced by the users and the nature of their jobs, and find that the scene sequences predicted by our model tend to exhibit a similar relationship. These initial promising results reveal numerous research directions for acoustic scene classification via wearable devices, as well as for egocentric analysis of dynamic acoustic scenes encountered by users.
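A minimal sketch of a TDNN-style segment model of the kind named here, in PyTorch; the layer sizes, dilations, and statistics-pooling head are illustrative assumptions rather than the authors' configuration.

```python
# Hedged TDNN sketch: dilated 1-D convolutions over log-mel frames,
# statistics pooling, then per-segment scene logits.
import torch
import torch.nn as nn

class TDNN(nn.Module):
    def __init__(self, n_mels=40, n_scenes=10):
        super().__init__()
        self.net = nn.Sequential(
            # growing dilation widens the temporal context per layer
            nn.Conv1d(n_mels, 64, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, dilation=3), nn.ReLU(),
        )
        self.fc = nn.Linear(128, n_scenes)

    def forward(self, x):                     # x: (batch, n_mels, frames)
        h = self.net(x)                       # (batch, 64, frames')
        # statistics pooling: mean + std over time -> fixed-length vector
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        return self.fc(stats)                 # scene logits per segment

logits = TDNN()(torch.randn(8, 40, 300))      # e.g., 8 three-second segments
```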
... Jelly phone for short) as a lapel microphone (or "audio badge"). The Jelly phone was programmed to obtain audio features from the raw audio (which was discarded) [32]. ...
... The cases of the Jelly phones were modified such that the microphone pointed upwards, as described in [32], to better capture the speech data from the wearer. Participants were asked to charge the Jelly phone prior to each work shift, unlock the Jelly phone, check that the TILES Audio app [32] was running, and upload the audio data at the end of each work shift by pressing the UPLOAD DATA button in the TILES Audio app. ...
... We presented an analysis of the audio recorder in [32]. TAR primarily extracts the audio features using openSMILE [42]. ...
Article
Full-text available
Measurement(s): Overall Sleep Quality Rating • Step Unit of Distance • Speech • Mean Heart Rate • Proximity • Electrocardiogram Sequence • heart rate variability measurement • Respiratory Rate • physical activity measurement • light • door motion • Changes in Ambient Temperature in Medical Device Environment • humidity • Overall Emotional Well-Being • Stress • psychological flexibility • work-related acceptance • work engagement • psychological capital • intelligence • job performance • organizational citizenship behavior • counter-productive work behavior • personality trait measurement • Negative affectivity • positive affectivity • anxiety-related behavior trait • Alcohol Use History • Overall Health Rating During Past Week
Technology Type(s): photoplethysmography • Accelerometer • Microphone Device • Bluetooth-enabled Activity Monitor • electrocardiogram • Sensor Device • Photodetector Device • Temperature Sensor Device • questionnaire • Multidimensional Psychological Flexibility Inventory (MPFI) • Utrecht work engagement scale • survey method • individual task proficiency • Organizational Citizenship Behavior Checklist • big five inventory • Positive and Negative Affect Schedule (PANAS-X) • State-Trait Anxiety Inventory
Sample Characteristic - Organism: Homo sapiens
Sample Characteristic - Environment: hospital
Machine-accessible metadata file describing the reported data: 10.6084/m9.figshare.12465101
... Others have also shown changes in speech jitter and in mel-frequency cepstral coefficients (MFCC) [11]. Portable audio solutions have been developed to allow for unobtrusive audio activity monitoring, thus opening the doors for wearable audio solutions [12]. ...
... Participants were asked to wear a Fitbit Charge 2 wristband at all times to collect their heart rate, sleep quality, and activity. In addition, at work, they were asked to wear a Unihertz Jelly Pro smartphone that was programmed to act as a personal audio feature recorder, termed the TILES Audio Recorder (TAR) [12]. Due to privacy concerns, audio features were extracted on the device and the actual audio was discarded. ...
Conference Paper
Hospital workers are known to work long hours in a highly stressful environment. The COVID-19 pandemic has increased this burden multi-fold. Pre-COVID statistics already showed that one in every three nurses reported burnout, thus affecting patient satisfaction and the quality of their provided service. Real-time monitoring of burnout, and other underlying factors, such as stress, could provide feedback not only to the clinical staff, but also to hospital administrators, thus allowing for supportive measures to be taken early. In this paper, we present a context-aware speech-based system for stress detection. We consider data from 144 hospital workers who were monitored during their daily shifts over a 10-week period; subjective stress readings were collected daily. Wearable devices measured speech features and physiological readings, such as heart rate. Environment sensors, in turn, were used to track staff movement within the hospital. Here, we show the importance of context-awareness for stress level detection based on a bidirectional LSTM deep neural network. In particular, we show the importance of hospital location and circadian rhythm based contextual cues for stress prediction. Overall, we show improvements as high as 14% in F1 scores once context is incorporated, relative to using the speech features alone.
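As an illustration of the modeling idea (not the authors' exact network), a bidirectional LSTM can consume speech features concatenated per time step with contextual cues such as a hospital-location encoding and a circadian encoding of time of day; all dimensions below are assumptions.

```python
# Hedged sketch of context-aware stress detection with a BiLSTM.
import torch
import torch.nn as nn

class ContextBiLSTM(nn.Module):
    def __init__(self, n_speech=88, n_context=12, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_speech + n_context, hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)    # stressed vs. not stressed

    def forward(self, speech_seq, context_seq):
        x = torch.cat([speech_seq, context_seq], dim=-1)  # fuse per step
        out, _ = self.lstm(x)
        return self.head(out[:, -1])            # classify from last step

# circadian context can be encoded as sin/cos of the hour, e.g.
# hour_feat = [sin(2*pi*h/24), cos(2*pi*h/24)]
logits = ContextBiLSTM()(torch.randn(4, 120, 88), torch.randn(4, 120, 12))
```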
... Researchers should be aware that recording raw audio may not be permitted without the express permission from all participants and bystanders, so anonymized audio features may need to be extracted and stored instead. If human-produced audio is of primary interest, smartphone apps have been developed and used by researchers like the Electronically Activated Recorder [21] and TILES Audio Recorder [22] which are described more in the Wearable Sensors section. ...
... Voice: Different wearable devices have been proposed by researchers for understanding emotions and other aspects of speech in social situations, such as the Sociometer [36], the EAR [21] and subsequent iEAR app, and the TILES audio recorder (TAR) [22] app. Privacy is a major concern when audio recordings are collected in public settings and thus some apps, like TAR, are designed to only collect and record acoustic features from human-produced audio. ...
Preprint
Recent advances in mobile technologies for sensing human biosignals are empowering researchers to collect real-world data outside of the laboratory. This free-range data captures more authentic representations of behaviors and activities because it occurs in natural settings where participants are able to perform their daily activities with minimal disruption. The huge gains in data fidelity that these mobile technologies create also usher in a host of challenges and constraints for both researchers and participants that require careful consideration and planning to minimize both direct and indirect effects on data quality. To help confront these challenges, we outline a general data collection framework for modern portable sensors and describe the criteria impacting sensor selection, management, and integration that researchers should consider before beginning a human behavior study. We also provide a snapshot of modern consumer technologies for sensing human behavior in natural settings and detail a case study where the described challenges are overcome to produce a high-caliber data set gathered semi-transparently from workers in a hospital work environment.
... A typical centralized SER system has three parts: data acquisition, data transfer, and emotion classification [5]. Under this framework, the client typically shares the raw speech samples, or the acoustic features derived from them (to obfuscate the actual content of the conversation), with the remote cloud servers for emotion recognition [6]. However, the same speech signal carries rich information about individual traits (e.g., age, gender) and states (e.g., health status), many of which can be deemed sensitive from an application point of view. ...
Preprint
Full-text available
Speech emotion recognition (SER) processes speech signals to detect and characterize expressed perceived emotions. Many SER application systems often acquire and transmit speech data collected at the client-side to remote cloud platforms for inference and decision making. However, speech data carry rich information not only about emotions conveyed in vocal expressions, but also other sensitive demographic traits such as gender, age and language background. Consequently, it is desirable for SER systems to have the ability to classify emotion constructs while preventing unintended/improper inferences of sensitive and demographic information. Federated learning (FL) is a distributed machine learning paradigm that coordinates clients to train a model collaboratively without sharing their local data. This training approach appears secure and can improve privacy for SER. However, recent works have demonstrated that FL approaches are still vulnerable to various privacy attacks like reconstruction attacks and membership inference attacks. Although most of these have focused on computer vision applications, such information leakages exist in the SER systems trained using the FL technique. To assess the information leakage of SER systems trained using FL, we propose an attribute inference attack framework that infers sensitive attribute information of the clients from shared gradients or model parameters, corresponding to the FedSGD and the FedAvg training algorithms, respectively. As a use case, we empirically evaluate our approach for predicting the client's gender information using three SER benchmark datasets: IEMOCAP, CREMA-D, and MSP-Improv. We show that the attribute inference attack is achievable for SER systems trained using FL. We further identify that most information leakage possibly comes from the first layer in the SER model.
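The core attack idea can be sketched compactly: treat each client's shared update (e.g., a flattened first-layer gradient) as a feature vector and fit an inference model on auxiliary labeled data. The snippet below is a toy reconstruction on synthetic gradients, not the authors' framework; all shapes and the injected leakage are assumptions.

```python
# Toy attribute-inference sketch against shared FL updates.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_clients, grad_dim = 200, 512
grads = rng.normal(size=(n_clients, grad_dim))   # observed client updates
gender = rng.integers(0, 2, size=n_clients)      # attacker's auxiliary labels
grads[gender == 1, :8] += 0.5                    # synthetic leakage to exploit

# Train on 150 "auxiliary" clients, attack the remaining 50.
attack = LogisticRegression(max_iter=1000).fit(grads[:150], gender[:150])
print("inference accuracy:", attack.score(grads[150:], gender[150:]))
```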
... Participants were outfitted with multiple wearable sensors to collect a variety of biometric data, including audio features, heart rate, respiratory rate, and sleep quality. A custom audiometric badge was used, as detailed in [20], along with a Fitbit Charge 2 and an OMsignal smartshirt. In this paper, only the RR time series measured by the OMsignal smartshirt are used. ...
Conference Paper
Heart rate variability (HRV) has been studied in the context of human behavior analysis and many features have been extracted from the inter-beat interval (RR) time series and tested as correlates of constructs such as mental workload, stress and anxiety. Most studies, however, have been conducted in controlled laboratory environments with artificially-induced psychological responses. While this assures that high quality data are collected, the amount of data is limited and the transferability of the findings to more ecologically-appropriate settings (i.e., "in-the-wild") remains unknown. In this paper, we explore the use of motif-based multi-scale HRV features to predict anxiety and stress in-the-wild. To further improve their robustness to artifacts, we propose a quality-aware feature aggregation method. The new quality-aware features are tested on a dataset collected using a wearable biometric sensor from over 200 hospital workers (nurses and staff) during their work shifts. Results show improved stress/anxiety measurement over using conventional time- and frequency-domain HRV measures.
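The aggregation idea lends itself to a one-function sketch: weight each analysis window's HRV features by a signal-quality score so that artifact-laden windows contribute less. The quality-scoring step itself is assumed here, not shown.

```python
# Hedged sketch of quality-aware feature aggregation over windows.
import numpy as np

def quality_aware_aggregate(features: np.ndarray, quality: np.ndarray) -> np.ndarray:
    """features: (windows, n_features); quality: (windows,) scores in [0, 1]."""
    w = quality / (quality.sum() + 1e-12)        # normalize to weights
    return (w[:, None] * features).sum(axis=0)   # quality-weighted mean

agg = quality_aware_aggregate(np.random.rand(20, 5), np.random.rand(20))
```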
... Testers for the Jelly phone, the sensor that was to track audio features, included team members and volunteers who all worked in a university setting. The Jelly phone was initially programmed for voice activity detection (VAD), i.e., to turn on audio capture once it detected a voice [36]; the battery lasted for approximately 10-12 hours for our pilot testers. ...
Preprint
While traditional methods of data collection in naturalistic settings can shed light on constructs of interest to researchers, advances in sensor-based technology allow researchers to capture continuous physiological and behavioral data to provide a more comprehensive understanding of the constructs that are examined in a dynamic healthcare setting. This paper gives examples for implementing technology-facilitated approaches and provides the following recommendations for conducting such longitudinal research in a healthcare setting: pilot test early and often; build trust both with key stakeholders and with potential participants; employ multiple, sample-specific recruitment methods; develop various strategies to sustain and enhance participant compliance; and adopt a flexible approach to project management. The paper describes how these recommendations were successfully implemented by providing examples from a large-scale, sensor-based, longitudinal study of hospital employees. The knowledge gained from this paper may be helpful to researchers interested in obtaining dynamic, longitudinal data in a healthcare setting to obtain a more comprehensive understanding of constructs of interest in an ecologically valid, secure, and efficient way.
... Researchers must explain the issues and enable participants to make informed decisions about participation and use of their data. Some devices, such as the TILES recorder (Feng et al., 2018), provide investigators the flexibility to extract and collect data on only specific features (e.g., speech onset and offset time, pitch estimates) from the audio input. Collecting feature data alone may better preserve privacy but may not be suitable for every research question and limits reanalysis when improved automatic audio-processing tools become available. ...
Article
Full-text available
The sounds of human infancy—baby babbling, adult talking, lullaby singing, and more—fluctuate over time. Infant-friendly wearable audio recorders can now capture very large quantities of these sounds throughout infants’ everyday lives at home. Here, we review recent discoveries about how infants’ soundscapes are organized over the course of a day. Analyses designed to detect patterns in infants’ daylong audio at multiple timescales have revealed that everyday vocalizations are clustered hierarchically in time, that vocal explorations are consistent with foraging dynamics, and that some musical tunes occur for much longer cumulative durations than others. This approach focusing on the multiscale distributions of sounds heard and produced by infants is providing new, fundamental insights on human communication development from a complex-systems perspective.
... Since we focus in particular on motif analysis of proximity data and prediction of mental wellness measures, we only provide details about these data streams. Participant proximity to different locations within each nursing unit is tracked using Bluetooth advertisement packets sent from a smartphone system [15] that participants wear while at work; the packets are picked up by Bluetooth hubs installed throughout the hospital. These hubs are located in key rooms including patient rooms, medicine rooms, lounge and break rooms, nursing stations (computer desks), and laboratories. ...
... The model achieved 82.6% accuracy, 80.2% SHR, and 14.9% FAR. For online evaluation, we played 15 minutes of audio collected from a naturalistic context through a loudspeaker while the VADLite app performed real-time classification of the audio, just as was done by Feng et al. [25]. VADLite had an SHR and FAR of 91.6% and 5.5%, respectively. ...
Preprint
Full-text available
Dyadic interactions of couples are of interest as they provide insight into relationship quality and chronic disease management. Currently, ambulatory assessment of couples' interactions entails collecting data at random or scheduled times, which can miss significant interaction/conversation moments. In this work, we developed, deployed, and evaluated DyMand, a novel open-source smartwatch and smartphone system for collecting self-report and sensor data from couples based on partners' interaction moments. Our smartwatch-based algorithm uses the Bluetooth signal strength between two smartwatches, each worn by one partner, together with a voice activity detection machine-learning algorithm to infer that the partners are interacting, and then triggers data collection. We deployed the DyMand system in a 7-day field study and collected data about social support, emotional well-being, and health behavior from 13 (N=26) Swiss-based heterosexual couples managing diabetes mellitus type 2 of one partner. Our system triggered 99.1% of the expected sensor and self-report recordings when the app was running, and 77.6% of algorithm-triggered recordings contained partners' conversation moments, compared to 43.8% for scheduled triggers. The usability evaluation showed that DyMand was easy to use. DyMand can be used by social, clinical, or health psychology researchers to understand the social dynamics of couples in everyday life, and for developing and delivering behavioral interventions for couples who are managing chronic diseases.
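The trigger logic described here reduces to a conjunction of two conditions: the partner's watch is nearby (strong Bluetooth RSSI) and voice activity is detected. A toy sketch follows; the threshold value and helper names are assumptions, not DyMand's implementation.

```python
# Hedged sketch of a proximity-plus-voice trigger condition.
RSSI_NEAR_DBM = -70   # stronger (less negative) than this counts as "near"

def should_trigger(rssi_dbm: float, voice_detected: bool) -> bool:
    """Trigger data collection only when both conditions hold."""
    return rssi_dbm > RSSI_NEAR_DBM and voice_detected

print(should_trigger(-62, True))   # True: near and talking
print(should_trigger(-85, True))   # False: talking but not near
```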
... Practically, it is helpful that the program's rate of false negative decisions (i.e., human voice misclassified as no voice) tends to be quite low relative to the rate of false positive decisions (i.e., no voice misclassified as voice). Future research is necessary to improve automatic conversation detection algorithms, for example, through incorporation of foreground and background speech recognition (Feng et al., 2018; Nadarajan et al., 2019) and implementation of speaker identification (e.g., to automatically differentiate the participant's voice from bystanders' voices). ...
Chapter
Full-text available
As proposed by the healthy aging model of the World Health Organization (2015), participation in social and cognitive activities is important for healthy aging and for understanding individual differences in personality and health trajectories. This chapter introduces a naturalistic observation method for assessing social and cognitive activities in everyday life: Electronically Activated Recorder (EAR) – a portable audio recorder that periodically and unobtrusively records ambient sounds and speech in everyday life. Using this method, researchers have reliably and objectively measured social activities (e.g., conversing, engaging in small talk or substantive conversations) and cognitive activities (e.g., reminiscing, planning the future, grammatical complexity) in everyday life. We reviewed studies that examined interindividual and intraindividual differences in these activities. Furthermore, we discussed how this behavioural evidence could help to understand personality and health in old age. In sum, the EAR method offers useful real-life, personality-related data for the healthy aging literature.
... To evaluate the real-time performance of VADLite with real-world data, we recorded audio data from a naturalistic context. We then played the recorded audio through a loudspeaker while the VADLite app performed real-time classification of the audio, just as was done by Feng et al. [11]. The audio had a duration of 15 minutes each for speech and noise. ...
Conference Paper
Full-text available
Smartwatches provide a unique opportunity to collect more speech data because they are always with the user and also have a more exposed microphone compared to smartphones. Speech data could be used to infer various indicators of mental well-being, such as emotions, stress, and social activity. Hence, real-time voice activity detection (VAD) on smartwatches could enable the development of applications for mental health monitoring. In this work, we present VADLite, an open-source, lightweight system that performs real-time VAD on smartwatches. It extracts mel-frequency cepstral coefficients and classifies speech versus non-speech audio samples using a linear Support Vector Machine. The real-time implementation is done on the Wear OS Polar M600 smartwatch. An offline and online evaluation of VADLite using real-world data showed better performance than WebRTC's open-source VAD system. VADLite can be easily integrated into Wear OS projects that need a lightweight VAD module running on a smartwatch.
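The pipeline is simple enough to sketch end to end with common Python tools (librosa for MFCCs, scikit-learn for the linear SVM). The file names and 16 kHz rate are assumptions, and the on-watch implementation is of course not Python; this only mirrors the feature-plus-classifier structure.

```python
# Hedged sketch of an MFCC + linear-SVM speech/non-speech classifier.
import librosa
import numpy as np
from sklearn.svm import LinearSVC

def mfcc_frames(path: str) -> np.ndarray:
    """Per-frame 13-dimensional MFCC features."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # (frames, 13)

speech = mfcc_frames("speech.wav")   # assumed local example files
noise = mfcc_frames("noise.wav")
X = np.vstack([speech, noise])
y = np.concatenate([np.ones(len(speech)), np.zeros(len(noise))])

clf = LinearSVC().fit(X, y)          # 1 = speech, 0 = non-speech
```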
... Participants were outfitted with multiple wearable devices to collect a variety of biometric data such as vocal audio features [11], heart rate, respiration rate, and sleep quality, among others. Heart rate data were acquired simultaneously with an OMsignal smart-shirt and a Fitbit Charge 2 smart-bracelet. ...
Conference Paper
Full-text available
In recent years, consumer wearable devices focused on health assessment have gained popularity. Of these devices, a large number target monitoring heart rate; a few among them include additional biometrics such as breathing rate, galvanic skin response, and skin temperature. Heart rate, and more specifically, heart rate variability (HRV) measures have proven useful in monitoring user psychological states, such as mental workload, stress, and anxiety. Most studies, however, have been conducted in controlled laboratory environments with artificially-induced psychological responses. While these conditions assure high quality in the collected data, the amount of data is limited and the generalization of the findings to more ecologically-appropriate settings remains unknown. To this end, in this paper we compare the accuracy of two wearable devices, namely a smart-shirt measuring electrocardiograms and a smart-bracelet measuring photoplethysmograms. Several HRV features are extracted and tested as correlates of stress and anxiety. Data were collected from 196 participants during their normal work shifts for a period of 10 weeks. The complementarity of the two devices is also explored and the advantages of each method are discussed.
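For readers unfamiliar with the feature family, here is a minimal sketch of standard time-domain HRV features computed from an RR (inter-beat interval) series in milliseconds; the definitions are textbook-standard rather than specific to this paper.

```python
# Common time-domain HRV features from an RR series (milliseconds).
import numpy as np

def hrv_time_domain(rr_ms: np.ndarray) -> dict:
    diff = np.diff(rr_ms)
    return {
        "SDNN": float(np.std(rr_ms, ddof=1)),             # overall variability
        "RMSSD": float(np.sqrt(np.mean(diff ** 2))),      # short-term variability
        "pNN50": float(np.mean(np.abs(diff) > 50) * 100), # % successive diffs > 50 ms
    }

print(hrv_time_domain(np.array([812, 790, 845, 830, 802, 910, 788.0])))
```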
Article
Although traditional methods of data collection in naturalistic settings can shed light on constructs of interest to researchers, advances in sensor-based technology allow researchers to capture continuous physiological and behavioral data to provide a more comprehensive understanding of the constructs that are examined in a dynamic health care setting. This study gives examples for implementing technology-facilitated approaches and provides the following recommendations for conducting such longitudinal, sensor-based research, with both environmental and wearable sensors in a health care setting: pilot test sensors and software early and often; build trust with key stakeholders and with potential participants who may be wary of sensor-based data collection and concerned about privacy; generate excitement for novel, new technology during recruitment; monitor incoming sensor data to troubleshoot sensor issues; and consider the logistical constraints of sensor-based research. The study describes how these recommendations were successfully implemented by providing examples from a large-scale, longitudinal, sensor-based study of hospital employees at a large hospital in California. The knowledge gained from this study may be helpful to researchers interested in obtaining dynamic, longitudinal sensor data from both wearable and environmental sensors in a health care setting (eg, a hospital) to obtain a more comprehensive understanding of constructs of interest in an ecologically valid, secure, and efficient way.
Article
Identification of the acoustic environment from an audio recording, also known as acoustic scene classification, is an active area of research. In this paper, we study dynamically-changing background acoustic scenes from the egocentric perspective of an individual in a workplace. In a novel data collection setup, wearable sensors were deployed on individuals to collect audio signals within a built environment, while Bluetooth-based hubs continuously tracked each individual's location, which represents the acoustic scene at a certain time. The data of this paper come from 170 hospital workers, gathered continuously during work shifts over a 10-week period. In the first part of our study, we investigate temporal patterns in the egocentric sequence of acoustic scenes encountered by an employee, and the association of those patterns with factors such as the job-role and daily routine of the individual. Motivated by evidence of multifaceted effects of ambient sounds on human psychology, we also analyze the association of the temporal dynamics of the perceived acoustic scenes with particular behavioral traits of the individual. Experiments reveal rich temporal patterns in the acoustic scenes experienced by the individuals during their work shifts, and a strong association of those patterns with various constructs related to the job-roles and behavior of the employees. In the second part of our study, we employ deep learning models to predict the temporal sequence of acoustic scenes from the egocentric audio signal. We propose a two-stage framework in which a recurrent neural network is trained on top of latent acoustic representations learned by a segment-level neural network. The experimental results show the efficacy of the proposed system in predicting sequences of acoustic scenes, highlighting the existence of underlying temporal patterns in the acoustic scenes experienced in the workplace.
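The two-stage framework can be sketched as a recurrent model running over precomputed segment embeddings; the embedding dimension, hidden size, and scene count below are illustrative assumptions.

```python
# Hedged sketch of stage two: a GRU over segment-level embeddings
# produced by a (separately trained) segment encoder.
import torch
import torch.nn as nn

class SceneSequenceModel(nn.Module):
    def __init__(self, emb_dim=128, hidden=64, n_scenes=10):
        super().__init__()
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_scenes)

    def forward(self, segment_embeddings):     # (batch, segments, emb_dim)
        out, _ = self.rnn(segment_embeddings)
        return self.head(out)                  # a scene prediction per segment

logits = SceneSequenceModel()(torch.randn(4, 50, 128))  # 4 shifts, 50 segments
```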
Conference Paper
As hospital workers face a growing number of patients and have to meet increasingly rigorous standards of care, their ability to successfully modulate their emotional reactions and flexibly handle stress presents a significant challenge. This paper examines a multimodal signal-driven way to quantify emotion self-regulation and stress spillover through a dynamical systems model (DSM). The proposed DSM models day-to-day changes of emotional arousal, captured through speech, physiology, and daily activity measures, and its interplay with daily stress. The parameters of the DSM quantify the degree of self-regulation and stress spillover, and are associated with work performance and cognitive ability in a multimodal dataset of 130 full-time hospital workers recorded over a 10-week period. Linear regression experiments indicate the effectiveness of the proposed features to reliably estimate individuals' work performance and cognitive ability, providing significantly higher Pearson's correlations compared to aggregate measures of emotional arousal. Results from this study demonstrate the importance of quantifying oscillatory behaviors from longitudinal ambulatory signals and can potentially deepen our understanding of emotion self-regulation and stress spillover using signal-driven measurements, which complement self-reports and provide estimates of the psychological constructs of interest in a fine-grained time resolution.
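One hedged way to write down such a dynamical systems model is as a first-order difference equation in which arousal decays toward a baseline (self-regulation) while daily stress feeds in (spillover). This parameterization is a plausible reconstruction for illustration, not necessarily the paper's exact formulation.

```latex
% a_t: emotional arousal on day t;  s_t: daily stress;  a^*: baseline arousal
% \alpha: self-regulation (rate of return to baseline);  \beta: stress spillover
a_{t+1} = a_t + \alpha \left( a^{*} - a_t \right) + \beta \, s_t + \varepsilon_t
```

Under this reading, larger \(\alpha\) means faster recovery of arousal toward baseline, while larger \(\beta\) means daily stress spills over more strongly into next-day arousal; the fitted coefficients then serve as the individual-level features related to work performance and cognitive ability.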
Conference Paper
The work of nurses is often associated with elevated anxiety, negative affect, and fatigue, all of which may impact both the quality of patient care and their own well-being. It is critical to understand behavioral patterns, such as human movement, that may be associated with these workplace challenges of nurses. These movement behaviors include location-based movement patterns and dynamical changes of movement intensity. In particular, we investigated these movement-related patterns for 75 nurses, using wearable sensor recordings collected over a continuous period of ten weeks. We first discover the location of movement patterns from the Bluetooth proximity data using topic models. We then extract the heart rate zone features from PPG readings to infer the intensity of physical movement. Our results show that the location movement patterns and dynamical changes of movement intensity offer key insights into understanding the workplace behavior of the nursing population in a complex hospital setting.
Article
Full-text available
In recent years, machine learning techniques have been employed to produce state-of-the-art results in several audio-related tasks. The success of these approaches has been largely due to access to large amounts of open-source datasets and enhanced computational resources. However, a shortcoming of these methods is that they often fail to generalize well to tasks from real-life scenarios, due to domain mismatch. One such task is foreground speech detection from wearable audio devices. Several interfering factors, such as dynamically varying environmental conditions, including background speakers, TV, or radio audio, render foreground speech detection a challenging task. Moreover, obtaining precise moment-to-moment annotations of audio streams for analysis and model training is time-consuming and costly. In this work, we use multiple instance learning (MIL) to facilitate the development of such models using annotations available at a lower time-resolution (coarsely labeled). We show how MIL can be applied to localize foreground speech in coarsely labeled audio, and present both bag-level and instance-level results. We also study different pooling methods and how they can be adapted to densely distributed events, as observed in our application. Finally, we show improvements using speech activity detection embeddings as features for foreground detection.
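The MIL setup can be sketched with an instance scorer and a pooling step that maps per-frame scores to a single coarse (bag-level) label. The small network and max pooling here are illustrative; the paper also studies softer pooling choices better suited to densely distributed events.

```python
# Hedged MIL sketch: per-frame scores pooled to one bag prediction.
import torch
import torch.nn as nn

class MILForeground(nn.Module):
    def __init__(self, feat_dim=40):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(),
                                    nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, bag):                 # bag: (frames, feat_dim)
        p = self.scorer(bag).squeeze(-1)    # per-instance probabilities
        # max pooling: the bag is positive if any instance is positive;
        # mean or attention pooling suits densely distributed events better
        return p.max(), p                   # bag probability + localization

bag_prob, frame_probs = MILForeground()(torch.randn(500, 40))
```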
Article
Social networks are the persons surrounding a patient who provide support, circulate information, and influence health behaviors. For patients seen by neurologists, social networks are one of the most proximate social determinants of health that are actually accessible to clinicians, compared with wider social forces such as structural inequalities. We can measure social networks and related phenomena of social connection using a growing set of scalable and quantitative tools increasing familiarity with social network effects and mechanisms. This scientific approach is built on decades of neurobiological and psychological research highlighting the impact of the social environment on physical and mental well-being, nervous system structure, and neuro-recovery. Here, we review the biology and psychology of social networks, assessment methods including novel social sensors, and the design of network interventions and social therapeutics.
Article
Full-text available
Physiological mechano-acoustic signals, often with frequencies and intensities that are beyond those associated with the audible range, provide information of great clinical utility. Stethoscopes and digital accelerometers in conventional packages can capture some relevant data, but neither is suitable for use in a continuous, wearable mode, and both have shortcomings associated with mechanical transduction of signals through the skin. We report a soft, conformal class of device configured specifically for mechano-acoustic recording from the skin, capable of being used on nearly any part of the body, in forms that maximize detectable signals and allow for multimodal operation, such as electrophysiological recording. Experimental and computational studies highlight the key roles of low effective modulus and low areal mass density for effective operation in this type of measurement mode on the skin. Demonstrations involving seismocardiography and heart murmur detection in a series of cardiac patients illustrate utility in advanced clinical diagnostics. Monitoring of pump thrombosis in ventricular assist devices provides an example in characterization of mechanical implants. Speech recognition and human-machine interfaces represent additional demonstrated applications. These and other possibilities suggest broad-ranging uses for soft, skin-integrated digital technologies that can capture human body acoustics.
Article
Full-text available
There has been increasing attention in the literature to wearable acoustic recording devices, particularly to examine naturalistic speech in disordered and child populations. Recordings are typically analyzed using automatic procedures that critically depend on the reliability of the collected signal. This work describes the acoustic amplitude response characteristics and the possibility of acoustic transmission loss using several shirts designed for wearable recorders. No difference was observed between the response characteristics of different shirt types or between shirts and the bare-microphone condition. Results are relevant for research, clinical, educational, and home applications in both practical and theoretical terms.
Article
Full-text available
This study is a part of a research effort to develop the Questionnaire for User Interface Satisfaction (QUIS). Participants, 150 PC user group members, rated familiar software products. Two pairs of software categories were compared: 1) software that was liked and disliked, and 2) a standard command line system (CLS) and a menu driven application (MDA). The reliability of the questionnaire was high, Cronbach's alpha=.94. The overall reaction ratings yielded significantly higher ratings for liked software and MDA over disliked software and a CLS, respectively. Frequent and sophisticated PC users rated MDA more satisfying, powerful and flexible than CLS. Future applications of the QUIS on computers are discussed.
Article
Full-text available
Multi-microphone arrays allow for the use of spatial filtering techniques that can greatly improve noise reduction and source separation. However, for speech and audio data, work on noise reduction or separation has focused primarily on one- or two-channel systems. Because of this, databases of multichannel environmental noise are not widely available. DEMAND (Diverse Environments Multi-channel Acoustic Noise Database) addresses this problem by providing a set of 16-channel noise files recorded in a variety of indoor and outdoor settings. The data was recorded using a planar microphone array consisting of four staggered rows, with the smallest distance between microphones being 5 cm and the largest being 21.8 cm. DEMAND is freely available under a Creative Commons license to encourage research into algorithms beyond the stereo setup.
Article
Full-text available
The Texas Instruments/Massachusetts Institute of Technology (TIMIT) corpus of read speech has been designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. TIMIT contains speech from 630 speakers representing 8 major dialect divisions of American English, each speaking 10 phonetically-rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic, and word transcriptions, as well as speech waveform data for each spoken sentence. The release of TIMIT contains several improvements over the Prototype CD-ROM released in December, 1988: (1) full 630-speaker corpus, (2) checked and corrected transcriptions, (3) word-alignment transcriptions, (4) NIST SPHERE-headered waveform files and header manipulation software, (5) phonemic dictionary, (6) new test and training subsets balanced for dialectal and phonetic coverage, and (7) more extensive documentation.
Article
Full-text available
A recording device called the Electronically Activated Recorder (EAR) is described. The EAR tape-records for 30 sec once every 12 min for 2–4 days. It is lightweight and portable, and it can be worn comfortably by participants in their natural environment. The acoustic data samples provide a nonobtrusive record of the language used and settings entered by the participant. Preliminary psychometric findings suggest that the EAR data accurately reflect individuals’ natural social, linguistic, and psychological lives. The data presented in this article were collected with a first-generation EAR system based on analog tape recording technology, but a second generation digital EAR is now available.
Conference Paper
Full-text available
We introduce the openSMILE feature extraction toolkit, which unites feature extraction algorithms from the speech processing and the Music Information Retrieval communities. Audio low-level descriptors such as CHROMA and CENS features, loudness, Mel-frequency cepstral coefficients, perceptual linear predictive cepstral coefficients, linear predictive coefficients, line spectral frequencies, fundamental frequency, and formant frequencies are supported. Delta regression and various statistical functionals can be applied to the low-level descriptors. openSMILE is implemented in C++ with no third-party dependencies for the core functionality. It is fast, runs on Unix and Windows platforms, and has a modular, component-based architecture which makes extensions via plug-ins easy. It supports on-line incremental processing for all implemented features as well as off-line and batch processing. Numeric compatibility with future versions is ensured by means of unit tests. openSMILE can be downloaded from http://opensmile.sourceforge.net/.
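openSMILE's standard feature sets are also reachable from Python through a wrapper package named opensmile (published by the toolkit's maintainers, and postdating this paper). A usage sketch, where the file name is a placeholder:

```python
# Usage sketch, assuming `pip install opensmile`; 'example.wav' is a placeholder.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,      # standard functional set
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("example.wav")            # returns a pandas DataFrame
print(features.shape)
```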
Article
Full-text available
The aim of this review paper is to summarize recent developments in the field of wearable sensors and systems that are relevant to the field of rehabilitation. The growing body of work focused on the application of wearable technology to monitor older adults and subjects with chronic conditions in the home and community settings justifies the emphasis of this review paper on summarizing clinical applications of wearable technology currently undergoing assessment rather than describing the development of new wearable sensors and systems. A short description of key enabling technologies (i.e. sensor technology, communication technology, and data analysis techniques) that have allowed researchers to implement wearable systems is followed by a detailed description of major areas of application of wearable technology. Applications described in this review paper include those that focus on health and wellness, safety, home rehabilitation, assessment of treatment efficacy, and early detection of disorders. The integration of wearable and ambient sensors is discussed in the context of achieving home monitoring of older adults and subjects with chronic conditions. Future work required to advance the field toward clinical deployment of wearable sensors and systems is discussed.
Conference Paper
Full-text available
The interaction between human beings and computers will be more natural if computers are able to perceive and respond to human non-verbal communication such as emotions. Although several approaches have been proposed to recognize human emotions based on facial expressions or speech, relatively limited work has been done to fuse these two, and other, modalities to improve the accuracy and robustness of the emotion recognition system. This paper analyzes the strengths and the limitations of systems based only on facial expressions or acoustic information. It also discusses two approaches used to fuse these two modalities: decision-level and feature-level integration. Using a database recorded from an actress, four emotions were classified: sadness, anger, happiness, and neutral state. By the use of markers on her face, detailed facial motions were captured with motion capture, in conjunction with simultaneous speech recordings. The results reveal that the system based on facial expression gave better performance than the system based on just acoustic information for the emotions considered. Results also show the complementarity of the two modalities and that, when these two modalities are fused, the performance and the robustness of the emotion recognition system improve measurably.
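The two fusion strategies can be sketched generically; the classifiers, feature dimensions, and synthetic data below are assumptions for illustration, not the authors' setup.

```python
# Hedged sketch contrasting feature-level and decision-level fusion.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
face = rng.normal(size=(300, 20))    # stand-in facial-motion features
audio = rng.normal(size=(300, 13))   # stand-in acoustic features
y = rng.integers(0, 4, size=300)     # sadness / anger / happiness / neutral

# Feature-level fusion: concatenate modalities, train one classifier.
feat_clf = LogisticRegression(max_iter=1000).fit(np.hstack([face, audio]), y)

# Decision-level fusion: per-modality classifiers, then combine posteriors.
f_clf = LogisticRegression(max_iter=1000).fit(face, y)
a_clf = LogisticRegression(max_iter=1000).fit(audio, y)
fused = (f_clf.predict_proba(face) + a_clf.predict_proba(audio)) / 2
pred = fused.argmax(axis=1)          # fused class decisions
```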
Conference Paper
Full-text available
In this paper we discuss issues surrounding wearable computers used as intelligent health monitors. Unlike existing health monitors (for example, ECG and EEG holters), that are used mainly for data acquisition, the devices we discuss provide real-time feedback to the patient, either as a warning of impending medical emergency or as a monitoring aid during exercise. These medical applications are to be distinguished from applications of wearable computing for medical personnel, e.g. doctors, nurses, and emergency medical technicians. Medical monitoring applications differ from other wearable applications in their I/O requirements, sensors, reliability, privacy issues, and user interface. The paper describes a prototype wearable ECG monitor based upon a high-performance, low-power digital signal processor and the development environment for its design.
Article
The Language ENvironment Analysis (LENA) System is a relatively new recording technology that can be used to investigate typical child language acquisition and populations with language disorders. The purpose of this paper is to familiarize language acquisition researchers and speech-language pathologists with how the LENA System is currently being used in research. The authors outline issues in peer-reviewed research based on the device. Considerations when using the LENA System are discussed.
Article
This paper presents TarsosDSP, a framework for real-time audio analysis and processing. Most libraries and frameworks offer either audio analysis and feature extraction or audio synthesis and processing. TarsosDSP is one of only a few frameworks that offer analysis, processing, and feature extraction in real time, a unique feature in the Java ecosystem. The framework contains practical audio processing algorithms, can be extended easily, and has no external dependencies. Each algorithm is implemented as simply as possible thanks to a straightforward processing pipeline. TarsosDSP's features include a resampling algorithm, onset detectors, a number of pitch estimation algorithms, a time-stretch algorithm, a pitch-shifting algorithm, and an algorithm to calculate the Constant-Q. The framework also allows simple audio synthesis, some audio effects, and several filters. The open-source framework is a valuable contribution to the MIR community and an ideal fit for interactive MIR applications on Android.
Article
Given C samples, with nᵢ observations in the ith sample, a test of the hypothesis that the samples are from the same population may be made by ranking the observations from 1 to Σnᵢ (giving each observation in a group of ties the mean of the ranks tied for), finding the C sums of ranks, and computing a statistic H. Under the stated hypothesis, H is distributed approximately as χ²(C − 1), unless the samples are too small, in which case special approximations or exact tables are provided. One of the most important applications of the test is in detecting differences among the population means.
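A worked example of the test using SciPy's implementation (the sample values are invented):

```python
# Kruskal-Wallis H test for C = 3 independent samples.
from scipy.stats import kruskal

g1 = [6.2, 5.9, 7.1, 6.8]
g2 = [5.1, 4.8, 5.5, 5.0, 5.3]
g3 = [7.9, 8.2, 7.5]

H, p = kruskal(g1, g2, g3)   # H ~ chi-square with C - 1 = 2 df under H0
print(H, p)                  # reject "same population" for small p
```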
Article
The recording of vital signs plays an important role in long-term prevention and in the rehabilitation of elderly people. In particular, the progress of rehabilitation can be deduced from recordings of an electrocardiogram (ECG), blood pressure, and body temperature. In this paper we present a wirelessly coupled recording device for long-term monitoring of these vital-sign signals. We record the ECG, the blood pressure, and the skin temperature, and include a 3D-acceleration sensor for the determination of movements during recording. To deal with motion artifacts in all recorded properties we use data fusion to reject or correct distorted vital-sign signals.
Article
We introduce the Friends and Family study, a longitudinal living laboratory in a residential community. In this study, we employ a ubiquitous computing approach, Social Functional Mechanism-design and Relationship Imaging, or Social fMRI, that combines extremely rich data collection with the ability to conduct targeted experimental interventions with study populations. We present our mobile-phone-based social and behavioral sensing system, deployed in the wild for over 15 months. Finally, we present three investigations performed during the study, looking into the connection between individuals’ social behavior and their financial status, network effects in decision making, and a novel intervention aimed at increasing physical activity in the subject population. Results demonstrate the value of social factors for choice, motivation, and adherence, and enable quantifying the contribution of different incentive mechanisms.
Article
In this paper, we describe the use of the sociometer, a wearable sensor package, for measuring face-to-face interactions between people. We develop methods for learning the structure and dynamics of human communication networks. Knowledge of how people interact is important in many disciplines, e.g. organizational behavior, social network analysis and knowledge management applications such as expert finding. At present researchers mainly have to rely on questionnaires, surveys or diaries in order to obtain data on physical interactions between people. In this paper, we show how noisy sensor measurements from the sociometer can be used to build computational models of group interactions. Using statistical pattern recognition techniques such as dynamic Bayesian network models we can automatically learn the underlying structure of the network and also analyze the dynamics of individual and group interactions. We present preliminary results on how we can learn the structure of face-to-face interactions within a group, detect when members are in face-to-face proximity and also when they are having a conversation. We also measure the duration and frequency of interactions between people and the participation level of each individual in a conversation.