Conference Paper

Investigating the Effect of Feature Distribution Shift on the Performance of Sleep Stage Classification with Consumer Sleep Trackers

Authors:
  • Kyoto University of Advanced Science

Abstract

The performance of consumer wearable sleep trackers can vary significantly from user to user. This paper presents a pilot study on the effect of feature distribution shift between training and test datasets on the performance of sleep classification models. We used the Anderson-Darling (AD) statistic to quantify the shift in feature distribution, which was then correlated with several model performance metrics. Our results show that the distribution shifts of some features were negatively correlated with model accuracy for REM and NREM sleep classification. Future studies on consumer sleep tracking technologies may focus on addressing the dataset shift problem in the model training process.
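The approach described in the abstract can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's implementation: feature names, sample sizes, and the toy accuracy model are assumptions. It uses the two-sample Anderson-Darling statistic (scipy's `anderson_ksamp`) to quantify per-user train/test shift, then correlates that shift with per-user accuracy.

```python
# Sketch of the paper's approach on synthetic data (names and the toy
# accuracy model are illustrative assumptions, not from the paper).
import numpy as np
from scipy.stats import anderson_ksamp, pearsonr

rng = np.random.default_rng(0)
n_users = 20

ad_stats, accuracies = [], []
for u in range(n_users):
    train_feat = rng.normal(0.0, 1.0, 500)   # e.g. a heart-rate feature in the training set
    shift = rng.uniform(0.0, 2.0)            # per-user shift magnitude
    test_feat = rng.normal(shift, 1.0, 500)  # the same feature in this user's test data

    # Two-sample AD statistic quantifies how far the test distribution
    # has drifted from the training distribution.
    ad_stats.append(anderson_ksamp([train_feat, test_feat]).statistic)

    # Toy stand-in: accuracy degrades with shift. In the paper this would be
    # the measured per-user accuracy of the sleep classifier.
    accuracies.append(0.9 - 0.1 * shift + rng.normal(0, 0.02))

r, p = pearsonr(ad_stats, accuracies)
print(f"correlation between AD shift and accuracy: r={r:.2f}, p={p:.3g}")
```

Because the toy accuracy is constructed to fall as the shift grows, the correlation comes out negative, mirroring the direction of the effect the abstract reports.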

Article
Full-text available
Consumer wearable activity trackers such as Fitbit are widely used in ubiquitous and longitudinal sleep monitoring in free-living environments. However, these devices are known to be inaccurate for measuring sleep stages. In this study, we develop and validate a novel approach that leverages the processed data readily available from consumer activity trackers (i.e., steps, heart rate, and sleep metrics) to predict sleep stages. The proposed approach adopts a selective correction strategy and consists of two levels of classifiers. The level-I classifier judges whether a Fitbit-labelled sleep epoch is misclassified, and the level-II classifier reclassifies misclassified epochs into one of the four sleep stages (i.e., light sleep, deep sleep, REM sleep, and wakefulness). Best epoch-wise performance was achieved when a support vector machine and a gradient boosting decision tree (XGBoost) with upsampling were used at the level-I and level-II classification, respectively. The model achieved an overall per-epoch accuracy of 0.731 ± 0.119, Cohen's Kappa of 0.433 ± 0.212, and multi-class Matthew's correlation coefficient (MMCC) of 0.451 ± 0.214. Regarding the total duration of individual sleep stages, the mean normalized absolute bias (MAB) of this model was 0.469, a 23.9% reduction relative to the proprietary Fitbit algorithm. The model that combines a support vector machine and XGBoost with downsampling achieved a sub-optimal per-epoch accuracy of 0.704 ± 0.097, Cohen's Kappa of 0.427 ± 0.178, and MMCC of 0.439 ± 0.180. The sub-optimal model obtained a MAB of 0.179, a significant reduction of 71.0% compared to the proprietary Fitbit algorithm. We highlight the challenges in machine learning based sleep stage prediction with consumer wearables, and suggest directions for future research.
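The two-level selective correction strategy above can be sketched as follows. This is a toy reconstruction on synthetic data: the features, noise model, and use of scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost (and the omission of the paper's resampling step) are all assumptions for illustration.

```python
# Minimal sketch of two-level "selective correction" (synthetic data;
# GradientBoostingClassifier stands in for the paper's XGBoost).
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 1000
true_stage = rng.integers(0, 4, n)  # 0=light, 1=deep, 2=REM, 3=wake
# Device labels: correct ~70% of the time, otherwise a random stage.
device_stage = np.where(rng.random(n) < 0.7, true_stage, rng.integers(0, 4, n))

# Per-epoch features: a noisy physiological proxy, the device's own label,
# and some uninformative filler columns.
X = np.column_stack([
    true_stage + rng.normal(0, 0.3, n),
    device_stage.astype(float),
    rng.normal(size=(n, 3)),
])

# Level I: binary classifier -- is the device label misclassified?
misclassified = (device_stage != true_stage).astype(int)
level1 = SVC().fit(X, misclassified)

# Level II: reclassify flagged epochs into one of the four stages,
# trained on the epochs the device actually got wrong.
level2 = GradientBoostingClassifier().fit(
    X[misclassified == 1], true_stage[misclassified == 1]
)

# Selective correction: keep device labels unless level I flags the epoch.
flag = level1.predict(X).astype(bool)
corrected = device_stage.copy()
if flag.any():
    corrected[flag] = level2.predict(X[flag])
print("epochs corrected:", int(flag.sum()))
```

The design choice worth noting is that level II only ever sees epochs judged wrong by level I, so the bulk of (correct) device labels pass through untouched.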
Article
Full-text available
Wearable, multisensor, consumer devices that estimate sleep are now commonplace, but the algorithms used by these devices to score sleep are not open source, and the raw sensor data is rarely accessible for external use. As a result, these devices are limited in their usefulness for clinical and research applications, despite holding much promise. We used a mobile application of our own creation to collect raw acceleration data and heart rate from the Apple Watch worn by participants undergoing polysomnography, as well as during the ambulatory period preceding in lab testing. Using this data, we compared the contributions of multiple features (motion, local standard deviation in heart rate, and “clock proxy”) to performance across several classifiers. Best performance was achieved using neural nets, though the differences across classifiers were generally small. For sleep-wake classification, our method scored 90% of epochs correctly, with 59.6% of true wake epochs (specificity) and 93% of true sleep epochs (sensitivity) scored correctly. Accuracy for differentiating wake, NREM sleep, and REM sleep was approximately 72% when all features were used. We generalized our results by testing the models trained on Apple Watch data using data from the Multi-ethnic Study of Atherosclerosis (MESA), and found that we were able to predict sleep with performance comparable to testing on our own dataset. This study demonstrates, for the first time, the ability to analyze raw acceleration and heart rate data from a ubiquitous wearable device with accepted, disclosed mathematical methods to improve accuracy of sleep and sleep stage prediction.
Article
Full-text available
Consumer sleep tracking technologies offer an unobtrusive and cost-efficient way to monitor sleep in free-living conditions. Technological advances in hardware and software have significantly improved the functionality of the new gadgets that recently appeared in the market. However, whether the latest gadgets can provide valid measurements on overall sleep parameters and sleep structure such as deep and REM sleep has not been examined. In this study, we aimed to investigate the validity of the latest consumer sleep tracking devices including an activity wristband Fitbit Charge 2 and a wearable EEG-based eye mask Neuroon in comparison to a medical sleep monitor. First, we confirmed that Fitbit Charge 2 can automatically detect the onset and offset of sleep with reasonable accuracy. Second, analysis found that both consumer devices produced comparable results in measuring total sleep duration and sleep efficiency compared to the medical device. In addition, Fitbit accurately measured the number of awakenings, while Neuroon with good signal quality had satisfactory performance on total awake time and sleep onset latency. However, measuring sleep structure including light, deep, and REM sleep remains to be challenging for both consumer devices. Third, greater discrepancies were observed between Neuroon and the medical device in nights with more disrupted sleep and when the signal quality was poor, but no trend was observed in Fitbit Charge 2. This study suggests that current consumer sleep tracking technologies may be immature for diagnosing sleep disorders, but they are reasonably satisfactory for general purpose and non-clinical use.
Article
Full-text available
Getting enough quality sleep is a key part of a healthy lifestyle. Many people are tracking their sleep through mobile and wearable technology, together with contextual information that may influence sleep quality, like exercise, diet, and stress. However, there is limited support to help people make sense of this wealth of data, i.e., to explore the relationship between sleep data and contextual data. We strive to bridge this gap between sleep-tracking and sense-making through the design of SleepExplorer, a web-based tool that helps individuals understand sleep quality through multi-dimensional sleep structure and explore correlations between sleep data and contextual information. Based on a two-week field study with 12 participants, this paper offers a rich understanding on how technology can support sense-making on personal sleep data: SleepExplorer organizes a flux of sleep data into sleep structure, guides sleep-tracking activities, highlights connections between sleep and contributing factors, and supports individuals in taking actions. We discuss challenges and opportunities to inform the work of researchers and designers creating data-driven health and well-being applications.
Article
Full-text available
This paper introduces the two-sample Anderson-Darling (AD) test of goodness of fit as a tool for comparing distributions, response time distributions in particular. We discuss the problematic use of pooling response times across participants, and alternative tests of distributions, the most common being the Kolmogorov-Smirnov (KS) test. We compare the KS test and the AD test, presenting conclusive evidence that the AD test is more powerful: when comparing two distributions that vary (1) in shift only, (2) in scale only, (3) in symmetry only, or (4) that have the same mean and standard deviation but differ on the tail ends only, the AD test proves to detect differences better than the KS test. In addition, the AD test has a type I error rate corresponding to alpha, whereas the KS test is overly conservative. Finally, the AD test requires less data than the KS test to reach sufficient statistical power.
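Case (4) above, the one where the two tests differ most sharply, is easy to set up. A quick illustration with scipy (the setup is an assumption for illustration, not the paper's simulation design): two samples matched in mean and standard deviation that differ only in tail weight.

```python
# Two samples with identical mean and SD that differ mainly in the tails:
# a normal sample vs. a standardized heavy-tailed Student-t sample.
import numpy as np
from scipy.stats import anderson_ksamp, ks_2samp

rng = np.random.default_rng(1)
a = rng.normal(0, 1, 300)           # light tails
b = rng.standard_t(df=3, size=300)  # heavy tails
b = (b - b.mean()) / b.std()        # force mean 0, SD 1 to isolate the tails

ad = anderson_ksamp([a, b])
ks = ks_2samp(a, b)
print(f"AD statistic = {ad.statistic:.2f}, approx significance = {ad.significance_level:.3f}")
print(f"KS statistic = {ks.statistic:.2f}, p = {ks.pvalue:.3f}")
```

Because the KS statistic is driven by the largest gap between empirical CDFs (which for tail-only differences sits far from the center where data are sparse), while the AD statistic weights discrepancies in the tails more heavily, this is exactly the regime where the AD test tends to detect a difference the KS test misses.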
Article
The field of dataset shift has received a growing amount of interest in the last few years. The fact that most real-world applications have to cope with some form of shift makes its study highly relevant. The literature on the topic is mostly scattered, and different authors use different names to refer to the same concepts, or use the same name for different concepts. With this work, we attempt to present a unifying framework through the review and comparison of some of the most important works in the literature.