Conference Paper

SynRhythm: Learning a Deep Heart Rate Estimator from General to Specific

... (Tulyakov et al. 2016; Wang et al. 2016; Hsu, Ambikapathi, and Chen 2017; Niu et al. 2018, 2019a; Qiu et al. 2018). These methods initially identified periodic changes in rPPG but struggled with weak signals due to artifacts and noise, requiring complex preprocessing and lacking effective contextual integration. ...
... Non-end-to-end methods included predefined facial ROIs in different color spaces to extract STmaps (Niu et al. 2018, 2019b) or their variants, and multi-scale STmaps (Lu, Han, and Zhou 2021; Niu et al. 2020), followed by cascaded CNNs (He et al. 2016) and RNNs (Cho et al. 2014) for feature extraction. However, these methods still required manual feature extraction and struggled with significant head movements. ...
... To fully verify PhysMamba's effectiveness and superiority in rPPG heart rate estimation, we designed a detailed evaluation process, including intra-dataset and cross-dataset evaluations. We compared traditional methods (Verkruysse, Svaasand, and Nelson 2008; Poh, McDuff, and Picard 2010b; De Haan and Jeanne 2013; Pilz et al. 2018; De Haan and Van Leest 2014; Wang et al. 2016) and advanced deep learning methods (Špetlík, Franc, and Matas 2018; Chen and McDuff 2018; Niu et al. 2018; Yu, Li, and Zhao 2019; Lee, Chen, and Lee 2020; Liu et al. 2020; Tsou et al. 2020; Song et al. 2021; Lu, Han, and Zhou 2021; Lokendra and Puneet 2022; Comas, Ruiz, and Sukno 2022; Yu et al. 2022; Liu et al. 2023; Gupta et al. 2023; Li, Yu, and Shi 2023; Lee et al. 2023; Zhang et al. 2023; Zou et al. 2024). The best results are in bold, and the second-best results are underlined. We divided the dataset into training, validation, and test sets in a 7:1:2 ratio. ...
Preprint
Remote Photoplethysmography (rPPG) is a non-contact technique for extracting physiological signals from facial videos, used in applications like emotion monitoring, medical assistance, and anti-face spoofing. Unlike controlled laboratory settings, real-world environments often contain motion artifacts and noise, affecting the performance of existing methods. To address this, we propose PhysMamba, a dual-stream time-frequency interactive model based on Mamba. PhysMamba integrates the state-of-the-art Mamba-2 model and employs a dual-stream architecture to learn diverse rPPG features, enhancing robustness in noisy conditions. Additionally, we designed the Cross-Attention State Space Duality (CASSD) module to improve information exchange and feature complementarity between the two streams. We validated PhysMamba using PURE, UBFC-rPPG and MMPD. Experimental results show that PhysMamba achieves state-of-the-art performance across various scenarios, particularly in complex environments, demonstrating its potential in practical remote heart rate monitoring applications.
... This step aims to obfuscate facial appearances while preserving the rPPG information. Since rPPG signals are spatially redundant at different facial regions and largely independent of spatial information as shown by [40,27], rPPG signals can be well preserved in this step while facial appearances are completely erased. The reasons for face de-identification are twofold. ...
... The cropped facial video $v \in \mathbb{R}^{T \times H \times W \times 3}$, where $T$, $H$, and $W$ are time length, height, and width, is downsampled by averaging the pixels in a sample region to get $v_d \in \mathbb{R}^{T \times 6 \times 6 \times 3}$. It has been demonstrated that such downsampled facial videos are still effective in rPPG estimation [40,27]. Since rPPG signal extraction does not largely depend on spatial information [40], we further permute the pixels to completely obfuscate the spatial information to get $v_{de} \in \mathbb{R}^{T \times 6 \times 6 \times 3}$. ...
... Note that the permutation pattern is the same for each frame in a video but distinct for different videos. Since the spatial information is eliminated, we reshape the de-identified video $v_{de}$ into a spatiotemporal (ST) map $M \in \mathbb{R}^{36 \times T \times 3}$ for compact rPPG representation like [27]. ...
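The de-identification steps quoted above are concrete enough to sketch. The NumPy sketch below follows the described pipeline (average-pool the cropped clip to 6x6, apply one per-video pixel permutation, reshape to a (36, T, 3) ST map); the function and variable names are illustrative, not taken from the authors' code.

```python
import numpy as np

def deidentify_to_stmap(video, rng=None):
    """Sketch of the described de-identification steps.

    video: float array of shape (T, H, W, 3), a cropped facial clip.
    Returns an ST map of shape (36, T, 3) with spatial layout obfuscated.
    """
    rng = np.random.default_rng() if rng is None else rng
    T, H, W, _ = video.shape
    # 1) Downsample to 6x6 by averaging the pixels inside each sample region.
    h_bins = np.array_split(np.arange(H), 6)
    w_bins = np.array_split(np.arange(W), 6)
    v_d = np.stack([
        np.stack([video[:, hb][:, :, wb].mean(axis=(1, 2)) for wb in w_bins], axis=1)
        for hb in h_bins
    ], axis=1)  # (T, 6, 6, 3)
    # 2) Permute the 36 pixels with one pattern shared by all frames of this video.
    perm = rng.permutation(36)
    v_de = v_d.reshape(T, 36, 3)[:, perm]
    # 3) Reshape the de-identified video into a spatiotemporal map M of shape (36, T, 3).
    return v_de.transpose(1, 0, 2)
```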
Preprint
Full-text available
Remote photoplethysmography (rPPG) is a non-contact method for measuring cardiac signals from facial videos, offering a convenient alternative to contact photoplethysmography (cPPG) obtained from contact sensors. Recent studies have shown that each individual possesses a unique cPPG signal morphology that can be utilized as a biometric identifier, which has inspired us to utilize the morphology of rPPG signals extracted from facial videos for person authentication. Since the facial appearance and rPPG are mixed in the facial videos, we first de-identify facial videos to remove facial appearance while preserving the rPPG information, which protects facial privacy and guarantees that only rPPG is used for authentication. The de-identified videos are fed into an rPPG model to get the rPPG signal morphology for authentication. In the first training stage, unsupervised rPPG training is performed to get coarse rPPG signals. In the second training stage, an rPPG-cPPG hybrid training is performed by incorporating external cPPG datasets to achieve rPPG biometric authentication and enhance rPPG signal morphology. Our approach needs only de-identified facial videos with subject IDs to train rPPG authentication models. The experimental results demonstrate that rPPG signal morphology hidden in facial videos can be used for biometric authentication. The code is available at https://github.com/zhaodongsun/rppg_biometrics.
... Moreover, collecting facial videos with physiological ground truth labels is costly because it requires medical devices and presents privacy concerns, leading to a severe lack of diversified rPPG data for training. The two main solutions to the data problem have been to create new data via data augmentation [29]-[33] and synthetic data [34]-[36], or to utilize real data without any labels via self-supervised learning (SSL) methods. Most SSL methods rely on generating contrastive positive and negative samples from data. ...
... Another approach is to train completely on synthetic data. In [34] the authors constructed a dataset of synthetic spatial-temporal maps. Synthetic avatars were also proposed in [35], where synthetic videos with underlying physiological signals were generated. ...
Article
Full-text available
Remote photoplethysmography (rPPG) uses RGB facial videos to measure cardiac signals. It holds promise for future applications in telemedicine, affective computing, liveness-based face anti-spoofing, driver monitoring, etc. Supervised deep learning methods have been leading in performance but are severely limited by data availability, as recording face videos with ground truth physiological signals is expensive. Recent self-supervised methods aim to solve the data issue but struggle to learn robust features from data in challenging scenarios. These scenarios are characterized by overwhelming environmental noise caused by head movements, illumination variations, and recording device changes. We propose RS+rPPG, a novel contrastive method that effectively leverages a large set of eleven rPPG priors, enabling strong self-supervision even with challenging data. RS+rPPG comprehensively exploits intra-data and inter-data information present in videos via diverse augmentations and learning constraints. We extensively experimented on seven rPPG datasets and demonstrated that RS+rPPG can outperform state-of-the-art supervised methods without using any labels. Additionally, we demonstrate the high generalization capability, demographic fairness, and mixed-data stability of our method.
... Taking into account the temporal trace of rPPG signals obtained from independent or uncorrelated signal sources under specific assumptions, researchers have proposed various signal decomposition methods for HR measurement [2], [3], [23], [25], [26], [27], [28], [29], [30], [31]. Poh et al. [29] assumed that the RGB channels were independent components and used ICA to temporally filter the RGB channels for obtaining rPPG signals. ...
... Poh et al. [29] assumed that the RGB channels were independent components and used ICA to temporally filter the RGB channels for obtaining rPPG signals. Some studies utilized the green channel to extract the rPPG signal as it exhibited the strongest rPPG signal [30], [31]. Lewandowska and Nowak [25] presented a new framework for HR measurement based on the PCA method. ...
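As an illustration of the ICA pipeline attributed to Poh et al. [29] in the snippets above, here is a hedged Python sketch: the spatially averaged R/G/B traces are unmixed with FastICA, and the component with the most concentrated in-band spectral peak is taken as the pulse. The band limits and peak-picking rule are common conventions, not necessarily those of the original paper.

```python
import numpy as np
from sklearn.decomposition import FastICA

def ica_rppg(rgb_traces, fs):
    """Sketch of ICA-based rPPG extraction.

    rgb_traces: array of shape (T, 3), mean R/G/B of the face ROI per frame.
    fs: frame rate in Hz. Returns (pulse_signal, hr_bpm).
    """
    # Standardize each channel, then unmix into three independent components.
    x = (rgb_traces - rgb_traces.mean(0)) / (rgb_traces.std(0) + 1e-8)
    sources = FastICA(n_components=3, random_state=0).fit_transform(x)  # (T, 3)

    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band = (freqs >= 0.7) & (freqs <= 4.0)  # plausible HR band (assumption)

    def peakiness(s):
        # Fraction of in-band power concentrated in the strongest bin.
        psd = np.abs(np.fft.rfft(s)) ** 2
        return psd[band].max() / (psd[band].sum() + 1e-8)

    pulse = max(sources.T, key=peakiness)
    psd = np.abs(np.fft.rfft(pulse)) ** 2
    hr_bpm = 60.0 * freqs[band][np.argmax(psd[band])]
    return pulse, hr_bpm
```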
Article
Remote photoplethysmography (rPPG) is an essential way of monitoring the physiological indicator heart rate (HR), which has important guiding significance for preventing and controlling cardiovascular diseases. However, most existing HR measurement approaches require ideal illumination conditions, and the illumination variation in a realistic situation is complicated. In view of this issue, this paper proposes a robust HR measurement method to reduce performance degradation due to unstable illumination in facial videos. Specifically, two complementary color spaces (RGB and Multi-Scale Retinex (MSR)) are abundantly utilized by exploring the potential of space-shared information and space-specific characteristics. Subsequently, the time-space Transformer equipped with sequential feature aggregation (TST-SFA) is exploited to extract physiological signal features. In addition, a novel optimization strategy for model learning, including affinity variation, discrepancy, and task losses, is proposed to jointly train the whole algorithm in an end-to-end manner. Experimental results on three public datasets show that our proposed method outperforms other approaches and can achieve more accurate HR measurement under different illuminations. The code will be released at https://github.com/Llili314/IRHrNet.
... Early research on rPPG primarily relied on traditional signal processing methods to recover weak rPPG signals from facial videos (Verkruysse, Svaasand, and Nelson 2008; Poh, McDuff, and Picard 2010; De Haan and Jeanne 2013; Wang et al. 2016). In recent years, data-driven approaches have dominated due to their remarkable performance, showcasing a trend in the transition of backbones from 2D CNNs (Špetlík, Franc, and Matas 2018; Niu et al. 2018; Chen and McDuff 2018; Niu et al. 2020; Liu et al. 2020) to 3D CNNs (Zhao et al. 2021). [Figure 2 caption: Performance and efficiency evaluation for intra-dataset testing on MMPD; the diameter of the circle indicates the peak GPU memory.] ...
Article
Remote photoplethysmography (rPPG) is a method for non-contact measurement of physiological signals from facial videos, holding great potential in various applications such as healthcare, affective computing, and anti-spoofing. Existing deep learning methods struggle to address two core issues of rPPG simultaneously: understanding the periodic pattern of rPPG among long contexts and addressing large spatiotemporal redundancy in video segments. These represent a trade-off between computational complexity and the ability to capture long-range dependencies. In this paper, we introduce RhythmMamba, a state space model-based method that captures long-range dependencies while maintaining linear complexity. By viewing rPPG as a time series task through the proposed frame stem, the periodic variations in pulse waves are modeled as state transitions. Additionally, we design multi-temporal constraint and frequency domain feed-forward, both aligned with the characteristics of rPPG time series, to improve the learning capacity of Mamba for rPPG signals. Extensive experiments show that RhythmMamba achieves state-of-the-art performance with 319% throughput and 23% peak GPU memory.
... Pulse signal estimation can be separated into signal processing-based methods [5], [8], [14]-[17], [23] and deep neural network methods [3], [4], [7], [24]-[32]. We discuss each individually below. ...
Preprint
Remote estimation of vital signs enables health monitoring for situations in which contact-based devices are either not available, too intrusive, or too expensive. In this paper, we present a modular, interpretable pipeline for pulse signal estimation from video of the face that achieves state-of-the-art results on publicly available datasets. Our imaging photoplethysmography (iPPG) system consists of three modules: face and landmark detection, time-series extraction, and pulse signal/pulse rate estimation. Unlike many deep learning methods that make use of a single black-box model that maps directly from input video to output signal or heart rate, our modular approach enables each of the three parts of the pipeline to be interpreted individually. The pulse signal estimation module, which we call TURNIP (Time-Series U-Net with Recurrence for Noise-Robust Imaging Photoplethysmography), allows the system to faithfully reconstruct the underlying pulse signal waveform and uses it to measure heart rate and pulse rate variability metrics, even in the presence of motion. When parts of the face are occluded due to extreme head poses, our system explicitly detects such "self-occluded" regions and maintains estimation robustness despite the missing information. Our algorithm provides reliable heart rate estimates without the need for specialized sensors or contact with the skin, outperforming previous iPPG methods on both color (RGB) and near-infrared (NIR) datasets.
... The corresponding performance results are shown in Table 1. It can be observed that while the baseline outperforms some traditional heart rate estimation methods like POS, CHROM, and modern DNN-based supervised methods such as SynRhythm [34] and Meta-rPPG [15], it falls short compared to other deep learning methods like RADIANT and DeepPhys. After integrating our module, the network achieved better results across MAE, RMSE, and r, demonstrating the effectiveness of our approach in restoring signal periodicity. ...
Article
Full-text available
Remote physiological measurement estimates heart rate (HR) without contact by analyzing skin color changes from facial videos. This non-intrusive method can be useful in healthcare, border security, and lie detection. These changes are subtle, imperceptible to the naked eye, and easily affected by variations in lighting and motion artifacts from the subject in front of the camera. Traditional methods often use aggressive noise reduction techniques in the presence of real-world noise, which can obscure heart rate information and lead to distorted detection. This study proposes a more effective method for remote heart rate detection that addresses interference by enhancing facial spatial geometric features using the progressive transformation properties of Recursive Spatial Transformation (ReST). It also integrates motion features obtained from optical flow for more stable spatio-temporal feature enhancement. Additionally, we introduce a Multi-Scale Temporal Convolution Module (TDCM) to capture periodic signal changes across different time scales, modeling the periodicity of heart rate signals from various scales to achieve robust recovery of rPPG signals. The entire model has about 30% of the parameters of PhysFormer. Experiments on multiple remote physiological signal measurement datasets demonstrate that the proposed method significantly improves heart rate estimation across various metrics, particularly showing strong robustness in handling videos with severe head movements.
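The multi-scale temporal convolution idea behind TDCM can be pictured as parallel dilated 1-D convolutions over the signal axis, each branch seeing a different time scale of the pulse. The PyTorch block below is a hypothetical sketch of such a module; the channel counts, kernel sizes, and dilations are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiScaleTemporalConv(nn.Module):
    """Hypothetical TDCM-style block: parallel dilated temporal convolutions
    capture periodic variation at several scales; a pointwise convolution
    fuses the branches and a residual keeps the original signal."""

    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        self.fuse = nn.Conv1d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x):                 # x: (batch, channels, time)
        multi = torch.cat([b(x) for b in self.branches], dim=1)
        return x + self.fuse(multi)       # residual connection
```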
... In recent years, deep learning-based methods [67], [79], [54], [31], [43], [48] have surpassed conventional techniques, achieving state-of-the-art performance in estimating vital signs from facial videos. Some combine knowledge from traditional methods with Convolutional Neural Networks (CNNs) to exploit sophisticated features [46], [47], [66]. Recent works like [31], [40] explore unsupervised approaches using meta-learning, demonstrating improved generalization in out-of-distribution cases. ...
Article
Full-text available
Despite the recent advances in remote heart rate measurement, most improvements primarily focus on recovering the rPPG signal, often overlooking the inherent challenges of estimating heart rate (HR) from the derived signal. Furthermore, most existing methods adopt the average HR per video to assess model performance, thus relying on rather large temporal windows to produce a single estimate; this hampers their applicability to scenarios in which the continuous monitoring of a patient’s physiological status is crucial. Besides, this evaluation approach can also lead to biased performance assessments due to low continuous precision, as it considers only the mean value of the entire video. In this paper, we present the PulseFormer, a novel continuous deep estimator for remote HR. Our proposed method utilizes a time-frequency attention block that leverages the enhanced resolution properties of the Chirp-Z Transform (CZT) to accurately estimate HR from the recovered low-resolution signal using a reduced temporal window size. We validate the effectiveness of our model on the large-scale Vision-for-Vitals (V4V) benchmark, designed for continuous physiological signals estimation from facial videos. The results reveal outstanding frame-to-frame HR estimation capabilities, establishing the proposed approach as a robust and versatile estimator that could be used with any rPPG method.
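The Chirp-Z Transform (CZT) at the core of PulseFormer's HR readout can be illustrated with SciPy. The sketch below zooms the spectrum of a recovered signal into the HR band for fine-grained peak picking from a short window; the band edges and bin count are illustrative choices, not the paper's settings.

```python
import numpy as np
from scipy.signal import czt  # available in SciPy >= 1.8

def hr_from_czt(signal, fs, f_lo=0.7, f_hi=3.0, m=512):
    """Estimate HR (bpm) by zooming the spectrum into [f_lo, f_hi] Hz.

    The CZT evaluates m spectral bins inside the band, giving much finer
    resolution than an FFT of the same short window.
    """
    x = signal - signal.mean()
    a = np.exp(2j * np.pi * f_lo / fs)                  # start of the zoomed band
    w = np.exp(-2j * np.pi * (f_hi - f_lo) / (m * fs))  # per-bin frequency step
    spectrum = np.abs(czt(x, m=m, w=w, a=a))
    freqs = f_lo + np.arange(m) * (f_hi - f_lo) / m
    return 60.0 * freqs[np.argmax(spectrum)]            # beats per minute
```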
... We first conduct intra-dataset HR estimation on four datasets. As shown in Table 1, we compare the proposed VL-phys with three traditional methods (POS (Wang et al., 2017), CHROM (De Haan & Jeanne, 2013), and Green (Verkruysse et al., 2008)); eight DNN-based supervised methods (SynRhythm (Niu et al., 2018), Meta-rPPG (Lee et al., 2020a), PulseGan (Song et al., 2021), Dual-Gan (Lu et al., 2021), Physformer (Yu et al., 2022b), Du et al. (2023), Dual-TL (Qian et al., 2024), and ND-DeeprPPG (Liu & Yuen, 2024)); and four DNN-based self-supervised methods (Gideon and Stent (2021), Contrast-Phys+ (Sun & Li, 2024), Yue et al. (2023), and SiNC (Speth et al., 2023)). ...
Article
Full-text available
Facial video-based remote physiological measurement is a promising research area for detecting human vital signs (e.g., heart rate, respiration frequency) in a non-contact way. Conventional approaches are mostly supervised learning, requiring extensive collections of facial videos and synchronously recorded photoplethysmography (PPG) signals. To tackle it, self-supervised learning has recently gained attention; due to the lack of ground truth PPG signals, its performance is however limited. In this paper, we propose a novel frequency-centric self-supervised framework that successfully integrates the popular vision-language models (VLMs) into the remote physiological measurement task. Given a facial video, we first augment its positive and negative video samples with varying rPPG signal frequencies. Next, we introduce a frequency-oriented vision-text pair generation method by carefully creating contrastive spatio-temporal maps from positive and negative samples and designing proper text prompts to describe their relative ratios of signal frequencies. A pre-trained VLM is employed to extract features for these formed vision-text pairs and estimate rPPG signals thereafter. We develop a series of frequency-related generative and contrastive learning mechanisms to optimize the VLM, including the text-guided visual reconstruction task, the vision-text contrastive learning task, and the frequency contrastive and ranking task. Overall, our method for the first time adapts VLMs to digest and align the frequency-related knowledge in vision and text modalities. Extensive experiments on four benchmark datasets demonstrate that it significantly outperforms state-of-the-art self-supervised methods. Our codes will be available at https://github.com/yuezijie/Bootstrapping-VLM-for-Frequency-centric-Self-supervised-Remote-Physiological-Measurement.
... This capability allows the detection of subtle differences between frames, which is the operational principle of some end-to-end models based on convolutional algorithms. Additionally, approaches using Spatiotemporal Maps (STMap) [24,25,26,19,47,30] craft three-dimensional spatiotemporal features into two-dimensional feature maps processed through convolutional networks. This method reduces the dimensionality of network input but increases preprocessing complexity. ...
Preprint
Full-text available
Remote photoplethysmography (rPPG) extracts PPG signals from subtle color changes in facial videos, showing strong potential for health applications. However, most rPPG methods rely on intensity differences between consecutive frames, missing long-term signal variations affected by motion or lighting artifacts, which reduces accuracy. This paper introduces Temporal Normalization (TN), a flexible plug-and-play module compatible with any end-to-end rPPG network architecture. By capturing long-term temporally normalized features following detrending, TN effectively mitigates motion and lighting artifacts, significantly boosting the rPPG prediction performance. When integrated into four state-of-the-art rPPG methods, TN delivered performance improvements ranging from 34.3% to 94.2% in heart rate measurement tasks across four widely-used datasets. Notably, TN showed even greater performance gains in smaller models. We further discuss and provide insights into the mechanisms behind TN's effectiveness.
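The abstract does not spell out the TN operator, so the following is only a plausible reading under stated assumptions: detrend each spatial location's temporal trace, then standardize it, which suppresses long-term drift from motion or lighting while keeping relative pulsatile variation.

```python
import numpy as np
from scipy.signal import detrend

def temporal_normalize(features, eps=1e-6):
    """Assumed TN-style operator (the paper's exact module may differ).

    features: array of shape (T, ...) with time on axis 0.
    Returns detrended, per-location standardized temporal features.
    """
    x = detrend(features, axis=0)                       # remove slow linear trend
    return x / (x.std(axis=0, keepdims=True) + eps)     # unit variance over time
```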
... Constantly evolving deep learning (DL) techniques have laid the foundation for more accurate and robust visual heart rate sensing [17,70]. Specifically, CNN-based approaches utilize feature extraction networks to obtain high-quality rPPG signals and then estimate heart rates based on signal features, mainly including 2D CNNs [11,50,57] and 3D CNNs [61,83]. To capture richer temporal information of rPPG, RNNs have been incorporated into heart rate sensing to mitigate sensing data noise [5], model the correlation between adjacent frames [58], and extract global temporal information from continuous facial video frames [34]. ...
Preprint
Ubiquitous on-device heart rate sensing is vital for high-stress individuals and chronic patients. Non-contact sensing, compared to contact-based tools, allows for natural user monitoring, potentially enabling more accurate and holistic data collection. However, in open and uncontrolled mobile environments, user movement and lighting introduce noise. Existing methods, such as curve-based or short-range deep learning recognition based on adjacent frames, struggle to strike the optimal balance between real-time performance and accuracy, especially under limited device resources. In this paper, we present UbiHR, a ubiquitous device-based heart rate sensing system. Key to UbiHR is a real-time long-range spatio-temporal model enabling noise-independent heart rate recognition and display on commodity mobile devices, along with a set of mechanisms for prompt and energy-efficient sampling and preprocessing. Diverse experiments and user studies involving four devices, four tasks, and 80 participants demonstrate UbiHR's superior performance, enhancing accuracy by up to 74.2% and reducing latency by 51.2%.
... With the powerful modeling capacity of deep learning (DL), there has been a significant shift towards employing DL models for remote physiological estimation, yielding impressive results [4, 16, 41, 43, 52, 66, 69]. Many of these methods extract BVP signals from face videos typically by aggregating spatial and temporal information [44, 52, 69, 70]. ...
Preprint
Remote photoplethysmography (rPPG) is gaining prominence for its non-invasive approach to monitoring physiological signals using only cameras. Despite its promise, the adaptability of rPPG models to new, unseen domains is hindered due to the environmental sensitivity of physiological signals. To address this, we pioneer the Test-Time Adaptation (TTA) in rPPG, enabling the adaptation of pre-trained models to the target domain during inference, sidestepping the need for annotations or source data due to privacy considerations. Particularly, utilizing only the user's face video stream as the accessible target domain data, the rPPG model is adjusted by tuning on each single instance it encounters. However, 1) TTA algorithms are designed predominantly for classification tasks, ill-suited in regression tasks such as rPPG due to inadequate supervision. 2) Tuning pre-trained models in a single-instance manner introduces variability and instability, posing challenges to effectively filtering domain-relevant from domain-irrelevant features while simultaneously preserving the learned information. To overcome these challenges, we present Bi-TTA, a novel expert knowledge-based Bidirectional Test-Time Adapter framework. Specifically, leveraging two expert-knowledge priors for providing self-supervision, our Bi-TTA primarily comprises two modules: a prospective adaptation (PA) module using sharpness-aware minimization to eliminate domain-irrelevant noise, enhancing the stability and efficacy during the adaptation process, and a retrospective stabilization (RS) module to dynamically reinforce crucial learned model parameters, averting performance degradation caused by overfitting or catastrophic forgetting. To this end, we established a large-scale benchmark for rPPG tasks under TTA protocol. The experimental results demonstrate the significant superiority of our approach over the state-of-the-art.
... The first pair of videos (one in the visible spectrum and its paired thermal counterpart) was captured after the subject had been resting for at least 5 min, while the second pair followed moderate exercise in the form of climbing stairs to elevate their HR values, resulting in a total of 408 60-s videos. As for video length, several studies have highlighted the feasibility of estimating the PPG signal, leading to successful HR and BP estimation (among other parameters), from video lengths ranging from 5 [27] to 60 s [28,29]. ...
Article
Full-text available
In recent years, the estimation of biometric parameters from facial visuals, including images and videos, has emerged as a prominent area of research. However, the robustness of deep learning-based models is challenged, particularly in the presence of changing illumination conditions. To overcome these limitations and unlock new opportunities, thermal imagery has arisen as a viable alternative. Nevertheless, the limited availability of datasets containing thermal data and the small amount of annotations on them limits the exploration of this spectrum. Motivated by this gap, this paper introduces the Label-EURECOM Visible and Thermal (LVT) Face Dataset for face biometrics. This pioneering dataset includes paired visible and thermal images and videos from 52 subjects along with metadata of 22 soft biometrics and health parameters. Due to the reduced number of existing datasets in this domain, the LVT Face Dataset aims to facilitate further research and advancements in the utilization of thermal imagery for diverse eHealth applications and soft biometric estimation. Moreover, we present the first comparative study between visible and thermal spectra as input images for soft biometric estimation, namely gender, age, and weight, from face images on our collected dataset.
... Deep learning-based methods [24,31,39,42,48,63] have surpassed conventional techniques, achieving state-of-the-art performance in estimating vital signs from facial videos. Some combine traditional methods with Convolutional Neural Networks (CNNs) to leverage advanced features [34,36,47]. Other recent works [24,29] explore unsupervised approaches using meta-learning, enhancing generalization in out-of-distribution cases. ...
Preprint
Full-text available
In recent years, deep learning methods have shown impressive results for camera-based remote physiological signal estimation, clearly surpassing traditional methods. However, the performance and generalization ability of Deep Neural Networks heavily depends on rich training data truly representing different factors of variation encountered in real applications. Unfortunately, many current remote photoplethysmography (rPPG) datasets lack diversity, particularly in darker skin tones, leading to biased performance of existing rPPG approaches. To mitigate this bias, we introduce PhysFlow, a novel method for augmenting skin diversity in remote heart rate estimation using conditional normalizing flows. PhysFlow adopts end-to-end training optimization, enabling simultaneous training of supervised rPPG approaches on both original and generated data. Additionally, we condition our model using CIELAB color space skin features directly extracted from the facial videos without the need for skin-tone labels. We validate PhysFlow on publicly available datasets, UCLA-rPPG and MMPD, demonstrating reduced heart rate error, particularly in dark skin tones. Furthermore, we demonstrate its versatility and adaptability across different data-driven rPPG methods.
... Therefore, we convert the human HR range (i.e., 40 to 250 bpm) into the corresponding frequency range f of [0.66, 4.16] Hz and then utilize the simplified formulation from [20] to synthesize s by, ...
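The "simplified formulation from [20]" is elided in the snippet above, so it is left as-is. As a labeled assumption, the sketch below shows only the bpm-to-Hz conversion (f = HR/60, mapping 40-250 bpm onto roughly [0.66, 4.16] Hz) and a plain sinusoid at that frequency standing in for the synthesized signal s.

```python
import numpy as np

def synth_pulse(hr_bpm, fs=30.0, seconds=10.0):
    """Synthesize a pseudo rPPG signal at a given heart rate.

    A plain sinusoid is an assumption here; the cited formulation [20]
    is not reproduced in the snippet.
    """
    assert 40.0 <= hr_bpm <= 250.0        # maps to f in about [0.66, 4.16] Hz
    f = hr_bpm / 60.0                     # beats per minute -> Hz
    t = np.arange(int(seconds * fs)) / fs
    return np.sin(2.0 * np.pi * f * t)
```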
Preprint
Full-text available
Many remote photoplethysmography (rPPG) estimation models have achieved promising performance on the training domain but often fail to measure the physiological signals or heart rates (HR) on test domains. Domain generalization (DG) or domain adaptation (DA) techniques are therefore adopted in the offline training stage to adapt the model to the unobserved or observed test domain by referring to all the available source domain data. However, in rPPG estimation problems, the adapted model usually confronts challenges of estimating target data with various domain information, such as different video capturing settings, individuals of different age ranges, or of different HR distributions. In contrast, Test-Time Adaptation (TTA), by online adapting to unlabeled target data without referring to any source data, enables the model to adaptively estimate rPPG signals of various unseen domains. In this paper, we first propose a novel TTA-rPPG benchmark, which encompasses various domain information and HR distributions, to simulate the challenges encountered in rPPG estimation. Next, we propose a novel synthetic signal-guided rPPG estimation framework with a two-fold purpose. First, we design an effective spectral-based entropy minimization to enforce the rPPG model to learn new target domain information. Second, we develop a synthetic signal-guided feature learning, by synthesizing pseudo rPPG signals as pseudo ground-truths to guide a conditional generator to generate latent rPPG features. The synthesized rPPG signals and the generated rPPG features are used to guide the rPPG model to broadly cover various HR distributions. Our extensive experiments on the TTA-rPPG benchmark show that the proposed method achieves superior performance and outperforms previous DG and DA methods across most protocols of the proposed TTA-rPPG benchmark.
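The spectral-based entropy minimization described above can be illustrated compactly: treat the in-band power spectrum of the predicted rPPG as a probability distribution and minimize its entropy, which pushes power into a single dominant frequency peak. The PyTorch sketch below is an assumption-laden illustration, not the authors' exact loss; the band edges follow the HR range quoted in the neighboring snippet.

```python
import torch

def spectral_entropy_loss(pred_rppg, fs, f_lo=0.66, f_hi=4.16):
    """Entropy of the normalized in-band power spectrum.

    pred_rppg: tensor of shape (batch, T). Lower loss = sharper spectral peak.
    """
    x = pred_rppg - pred_rppg.mean(-1, keepdim=True)
    psd = torch.fft.rfft(x).abs() ** 2
    freqs = torch.fft.rfftfreq(pred_rppg.shape[-1], d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)       # plausible HR band
    p = psd[:, band]
    p = p / (p.sum(-1, keepdim=True) + 1e-8)       # spectrum as a distribution
    return -(p * torch.log(p + 1e-8)).sum(-1).mean()
```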
... Before the advent of DL, several ML methods were used to remotely estimate HR, including linear regression, k-nearest neighbor (kNN) classifier, support-vector regression, adaptive hidden Markov models, and a general-to-specific transfer learning strategy named SynRhythm (Hsu et al., 2014;Monkaresi et al., 2014;Fan et al., 2015;Niu et al., 2018). As with many CV and signal processing applications, DL methods have shown promise in mapping complex physiological processes for contactless HR measurement. ...
Article
Full-text available
In recent decades, there has been ongoing development in the application of computer vision (CV) in the medical field. As conventional contact-based physiological measurement techniques often restrict a patient’s mobility in the clinical environment, the ability to achieve continuous, comfortable and convenient monitoring is thus a topic of interest to researchers. One type of CV application is remote imaging photoplethysmography (rPPG), which can predict vital signs using a video or image. While contactless physiological measurement techniques have an excellent application prospect, the lack of uniformity or standardization of contactless vital monitoring methods limits their application in remote healthcare/telehealth settings. Several methods have been developed to improve this limitation and solve the heterogeneity of video signals caused by movement, lighting, and equipment. The fundamental algorithms include traditional algorithms with optimization and developing deep learning (DL) algorithms. This article aims to provide an in-depth review of current Artificial Intelligence (AI) methods using CV and DL in contactless physiological measurement and a comprehensive summary of the latest development of contactless measurement techniques for skin perfusion, respiratory rate, blood oxygen saturation, heart rate, heart rate variability, and blood pressure.
... We first conduct intra-dataset HR estimation on four datasets. As shown in Table 1, we compare the proposed VL-phys with three traditional methods (POS [12], CHROM [14] and Green [3]); eight DNN-based supervised methods (SynRhythm [48], Meta-rppg [49], PulseGan [50], Dual-Gan [5], Physformer [6], Du et al. [7], Dual-TL [51], and ND-DeeprPPG [52]); four DNN-based self-supervised methods (Gideon et al. [22], Contrast-Phys+ [23], Yue et al. [21], and SiNC [53]). First, it is evident that traditional methods exhibit poor performance across four datasets. ...
Preprint
Facial video-based remote physiological measurement is a promising research area for detecting human vital signs (e.g., heart rate, respiration frequency) in a non-contact way. Conventional approaches are mostly supervised learning, requiring extensive collections of facial videos and synchronously recorded photoplethysmography (PPG) signals. To tackle it, self-supervised learning has recently gained attention; due to the lack of ground truth PPG signals, its performance is however limited. In this paper, we propose a novel self-supervised framework that successfully integrates the popular vision-language models (VLMs) into the remote physiological measurement task. Given a facial video, we first augment its positive and negative video samples with varying rPPG signal frequencies. Next, we introduce a frequency-oriented vision-text pair generation method by carefully creating contrastive spatio-temporal maps from positive and negative samples and designing proper text prompts to describe their relative ratios of signal frequencies. A pre-trained VLM is employed to extract features for these formed vision-text pairs and estimate rPPG signals thereafter. We develop a series of generative and contrastive learning mechanisms to optimize the VLM, including the text-guided visual map reconstruction task, the vision-text contrastive learning task, and the frequency contrastive and ranking task. Overall, our method for the first time adapts VLMs to digest and align the frequency-related knowledge in vision and text modalities. Extensive experiments on four benchmark datasets demonstrate that it significantly outperforms state-of-the-art self-supervised methods.
... Since contactless methods are inherently susceptible to noise such as illumination changes and head movements [24], a spatial-averaging operation is generally performed on the region of interest (face) to enhance the quality of the extracted signal. Niu et al. [52] proposed an rPPG-based spatial-temporal representation, the spatial-temporal map (STMap), that is widely used for HR estimation as well as face anti-spoofing [39, 52, 65-68]. The STMap, a low-dimensional spatial-temporal representation in which physiological information of the original video is embedded, can be directly fed into a CNN, which learns and develops a function for mapping a connection between the STMap and the output vital sign. ...
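A minimal sketch of the STMap construction credited to Niu et al. [52], under the assumption of fixed rectangular ROIs: spatially average each ROI per frame and channel, stack the traces into a (ROIs x time x channels) map, and min-max scale each row. Normalization details vary across papers.

```python
import numpy as np

def build_stmap(video, rois):
    """Build a simple STMap from a facial clip.

    video: array of shape (T, H, W, 3).
    rois: list of (y0, y1, x0, x1) boxes; fixed rectangles are an assumption.
    Returns an STMap of shape (num_rois, T, 3).
    """
    # Spatial averaging inside each ROI suppresses pixel-level noise.
    traces = [video[:, y0:y1, x0:x1].mean(axis=(1, 2)) for y0, y1, x0, x1 in rois]
    stmap = np.stack(traces, axis=0)              # (R, T, 3)
    mn = stmap.min(axis=1, keepdims=True)
    mx = stmap.max(axis=1, keepdims=True)
    return (stmap - mn) / (mx - mn + 1e-8)        # scale each row to [0, 1]
```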
Article
Full-text available
Blood oxygen saturation (SpO2) is an essential physiological parameter for evaluating a person’s health. While conventional SpO2 measurement devices like pulse oximeters require skin contact, advanced computer vision technology can enable remote SpO2 monitoring through a regular camera without skin contact. In this paper, we propose novel deep learning models to measure SpO2 remotely from facial videos and evaluate them using a public benchmark database, VIPL-HR. We utilize a spatial–temporal representation to encode SpO2 information recorded by conventional RGB cameras and directly pass it into selected convolutional neural networks to predict SpO2. The best deep learning model achieves 1.274% in mean absolute error and 1.71% in root mean squared error, which exceed the international standard of 4% for an approved pulse oximeter. Our results significantly outperform the conventional analytical Ratio of Ratios model for contactless SpO2 measurement. Results of sensitivity analyses of the influence of spatial–temporal representation color spaces, subject scenarios, acquisition devices, and SpO2 ranges on the model performance are reported with explainability analyses to provide more insights for this emerging research field.
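For contrast with the deep models above, the conventional analytical Ratio-of-Ratios baseline reduces to one line of arithmetic. In this sketch the coefficients are generic textbook values (real oximeters calibrate them empirically), and estimating the AC/DC ratio as std/mean of each channel trace is a simplification.

```python
import numpy as np

def ratio_of_ratios_spo2(red, second_channel, a=110.0, b=25.0):
    """Classic Ratio-of-Ratios SpO2 estimate (illustrative coefficients).

    red, second_channel: 1-D intensity traces for two wavelength channels
    (e.g., red and infrared in contact oximetry). Returns SpO2 in percent.
    """
    ror = ((red.std() / red.mean()) /
           (second_channel.std() / second_channel.mean()))  # AC/DC per channel
    return a - b * ror
```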
Article
Remote photoplethysmography (rPPG) offers significant potential for health monitoring and emotional analysis through non-contact physiological measurement from facial videos. However, noise remains a crucial challenge, limiting the generalizability of current rPPG methods. This paper introduces Diffusion-Phys, a novel framework using diffusion models for robust heart rate (HR) estimation from facial videos. Diffusion-Phys employs Multi-scale Spatial-Temporal Maps (MSTmaps) to preprocess input data and introduces Gaussian noise to simulate real-world conditions. The model is trained using a denoising network for accurate HR estimation. Experimental evaluations on the VIPL-HR, UBFC-rPPG and PURE datasets demonstrate that Diffusion-Phys achieves comparable or superior performance to state-of-the-art methods, with lower computational complexity. These results highlight the effectiveness of explicitly addressing noise through diffusion modeling, improving the reliability and generalization of non-contact physiological measurement systems.
Preprint
Remote photoplethysmography (rPPG) technology infers heart rate by capturing subtle color changes in facial skin using a camera, demonstrating great potential in non-contact heart rate measurement. However, measurement accuracy significantly decreases in complex scenarios such as lighting changes and head movements compared to ideal laboratory conditions. Existing deep learning models often neglect the quantification of measurement uncertainty, limiting their credibility in dynamic scenes. To address the issue of insufficient rPPG measurement reliability in complex scenarios, this paper introduces Bayesian neural networks to the rPPG field for the first time, proposing the Robust Fusion Bayesian Physiological Network (RF-BayesPhysNet), which can model both aleatoric and epistemic uncertainty. It leverages variational inference to balance accuracy and computational efficiency. Due to the current lack of uncertainty estimation metrics in the rPPG field, this paper also proposes a new set of methods, using the Spearman correlation coefficient, prediction interval coverage, and confidence interval width, to measure the effectiveness of uncertainty estimation methods under different noise conditions. Experiments show that the model, with only double the parameters compared to traditional network models, achieves an MAE of 2.56 on the UBFC-RPPG dataset, surpassing most models. It demonstrates good uncertainty estimation capability in no-noise and low-noise conditions, providing prediction confidence and significantly enhancing robustness in real-world applications. We have open-sourced the code at https://github.com/AIDC-rPPG/RF-Net
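The three uncertainty measures named in the abstract are straightforward to compute. The sketch below assumes Gaussian predictive intervals (z = 1.96 for 95%) and arrays of ground-truth HR, predicted mean HR, and predicted standard deviation; the exact protocol in the paper may differ.

```python
import numpy as np
from scipy.stats import spearmanr

def uncertainty_metrics(hr_true, hr_mean, hr_std, z=1.96):
    """Return (Spearman rho, interval coverage, mean interval width).

    (1) Spearman correlation between |error| and predicted std: does the
        model's uncertainty track its actual error?
    (2) Prediction interval coverage probability (PICP).
    (3) Mean confidence interval width.
    """
    err = np.abs(hr_true - hr_mean)
    rho, _ = spearmanr(err, hr_std)
    lo, hi = hr_mean - z * hr_std, hr_mean + z * hr_std
    picp = np.mean((hr_true >= lo) & (hr_true <= hi))
    width = np.mean(hi - lo)
    return rho, picp, width
```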
Article
Full-text available
Remote photoplethysmography (rPPG) estimation has made considerable progress by leveraging deep learning, yet its performance remains highly susceptible to the domain shifts caused by lighting, skin tone and/or movement, particularly during inference. Additionally, continuous adaptation across multiple domains is essential due to dynamic environmental changes such as lighting transitions or continuous motions. Domain adaptation has been widely investigated, mostly focusing on classification tasks with labels. Prior art addressing domain shifts in rPPG estimation, a regression task, relies on labeled data, requires pre-training on target domains, and focuses on single-domain test-time adaptation (TTA). However, challenges remain in applying TTA to rPPG estimation in real-world scenarios, such as the absence of labels during inference, continuous adaptation over multiple domains, and catastrophic forgetting when re-adapting to the source domain. In this work, we recast the rPPG TTA problem as continual learning and propose an efficient continual TTA method that mitigates significant domain shifts in target domains without labels by leveraging a non-contrastive unsupervised learning loss while selectively updating only the batch normalization layers, and alleviates catastrophic forgetting in the source domain by adopting learning without forgetting (LwF) regularization in the frequency domain. Our method without target labels consistently yielded performance improvements in challenging continual adaptation scenarios, including adapting to multiple new domain datasets over several cycles. This approach not only mitigates catastrophic forgetting in the source domain, but also ensures robust performance across different domains.
Preprint
Camera-based monitoring of vital signs, also known as imaging photoplethysmography (iPPG), has seen applications in driver-monitoring, perfusion assessment in surgical settings, affective computing, and more. iPPG involves sensing the underlying cardiac pulse from video of the skin and estimating vital signs such as the heart rate or a full pulse waveform. Some previous iPPG methods impose model-based sparse priors on the pulse signals and use iterative optimization for pulse wave recovery, while others use end-to-end black-box deep learning methods. In contrast, we introduce methods that combine signal processing and deep learning methods in an inverse problem framework. Our methods estimate the underlying pulse signal and heart rate from facial video by learning deep-network-based denoising operators that leverage deep algorithm unfolding and deep equilibrium models. Experiments show that our methods can denoise an acquired signal from the face and infer the correct underlying pulse rate, achieving state-of-the-art heart rate estimation performance on well-known benchmarks, all with less than one-fifth the number of learnable parameters as the closest competing method.
Article
Drowsy driving is a major contributor to traffic accidents, making real-time monitoring of driver drowsiness essential for effective preventive measures. This paper presents a novel method for detecting driver drowsiness through facial video analysis and non-contact heart rate measurement. To address the challenges posed by varying lighting conditions, the algorithm integrates RGB (red, green, and blue) and multi-scale reinforced image color space techniques. This combination enhances the robustness of heart rate signal extraction by generating spatio-temporal maps that minimize the impact of low light. A convolutional neural network is used to accurately map these spatio-temporal features to their corresponding heart rate values. To provide a comprehensive assessment of drowsiness, a differential thresholding method is utilized to extract heart rate variability information. Building on this data, a dynamic drowsiness assessment model is developed using long short-term memory networks. Evaluation results on the corresponding dataset demonstrate a high accuracy rate of 95.1%, underscoring the method’s robustness, which means it can greatly enhance the reliability of drowsiness detection systems, ultimately contributing to a reduction in traffic accidents caused by driver fatigue.
Article
Full-text available
Over the past decade, Remote Photoplethysmography (rPPG) has emerged as an unobtrusive alternative to wearable sensors. Despite advancements, its real-world scalability remains limited due to the absence of standardized benchmark methods for validation. This lack of standardization complicates proper comparisons between different approaches, creating inconsistencies in performance evaluation. To address this, we conducted a comprehensive review of recent rPPG methods, analyzing their pre- and post-processing algorithms, validation procedures, benchmark algorithms, datasets, evaluation metrics, data segmentation, and reported results. Our findings demonstrate significant variability in the reported Mean Absolute Error (MAE) of benchmark rPPG methods applied to the same public datasets, confirming the challenge of inconsistent evaluation. By examining the original implementations of established benchmark methods, we developed a flexible framework that optimally selects pre- and post-processing algorithms through an exhaustive search. Applying this framework to benchmark algorithms across three public datasets, we found that 80% of the refined methods ranked within the top 25th percentile in MAE, RMSE, and PCC, with 60% surpassing the highest reported accuracies. These refined methods provide a more rigorous foundation for evaluating novel rPPG techniques, addressing the standardization gap in the field. The codebase for this framework (frPPG) is available at [https://github.com/Building-Robotics-Lab/flexible_rPPG], offering a valuable tool for designing and benchmarking rPPG methods against the best-performing algorithms on a given dataset.
Article
Full-text available
Remote photoplethysmography (rPPG) has attracted growing attention due to its non-contact nature. However, existing non-contact heart rate detection methods are often affected by noise from motion artifacts and changes in lighting, which can lead to a decrease in detection accuracy. To solve this problem, this paper initially employs manual extraction to precisely define the facial Region of Interest (ROI), expanding the facial area while avoiding rigid regions such as the eyes and mouth to minimize the impact of motion artifacts. Additionally, during the training phase, illumination normalization is employed on video frames with uneven lighting to mitigate noise caused by lighting fluctuations. Finally, this paper introduces a 3D convolutional neural network (CNN) method incorporating an attention mechanism for heart rate detection from facial videos. We optimize the traditional 3D-CNN to capture global features in spatiotemporal data more effectively. The SimAM attention mechanism is introduced to enable the model to precisely focus on and enhance facial ROI feature representations. Following the extraction of rPPG signals, a heart rate estimation network using a bidirectional long short-term memory (BiLSTM) model is employed to derive the heart rate from the signals. The method introduced here is experimentally validated on two publicly available datasets, UBFC-rPPG and PURE. The mean absolute errors were 0.24 bpm and 0.65 bpm, the root mean square errors were 0.63 bpm and 1.30 bpm, and the Pearson correlation coefficients reached 0.99, confirming the method’s reliability. Comparisons of predicted signals with ground truth signals further validated its accuracy.
Preprint
Full-text available
Remote physiological signal measurement based on facial videos, also known as remote photoplethysmography (rPPG), involves predicting changes in facial vascular blood flow from facial videos. While most deep learning-based methods have achieved good results, they often struggle to balance performance across small and large-scale datasets due to the inherent limitations of convolutional neural networks (CNNs) and Transformers. In this paper, we introduce VidFormer, a novel end-to-end framework that integrates 3-Dimensional Convolutional Neural Network (3DCNN) and Transformer models for rPPG tasks. Initially, we conduct an analysis of the traditional skin reflection model and subsequently introduce an enhanced model for the reconstruction of rPPG signals. Based on this improved model, VidFormer utilizes 3DCNN and Transformer to extract local and global features from input data, respectively. To enhance the spatiotemporal feature extraction capabilities of VidFormer, we incorporate temporal-spatial attention mechanisms tailored for both 3DCNN and Transformer. Additionally, we design a module to facilitate information exchange and fusion between the 3DCNN and Transformer. Our evaluation on five publicly available datasets demonstrates that VidFormer outperforms current state-of-the-art (SOTA) methods. Finally, we discuss the essential roles of each VidFormer module and examine the effects of ethnicity, makeup, and exercise on its performance.
Article
Full-text available
Remote photo-plethysmography (rPPG) is a useful camera-based health monitoring method that can measure the heart rhythm from facial videos. Many well-established deep learning models can provide highly accurate and robust results in measuring heart rate (HR) and heart rate variability (HRV). However, these methods are unable to effectively eliminate illumination variation and motion artifact disturbances, and their substantial computational resource requirements significantly limit their applicability in real-world scenarios. Hence, we propose a lightweight multi-frequency network named MFF-Net to measure heart rhythm via facial videos in a short time. Firstly, we propose a multi-frequency mode signal fusion (MFF) mechanism, which can separate the characteristics of different modes of the original rPPG signals and send them to a processor with independent parameters, helping the network recover blood volume pulse (BVP) signals accurately under a complex noise environment. In addition, in order to help the network extract the characteristics of different modal signals effectively, we designed a temporal multiscale convolution module (TMSC-module) and a spectrum self-attention module (SSA-module). The TMSC-module can expand the receptive field of the signal-refining network, obtain more abundant multiscale information, and transmit it to the signal reconstruction network. The SSA-module can help the signal reconstruction network locate the obvious inferior parts in the reconstruction process so as to make better decisions when merging multi-dimensional signals. Finally, in order to solve the over-fitting phenomenon that easily occurs in the network, we propose an over-fitting sampling training scheme to further improve the fitting ability of the network. Comprehensive experiments were conducted on three benchmark datasets, and we estimated HR and HRV based on the BVP signals derived by MFF-Net. Compared with state-of-the-art methods, our approach achieves better performance both on HR and HRV estimation with lower computational burden. We can conclude that the proposed MFF-Net has the opportunity to be applied in many real-world scenarios.
Article
Ubiquitous on-device heart rate sensing is vital for high-stress individuals and chronic patients. Non-contact sensing, compared to contact-based tools, allows for natural user monitoring, potentially enabling more accurate and holistic data collection. However, in open and uncontrolled mobile environments, user movement and lighting introduce noise. Existing methods, such as curve-based or short-range deep learning recognition based on adjacent frames, struggle to strike the optimal balance between real-time performance and accuracy, especially under limited device resources. In this paper, we present UbiHR, a ubiquitous device-based heart rate sensing system. Key to UbiHR is a real-time long-range spatio-temporal model enabling noise-independent heart rate recognition and display on commodity mobile devices, along with a set of mechanisms for prompt and energy-efficient sampling and preprocessing. Diverse experiments and user studies involving four devices, four tasks, and 80 participants demonstrate UbiHR's superior performance, enhancing accuracy by up to 74.2% and reducing latency by 51.2%.
Preprint
Full-text available
Remote physiological measurement estimates heart rate (HR) without contact by analyzing skin color changes from facial videos. This non-intrusive method can be useful in healthcare, border security, and lie detection. These changes are subtle, imperceptible to the naked eye, and easily affected by variations in lighting and motion artifacts from the subject in front of the camera. Traditional methods often use aggressive noise reduction techniques in the presence of real-world noise, which can obscure heart rate information and lead to distorted detection. This study proposes a more effective method for remote heart rate detection that addresses interference by enhancing facial spatial geometric features using the progressive transformation properties of Recursive Spatial Transformation (ReST). It also integrates motion features obtained from optical flow for more stable spatiotemporal feature enhancement. Additionally, we introduce a Multi-Scale Temporal Convolution Module (TDCM) to capture periodic signal changes across different time scales, modeling the periodicity of heart rate signals from various scales to achieve robust recovery of rPPG signals. The entire model has only 30% of the parameters of PhysFormer. Experiments on multiple remote physiological signal measurement datasets demonstrate that the proposed method significantly improves heart rate estimation across various metrics, particularly showing strong robustness in handling videos with severe head movements.
Article
This paper presents a deep-learning-based two-stream network to estimate remote Photoplethysmogram (rPPG) signal and hence derive the heart rate from an RGB facial video. Our proposed network employs temporal modulation blocks (TMBs) to efficiently extract temporal dependencies and spatial attention blocks on a mean frame to learn spatial features. Our TMBs are composed of two sub-blocks that can simultaneously learn overall and channel-wise spatio-temporal features, which are pivotal for the task. Data augmentation in training and multiple redundant estimations for noise removal in testing were also designed to make the training more effective and the inference more robust. Experimental results show that the proposed Temporal Shift-Channel-wise Spatio-Temporal Network (TS-CST Net) has reached competitive and even superior performances among the state-of-the-art methods on four popular datasets, showcasing our network’s learning capability.
Article
Remote photoplethysmography (rPPG) is a non-invasive technique that aims to capture subtle variations in facial pixels caused by changes in blood volume resulting from cardiac activities. Most existing unsupervised methods for rPPG tasks focus on the contrastive learning between samples while neglecting the inherent self-similarity prior in physiological signals. In this paper, we propose a Self-Similarity Prior Distillation (SSPD) framework for unsupervised rPPG estimation, which capitalizes on the intrinsic temporal self-similarity of cardiac activities. Specifically, we first introduce a physical-prior embedded augmentation technique to mitigate the effect of various types of noise. Then, we tailor a self-similarity-aware network to disentangle more reliable self-similar physiological features. Finally, we develop a hierarchical self-distillation paradigm for self-similarity-aware learning and rPPG signal decoupling. Comprehensive experiments demonstrate that the unsupervised SSPD framework achieves comparable or even superior performance compared to the state-of-the-art supervised methods. Meanwhile, SSPD has the lowest inference time and computation cost among end-to-end models.
Article
Objective: Monitoring changes in human heart rate variability (HRV) holds significant importance for protecting life and health. Studies have shown that imaging photoplethysmography (IPPG) based on ordinary color cameras can detect the color change of skin pixels caused by the cardiopulmonary system. Most researchers have employed deep learning IPPG algorithms to extract the blood volume pulse (BVP) signal, analyzing it predominantly through the heart rate (HR). However, this approach often overlooks the inherent intricate time-frequency domain characteristics of the BVP signal, which cannot be comprehensively deduced solely from HR. The analysis of HRV metrics through the BVP signal is imperative. Approach: In this paper, the transformation invariant loss function with distance equilibrium (TIDLE) is applied to IPPG for the first time, and the details of the BVP signal can be recovered better. In detail, TIDLE is tested in four commonly used IPPG deep learning models, namely DeepPhys, EfficientPhys, Physnet, and TS_CAN, and compared with three other loss functions, namely mean absolute error (MAE), mean square error (MSE), and negative Pearson correlation coefficient (NPCC). Main results: The experiments demonstrate that MAE and MSE exhibit suboptimal performance in predicting LF/HF across the four models, achieving a statistic of mean absolute error (MAES) of 25.94% and 34.05%, respectively. In contrast, NPCC and TIDLE yielded more favorable results at 13.51% and 11.35%, respectively. Taking into consideration the morphological characteristics of the BVP signal, on the two optimal models for predicting HRV metrics, namely DeepPhys and TS_CAN, the Pearson coefficients for the BVP signals predicted by TIDLE in comparison to the gold-standard BVP signals achieved values of 0.627 and 0.605, respectively. In contrast, the results based on NPCC were notably lower, at only 0.545 and 0.533, respectively. Significance: This paper contributes significantly to the effective restoration of the morphology and frequency domain characteristics of the BVP signal.
Conference Paper
Full-text available
With the wide application of user authentication based on face recognition, face spoof attacks against face recognition systems are drawing increasing attention. While emerging approaches to face anti-spoofing have been reported in recent years, most are limited to unrealistic intra-database testing scenarios rather than cross-database testing. We propose a robust representation integrating deep texture features and facial motion cues such as eye blinks as countermeasures against presentation attacks such as photos and replays. We learn deep texture features from both aligned facial images and whole frames, and use a frame-difference-based approach for eye-blink detection. A face video clip is classified as live only if it is categorized as live by both cues. Cross-database testing on public-domain face databases shows that the proposed approach significantly outperforms the state-of-the-art.
Conference Paper
Full-text available
Heart rate is an important indicator of a person's physiological state. Recently, several papers have reported methods to measure heart rate remotely from face videos. These methods work well on stationary subjects under well-controlled conditions, but their performance degrades significantly when videos are recorded under more challenging conditions, specifically when subject motion and illumination variations are involved. We propose a framework that utilizes face tracking and Normalized Least Mean Square (NLMS) adaptive filtering to counter these influences. We test our framework on the large, challenging, public MAHNOB-HCI database and demonstrate that our method substantially outperforms all previous methods. We also use our method for long-term heart rate monitoring in a game evaluation scenario and achieve promising results.
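The NLMS filtering step can be sketched as a standard normalized-LMS noise canceller; this is a generic textbook form, not the authors' pipeline, and the reference signal (for instance a background illumination trace), filter order, and step size are all assumptions:

```python
import numpy as np

def nlms_cancel(d, x, order=8, mu=0.5, eps=1e-6):
    """Normalized LMS noise canceller.

    d: observed signal (pulse plus noise); x: noise reference (hypothetical,
    e.g. a background illumination trace). Returns e, the cleaned signal."""
    w = np.zeros(order)
    e = np.zeros(len(d))
    for i in range(order, len(d)):
        u = x[i - order:i][::-1]                # most recent reference samples
        e[i] = d[i] - w @ u                     # subtract the noise estimate
        w += mu * e[i] * u / (u @ u + eps)      # normalized weight update
    return e
```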
Conference Paper
Full-text available
The ability to remotely measure heart rate from videos without requiring any special setup is beneficial to many applications. In recent years, a number of papers on heart rate (HR) measurement from videos have been proposed. However, these methods typically require the human subject to be stationary and for the illumination to be controlled. For methods that do take into account motion and illumination changes, strong assumptions are still made about the environment (e.g. background can be used for illumination rectification). In this paper, we propose an HR measurement method that is robust to motion, illumination changes, and does not require use of an environment's background. We present conditions under which cardiac activity extraction from local regions of the face can be treated as a linear Blind Source Separation problem and propose a simple but robust algorithm for selecting good local regions. The independent HR estimates from multiple local regions are then combined in a majority voting scheme that robustly recovers the HR. We validate our algorithm on a large database of challenging videos.
Conference Paper
Full-text available
Layer-sequential unit-variance (LSUV) initialization, a simple method for weight initialization in deep net learning, is proposed. The method consists of two steps. First, pre-initialize the weights of each convolution or inner-product layer with orthonormal matrices. Second, proceed from the first to the final layer, normalizing the variance of each layer's output to one. Experiments with different activation functions (maxout, the ReLU family, tanh) show that the proposed initialization enables learning of very deep nets that (i) achieve test accuracy better than or equal to standard methods and (ii) train at least as fast as the complex schemes proposed specifically for very deep nets, such as FitNets (Romero et al., 2015) and Highway networks (Srivastava et al., 2015). Performance is evaluated on GoogLeNet, CaffeNet, FitNets, and residual nets, and results at or very close to the state of the art are achieved on the MNIST, CIFAR-10/100, and ImageNet datasets.
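The two steps translate almost directly into code. A minimal PyTorch sketch, assuming a single calibration batch and treating the tolerance and iteration cap as free parameters:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def lsuv_init(model, batch, tol=0.05, max_iters=10):
    """LSUV sketch: orthonormal pre-initialization, then layer-by-layer
    rescaling of the weights until each layer's output has unit variance."""
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            nn.init.orthogonal_(module.weight)                 # step 1: orthonormal init
            captured = {}
            handle = module.register_forward_hook(
                lambda m, i, o, d=captured: d.__setitem__("out", o))
            for _ in range(max_iters):                         # step 2: variance normalization
                model(batch)
                std = captured["out"].std().item()
                if abs(std - 1.0) < tol:
                    break
                module.weight.div_(std)                        # rescale toward unit output variance
            handle.remove()
    return model
```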
Article
Full-text available
Vital signs such as pulse rate and breathing rate are currently measured using contact probes, but non-contact methods for measuring vital signs are desirable both in hospital settings (e.g., in the NICU) and for ubiquitous in-situ health tracking (e.g., on mobile phones and computers with webcams). Recently, camera-based non-contact vital sign monitoring has been shown to be feasible. However, it is challenging for people with darker skin tones, under low lighting conditions, and/or during movement of the individual in front of the camera. In this paper, we propose distancePPG, a new camera-based vital sign estimation algorithm that addresses these challenges. DistancePPG introduces a new method of combining skin-color-change signals from different tracked regions of the face using a weighted average, where the weights depend on the blood perfusion and incident light intensity in each region, to improve the signal-to-noise ratio (SNR) of the camera-based estimate. One of our key contributions is a new automatic method for determining the weights based only on the video recording of the subject. The gains in SNR of the camera-based PPG estimate translate into reduced error in vital sign estimation, expanding the scope of camera-based vital sign monitoring to potentially challenging scenarios. Further, a dataset will be released comprising synchronized video recordings of the face and pulse-oximeter-based ground-truth recordings from the earlobe, for people with different skin tones, under different lighting conditions, and in various motion scenarios.
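The weighted-average core of distancePPG can be sketched as below; the paper derives its weights from estimated blood perfusion and incident light intensity, whereas this simplified stand-in uses in-band spectral energy as the quality proxy, so the weighting rule, pass band, and function name are assumptions:

```python
import numpy as np

def combine_regions(region_signals, fs, band=(0.7, 4.0)):
    """Fuse per-region pulse traces with quality-dependent weights.

    region_signals: (n_regions, T) color-change traces from tracked face regions.
    Weights here are a simple proxy: fraction of spectral power in the cardiac band."""
    region_signals = np.asarray(region_signals, dtype=float)
    freqs = np.fft.rfftfreq(region_signals.shape[1], d=1.0 / fs)
    spectra = np.abs(np.fft.rfft(region_signals, axis=1)) ** 2
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    snr = spectra[:, in_band].sum(axis=1) / (spectra.sum(axis=1) + 1e-12)
    weights = snr / snr.sum()
    return weights @ region_signals           # (T,) fused pulse signal
```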
Article
Full-text available
Remote photoplethysmography (rPPG) techniques can measure cardiac activity by detecting pulse-induced colour variations on human skin using an RGB camera. State-of-the-art rPPG methods are sensitive to subject body motion (e.g., motion-induced colour distortions). This study proposes a novel framework to improve the motion robustness of rPPG. The basic idea originates from the observation that a camera can simultaneously sample multiple skin regions in parallel, each of which can be treated as an independent sensor for pulse measurement. The spatial redundancy of an image sensor can thus be exploited to distinguish the pulse signal from motion-induced noise. To this end, pixel-based rPPG sensors are constructed to estimate a robust pulse signal using motion-compensated pixel-to-pixel pulse extraction, spatial pruning, and temporal filtering. The evaluation of this strategy is based not on a full clinical trial but on 36 challenging benchmark videos of subjects differing in gender, skin type, and performed motion category. Experimental results show that the proposed method improves the SNR of the state-of-the-art rPPG technique from 3.34 dB to 6.76 dB, and the agreement (±1.96σ) with the instantaneous reference pulse rate from 55% to 80% correct. ANOVA with post-hoc comparison shows that the improvement in motion robustness is significant. The rPPG method developed in this study performs very close to a contact-based sensor under realistic conditions, while its computational efficiency allows real-time processing on an off-the-shelf computer.
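A much-simplified sketch of the spatial-pruning idea follows (the paper's motion-compensated pixel-to-pixel extraction is omitted); it assumes per-pixel pulse traces are already available, and the keep fraction and SNR proxy are illustrative:

```python
import numpy as np

def prune_and_fuse(pixel_pulses, fs, band=(0.7, 4.0), keep=0.5):
    """Treat each pixel's pulse trace as an independent sensor, keep the most
    periodic fraction, and average the survivors into one robust pulse signal.

    pixel_pulses: (n_pixels, T) pulse traces extracted per pixel."""
    freqs = np.fft.rfftfreq(pixel_pulses.shape[1], d=1.0 / fs)
    power = np.abs(np.fft.rfft(pixel_pulses, axis=1)) ** 2
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    snr = power[:, in_band].max(axis=1) / (power.sum(axis=1) + 1e-12)  # peakiness proxy
    order = np.argsort(snr)[::-1]
    survivors = order[: max(1, int(keep * len(order)))]
    return pixel_pulses[survivors].mean(axis=0)
```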
Article
Full-text available
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection spanning hundreds of object categories and millions of images. The challenge has been run annually from 2010 to the present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground-truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge and propose future directions and improvements.
Article
Full-text available
We present a simple, low-cost method for measuring multiple physiological parameters using a basic webcam. By applying independent component analysis on the color channels in video recordings, we extracted the blood volume pulse from the facial regions. Heart rate (HR), respiratory rate, and HR variability (HRV, an index for cardiac autonomic activity) were subsequently quantified and compared to corresponding measurements using Food and Drug Administration-approved sensors. High degrees of agreement were achieved between the measurements across all physiological parameters. This technology has significant potential for advancing personal health care and telemedicine.
Article
Full-text available
Remote measurements of the cardiac pulse can provide comfortable physiological assessment without electrodes. However, attempts so far are non-automated, susceptible to motion artifacts and typically expensive. In this paper, we introduce a new methodology that overcomes these problems. This novel approach can be applied to color video recordings of the human face and is based on automatic face tracking along with blind source separation of the color channels into independent components. Using Bland-Altman and correlation analysis, we compared the cardiac pulse rate extracted from videos recorded by a basic webcam to an FDA-approved finger blood volume pulse (BVP) sensor and achieved high accuracy and correlation even in the presence of movement artifacts. Furthermore, we applied this technique to perform heart rate measurements from three participants simultaneously. This is the first demonstration of a low-cost accurate video-based method for contact-free heart rate measurements that is automated, motion-tolerant and capable of performing concomitant measurements on more than one person at a time.
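A hedged sketch of this ICA-based recovery, using scikit-learn's FastICA on spatially averaged RGB traces and selecting the source with the strongest spectral peak in the cardiac band (the selection rule and band limits are assumptions, not the paper's exact procedure):

```python
import numpy as np
from sklearn.decomposition import FastICA

def ica_pulse(rgb_traces, fs):
    """Separate (T, 3) spatially averaged R/G/B traces into independent sources
    and pick the one with the most prominent cardiac-band peak."""
    x = (rgb_traces - rgb_traces.mean(axis=0)) / (rgb_traces.std(axis=0) + 1e-8)
    sources = FastICA(n_components=3, random_state=0).fit_transform(x)  # (T, 3)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band = (freqs >= 0.75) & (freqs <= 4.0)            # roughly 45 to 240 bpm
    power = np.abs(np.fft.rfft(sources, axis=0)) ** 2
    best = power[band].max(axis=0).argmax()            # most peaked source in band
    hr_bpm = 60.0 * freqs[band][power[band, best].argmax()]
    return sources[:, best], hr_bpm
```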
Article
Full-text available
Plethysmographic signals were measured remotely (>1 m) using ambient light and a simple consumer-level digital camera in movie mode. Heart and respiration rates could be quantified up to several harmonics. Although the green channel features the strongest plethysmographic signal, corresponding to an absorption peak of (oxy-)hemoglobin, the red and blue channels also contain plethysmographic information. The results show that ambient-light photoplethysmography may be useful for medical purposes such as the characterization of vascular skin lesions (e.g., port-wine stains) and remote sensing of vital signs (e.g., heart and respiration rates) for triage or sports purposes.
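The single-channel measurement described here reduces to a few lines; the filter order and pass band below are illustrative choices rather than the paper's settings:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def green_channel_hr(frames, fs):
    """Ambient-light PPG from the green channel: spatially average the green
    channel per frame, band-pass, and read heart rate from the spectral peak.

    frames: iterable of (H, W, 3) RGB frames; fs: frame rate in Hz."""
    g = np.array([f[..., 1].mean() for f in frames])
    b, a = butter(3, [0.7 / (fs / 2), 4.0 / (fs / 2)], btype="band")
    pulse = filtfilt(b, a, g - g.mean())
    freqs = np.fft.rfftfreq(len(pulse), d=1.0 / fs)
    return 60.0 * freqs[np.abs(np.fft.rfft(pulse)).argmax()]  # HR in bpm
```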
Conference Paper
Our goal is to reveal temporal variations in videos that are difficult or impossible to see with the naked eye and display them in an indicative manner. Our method, which we call Eulerian Video Magnification, takes a standard video sequence as input and applies spatial decomposition followed by temporal filtering to the frames. The resulting signal is then amplified to reveal hidden information. Using our method, we are able to visualize the flow of blood as it fills the face and also to amplify and reveal small motions. Our technique can run in real time to show phenomena occurring at temporal frequencies selected by the user.
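A minimal Eulerian-style sketch, assuming the video is a (time, height, width, channel) array and omitting the spatial pyramid decomposition the full method applies before temporal filtering; the pass band and amplification factor are placeholders:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def eulerian_magnify(video, fs, low=0.8, high=3.0, alpha=50.0):
    """Temporally band-pass every pixel and add the amplified band back.
    video: (T, H, W, C) array; fs: frame rate in Hz."""
    b, a = butter(2, [low / (fs / 2), high / (fs / 2)], btype="band")
    band = filtfilt(b, a, video.astype(np.float64), axis=0)  # filter along time
    return np.clip(video + alpha * band, 0, 255)             # amplified output
```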
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, considerably better than the previous state of the art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully connected layers, we employed a recently developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Conference Paper
Recent studies in computer vision have shown that, while practically invisible to a human observer, skin color changes due to blood flow can be captured in face videos and, surprisingly, be used to estimate the heart rate (HR). While considerable progress has been made in the last few years, many issues remain open. In particular, state-of-the-art approaches are not robust enough to operate in natural conditions (e.g., in the presence of spontaneous movements, facial expressions, or illumination changes). In contrast to previous approaches that estimate HR by processing all skin pixels inside a fixed region of interest, we introduce a strategy to dynamically select face regions useful for robust HR estimation. Our approach, inspired by recent advances in matrix completion theory, allows us to predict the HR while simultaneously discovering the best regions of the face to use for estimation. A thorough experimental evaluation on public benchmarks shows that the proposed approach significantly outperforms state-of-the-art HR estimation methods in naturalistic conditions.
Conference Paper
With the wide spread of camera-equipped devices in daily living environments, there are enormous opportunities to utilize camera-based remote photoplethysmography (rPPG) for daily physiological monitoring. In camera-based rPPG monitoring, the region of interest (ROI) determines both the signal quality and the computational load of signal extraction, so designating the best ROI on the body while minimizing its size is essential for computationally efficient rPPG extraction. In this study, we densely analyzed the face to find computationally efficient ROIs for facial rPPG extraction. We divided the face into seven regions and evaluated the signal quality of each using the area ratios of high SNR and high correlation, together with the mean and standard deviation (SD) of the SNR and correlation coefficient. The results show that the forehead and both cheeks in particular are good candidates for computationally efficient ROIs, whereas the signal quality from the mouth and chin was relatively low, and the nasion and nose are of limited use as efficient ROIs.
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
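The residual reformulation is easiest to see in code: the stacked layers learn a residual function F(x) and the identity shortcut adds x back. A generic PyTorch sketch of a basic block (not the exact published architecture):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: the body learns F(x); the block outputs F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # identity shortcut
```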
Article
This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
Conference Paper
Remote photoplethysmography (rPPG) enables measuring heart rate from skin color variations recorded with consumer cameras. Recent research has aimed to strengthen the heartbeat-induced color variations using independent component analysis (ICA) or chrominance-based models. In this paper, we approach this emerging problem from a novel angle, proposing a learning-based framework that accommodates multiple temporal features and yields significant and robust improvement. Using support vector regression (SVR) on a published chrominance-based feature improves the root mean square error (RMSE) from 22.7 to 7.31 and the correlation coefficient (CC) from 0.30 to 0.77. With the proposed multiple-feature-fusion and multiple-segment-fusion techniques, we achieve the best estimation result, with an RMSE of 5.48 and a CC of 0.88. The proposed framework can also be extended to other promising features.
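The regression step can be sketched with scikit-learn; the kernel and hyperparameters below are assumptions rather than the paper's settings, and the chrominance feature extraction is assumed done upstream:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def fit_hr_regressor(features, hr_bpm):
    """SVR mapping chrominance-derived features (n_clips, n_features)
    to reference heart rates in bpm; returns the fitted pipeline."""
    model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.5))
    return model.fit(features, hr_bpm)
```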
Conference Paper
We extract heart rate and beat lengths from videos by measuring the subtle head motion caused by the Newtonian reaction to the influx of blood at each beat. Our method tracks features on the head and performs principal component analysis (PCA) to decompose their trajectories into a set of component motions. It then chooses the component that best corresponds to heartbeats based on its temporal frequency spectrum. Finally, we analyze the motion projected onto this component and identify the peaks of the trajectories, which correspond to heartbeats. When evaluated on 18 subjects, our approach reported heart rates nearly identical to those of an electrocardiogram device. Additionally, we were able to capture clinically relevant information about heart rate variability.
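A compact sketch of the pipeline: PCA (via SVD) on tracked feature trajectories, then selection of the component whose spectrum is most concentrated in the cardiac band. Feature tracking is assumed done upstream, and the periodicity criterion is a simplified stand-in for the paper's:

```python
import numpy as np

def head_motion_hr(trajectories, fs, band=(0.75, 2.0)):
    """Estimate HR from head motion.

    trajectories: (T, n_features) vertical positions of tracked head features."""
    x = trajectories - trajectories.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)   # PCA via SVD
    comps = x @ vt.T                                   # component time series
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    power = np.abs(np.fft.rfft(comps, axis=0)) ** 2
    periodicity = power[in_band].max(axis=0) / (power.sum(axis=0) + 1e-12)
    best = periodicity.argmax()                        # most heartbeat-like component
    return 60.0 * freqs[in_band][power[in_band, best].argmax()]
```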
Article
Our goal is to reveal temporal variations in videos that are difficult or impossible to see with the naked eye and display them in an indicative manner. Our method, which we call Eulerian Video Magnification, takes a standard video sequence as input, and applies spatial decomposition, followed by temporal filtering to the frames. The resulting signal is then amplified to reveal hidden information. Using our method, we are able to visualize the flow of blood as it fills the face and also to amplify and reveal small motions. Our technique can run in real time to show phenomena occurring at the temporal frequencies selected by the user.
Article
Remote photoplethysmography (rPPG) enables contactless monitoring of the blood volume pulse using a regular camera. Recent research has focused on improved motion robustness, but the proposed blind source separation (BSS) techniques in RGB color space show limited success. We present an analysis of the motion problem, from which far superior chrominance-based methods emerge. For a population of 117 stationary subjects, we show our methods to perform in 92% good agreement (±1.96σ) with contact PPG, with RMSE and standard deviation both a factor of two better than BSS-based methods. In a fitness setting using a simple spectral peak detector, the obtained pulse rate for modest motion (bike) improves from 79% to 98% correct, and for vigorous motion (stepping) from less than 11% to more than 48% correct. We expect the greatly improved robustness to considerably widen the application scope of the technology.
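The chrominance projection of this method is widely reproduced; a minimal sketch, assuming spatially averaged skin RGB traces and a standard cardiac pass band:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def chrom_pulse(rgb, fs):
    """Chrominance-based pulse extraction: project mean-normalized RGB onto two
    chrominance axes and combine them with an alpha that suppresses
    motion-induced intensity changes. rgb: (T, 3) averaged skin traces."""
    rgb = rgb / rgb.mean(axis=0)                       # mean-normalize each channel
    xs = 3 * rgb[:, 0] - 2 * rgb[:, 1]                 # X chrominance
    ys = 1.5 * rgb[:, 0] + rgb[:, 1] - 1.5 * rgb[:, 2] # Y chrominance
    b, a = butter(3, [0.7 / (fs / 2), 4.0 / (fs / 2)], btype="band")
    xf, yf = filtfilt(b, a, xs), filtfilt(b, a, ys)
    alpha = xf.std() / (yf.std() + 1e-8)
    return xf - alpha * yf                             # pulse signal
```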
Article
MAHNOB-HCI is a multi-modal database recorded in response to affective stimuli for emotion recognition and implicit tagging research. A multi-modal setup was arranged for synchronized recording of face videos, audio signals, eye gaze data, and peripheral/central nervous system physiological signals. 27 participants of both genders and different cultural backgrounds took part in two experiments. In the first experiment, they watched 20 emotional videos and self-reported their felt emotions using arousal, valence, dominance, and predictability as well as emotional keywords. In the second experiment, short videos and images were shown once without any tag and then with correct or incorrect tags, and participants' agreement or disagreement with the displayed tags was assessed. The recorded videos and bodily responses were segmented and stored in a database, which is made available to the academic community via a web-based system. The collected data were analyzed, and single-modality and modality-fusion results for both the emotion recognition and implicit tagging experiments are reported. These results show the potential uses of the recorded modalities and the significance of the emotion elicitation protocol.
Deep learning with time-frequency representation for pulse estimation
  • Gee-Sern Hsu
  • Arulmurugan Ambikapathi
  • M.-S. Chen
Statistical methods for assessing agreement between two methods of clinical measurement
  • J. M. Bland
  • D. G. Altman
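Bland-Altman analysis, used for agreement assessment in several of the papers above, reduces to the bias (mean difference) between two measurement methods and its 95% limits of agreement; a minimal sketch:

```python
import numpy as np

def bland_altman(a, b):
    """Bias and 95% limits of agreement between two paired measurement methods.
    a, b: equal-length arrays of paired measurements (e.g., rPPG vs. contact HR)."""
    diff = np.asarray(a, float) - np.asarray(b, float)
    bias = diff.mean()
    loa = 1.96 * diff.std(ddof=1)          # half-width of the limits of agreement
    return bias, (bias - loa, bias + loa)
```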