Figure 3. Effect of different smoothing approaches on gravity-removed accelerometer data from an intake gesture.

Source publication
Article
Wrist-worn inertial measurement units have emerged as a promising technology to passively capture dietary intake data. State-of-the-art approaches use deep neural networks to process the collected inertial data and detect characteristic hand movements associated with intake gestures. In order to clarify the effects of data preprocessing, sensor mod...

Context in source publication

Context 1
... has not been applied in this context so far. Based on our experiments, a median filter with a window size of five frames outperformed other smoothing methods on our 64 Hz data. The general purpose of smoothing is to remove noise associated with short-term fluctuations in the sensor data (e.g. slight wrist tremor, technical sensor limitations) [30]. Fig. 3 illustrates the effect of different smoothing approaches on gravity-removed accelerometer ...
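To make the smoothing step concrete, the sketch below applies a five-frame median filter to gravity-removed accelerometer data sampled at 64 Hz. It is a minimal illustration using scipy.signal.medfilt; the synthetic signal, array shapes, and axis layout are assumptions, not the authors' exact pipeline.

```python
# Minimal sketch: five-frame median smoothing of gravity-removed accelerometer
# data sampled at 64 Hz. The synthetic signal below stands in for real data.
import numpy as np
from scipy.signal import medfilt

fs = 64                                  # sampling rate (Hz)
t = np.arange(0, 5, 1 / fs)              # 5 seconds of data
# Placeholder for one gravity-removed accelerometer axis with added noise.
acc_x = 0.8 * np.sin(2 * np.pi * 0.5 * t) + 0.05 * np.random.randn(t.size)

# Median filter with a window of five frames (~78 ms at 64 Hz).
acc_x_smoothed = medfilt(acc_x, kernel_size=5)

# For a (n_frames, 3) array of x/y/z axes, smooth each axis independently.
acc = np.stack([acc_x, acc_x, acc_x], axis=1)
acc_smoothed = np.apply_along_axis(medfilt, 0, acc, kernel_size=5)
```

A median filter tends to preserve the sharp transitions of an intake gesture better than a moving average of the same width, which is one common motivation for preferring it when suppressing short-term noise.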

Citations

... Performance comparison between the proposed approach and OREBA baseline methods is shown in Table IV. The baseline methods are provided for two different modalities, inertial and video, from previous studies on the OREBA dataset in [35] and [36], respectively. The total numbers of intake gestures in this table are approximately 935 (OREBA-DIS) and 803 (OREBA-SHA), corresponding to approximately one fifth of the total intakes in each case. ...
Article
Dietary patterns can be the primary reason for many chronic diseases such as diabetes and obesity. State-of-the-art wearable sensor technologies can play a critical role in assisting patients in managing their eating habits by providing meaningful statistics on critical parameters such as the onset, duration, and frequency of eating. For an accurate yet fast food intake recognition, this work presents a novel Machine Learning (ML) based framework that shows promising results by leveraging optimized support vector machine (SVM) classifiers. The SVM classifiers are trained on three comprehensive datasets: OREBA, FIC, and CLEMSON. The developed framework outperforms existing algorithms by achieving F1-scores of 92%, 94%, 95%, and 85% on the OREBA-SHA, OREBA-DIS, FIC, and CLEMSON datasets, respectively. In order to assess the generalization aspects, the proposed SVM framework is also trained on one of the three databases while being tested on the others and achieves acceptable F1-scores in all cases. The proposed algorithm is well suited for real-time applications since inference is made using a few support vector parameters compared to thousands in peer deep neural network models.
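As a rough illustration of the SVM-based pipeline described above, the following sketch trains a binary intake/non-intake SVM on pre-extracted window features with scikit-learn. The feature matrix, labels, and hyperparameters are placeholders; the optimized classifiers and feature engineering of the cited work are not reproduced here.

```python
# Illustrative SVM intake-gesture classifier on hand-crafted window features.
# The random feature matrix stands in for features extracted from inertial
# windows; hyperparameters are placeholders, not the paper's optimized values.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 24))          # placeholder feature matrix
y = rng.integers(0, 2, size=1000)        # 1 = intake, 0 = non-intake

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
print("F1:", f1_score(y_test, clf.predict(X_test)))
```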
... Other approaches attempt to leverage the power of modern wearables, such as smart-watches, by using the inertial sensors (i.e. accelerometer and optionally gyroscope) that are commonly found on such devices, in order to detect and identify eating gestures [15,14,12]. Other approaches rely on the availability of cameras to identify food type [6,5] from single plate photographs, or to directly segment food images into food components [7,9]. It is also possible to estimate food volume using a depth camera [16] and the caloric content based on a photograph using a reference object [11]. ...
... They established a two-stage detection scheme [4] to first identify intake frames and next detect intake gestures. Heydarian et al. [6] adopted this approach and proposed an inertial model that outperformed existing intake gesture detection models on the publicly available, multimodal OREBA-DIS dataset [7]. Rouast et al. [1] also adopted the two-stage detection scheme and compared different deep learning approaches for intake gesture detection using video data from the same dataset. ...
... In this paper, we address this gap by exploring the potential of data fusion in this context. Thereby, we build on the two-stage approach by Kyritsis et al. [4] and consider the outputs of Heydarian et al.'s [6] inertial model and Rouast et al.'s [1] best video model. With respect to fusion, we compare score-level fusion (i.e., fusion of the probability outputs) and decision-level fusion (i.e., fusion of the decision outputs). ...
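The difference between the two fusion levels can be shown in a few lines. The sketch below fuses the frame-level probability outputs of an inertial and a video model, first at score level (max score) and then at decision level; the probability arrays and the 0.5 threshold are illustrative assumptions.

```python
# Score-level vs decision-level fusion of two models' frame-level intake
# probabilities. p_inertial and p_video are assumed to be time-aligned.
import numpy as np

p_inertial = np.array([0.2, 0.7, 0.9, 0.4])   # placeholder probabilities
p_video    = np.array([0.3, 0.6, 0.8, 0.7])

# Score-level fusion (max score): fuse the probabilities, then decide.
p_fused = np.maximum(p_inertial, p_video)
score_level_decision = p_fused >= 0.5

# Decision-level fusion: decide per model, then combine the decisions
# (a logical OR is shown as one simple combination rule).
decision_level_decision = (p_inertial >= 0.5) | (p_video >= 0.5)
```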
... The term intake gesture refers to a hand-to-mouth gesture associated with dietary intake (e.g. raising a spoon, fork, or cup) [6]. Automatic intake gesture detection attempts to classify hand gestures into intake vs non-intake gestures (e.g., touching hair, scratching face) in an eating activity using classification techniques [6]. ...
Article
Recent research has employed deep learning to detect intake gestures from inertial sensor and video camera data. However, the fusion of these modalities has not been attempted. The present research explores the potential of fusing the outputs of two individual deep learning inertial and video intake gesture detection models (i.e., score-level and decision-level fusion) using the test sets from two publicly available multimodal datasets: (1) OREBA-DIS recorded from 100 participants while consuming food served in discrete portions and (2) OREBA-SHA recorded from 102 participants while consuming a communal dish. We first assess the potential of fusion by contrasting the performance of the individual models in intake gesture detection. The assessment shows that fusing the outputs of individual models is more promising on the OREBA-DIS dataset. Subsequently, we conduct experiments using different score-level and decision-level fusion approaches. Our results from fusion show that the score-level fusion approach of max score model performs best of all considered fusion approaches. On the OREBA-DIS dataset, the max score fusion approach (F1 = 0.871) outperforms both individual video (F1 = 0.855) and inertial (F1 = 0.806) models. However, on the OREBA-SHA dataset, the max score fusion approach (F1 = 0.873) fails to outperform the individual inertial model (F1 = 0.895). Pairwise comparisons using bootstrapped samples confirm the statistical significance of these differences in model performance (p<.001).
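The bootstrapped pairwise comparison mentioned at the end of the abstract can be sketched as follows. Resampling over participants and the one-sided p-value shown here are assumptions for illustration; the cited paper's exact procedure may differ.

```python
# Illustrative bootstrap comparison of two models' per-participant F1 scores.
import numpy as np

rng = np.random.default_rng(0)
f1_model_a = rng.uniform(0.80, 0.92, size=100)   # placeholder per-participant F1
f1_model_b = rng.uniform(0.78, 0.90, size=100)

observed_diff = f1_model_a.mean() - f1_model_b.mean()
n_boot = 10_000
diffs = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, len(f1_model_a), size=len(f1_model_a))
    diffs[i] = f1_model_a[idx].mean() - f1_model_b[idx].mean()

# Fraction of bootstrap resamples in which model A does not outperform model B.
p_value = np.mean(diffs <= 0)
print(f"observed diff = {observed_diff:.3f}, p = {p_value:.4f}")
```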
... The two-stage approach introduced by Kyritsis et al. [9] is currently the most advanced approach benchmarked on publicly available datasets for both inertial [9] and video data. Fig. 1 caption: F1 scores for our two-stage and single-stage models in comparison with the state of the art (SOTA). Our single-stage models see relative improvements between 3.3% and 17.7% over our implementations of the SOTA for inertial [10] and video modalities [6], and relative improvements between 1.9% and 6.2% over our own two-stage models for intake detection and eating/drinking detection across the OREBA and Clemson datasets [6]. ...
... Hence, to facilitate a fair comparison, we also train several two-stage models based on 8 second time windows. In particular, we use cross-entropy loss to train two-stage versions of our own architectures outlined in Table I, as well as the architectures proposed in Heydarian et al. [10], Rouast et al. [6], and the adapted version of Kyritsis et al. [9] used in [10]. Note that the latter was originally designed to be trained with additional sub-gesture labels which are not available for the Clemson and OREBA datasets. ...
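To illustrate what training on 8 second time windows involves, the sketch below slices a continuous 64 Hz inertial stream into 8 s windows (512 frames each). The stride and the rule for labelling a window (the label of its centre frame) are assumptions for illustration, not the exact setup of the cited models.

```python
# Slice a continuous 64 Hz inertial stream into 8 s (512-frame) windows.
import numpy as np

fs = 64
window_len = 8 * fs                       # 8 s * 64 Hz = 512 frames
stride = fs                               # hop of 1 s (assumption)

signal = np.random.randn(10_000, 6)       # placeholder: acc + gyro, 6 channels
frame_labels = np.random.randint(0, 2, size=10_000)

windows, labels = [], []
for start in range(0, signal.shape[0] - window_len + 1, stride):
    end = start + window_len
    windows.append(signal[start:end])
    labels.append(frame_labels[start + window_len // 2])   # centre-frame label

windows = np.stack(windows)               # shape: (n_windows, 512, 6)
labels = np.array(labels)
```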
Article
Accurate detection of individual intake gestures is a key step towards automatic dietary monitoring. Both inertial sensor data of wrist movements and video data depicting the upper body have been used for this purpose. The most advanced approaches to date use a two-stage approach, in which (i) frame-level intake probabilities are learned from the sensor data using a deep neural network, and then (ii) sparse intake events are detected by finding the maxima of the frame-level probabilities. In this study, we propose a single-stage approach which directly decodes the probabilities learned from sensor data into sparse intake detections. This is achieved by weakly supervised training using Connectionist Temporal Classification (CTC) loss, and decoding using a novel extended prefix beam search decoding algorithm. Benefits of this approach include (i) end-to-end training for detections, (ii) simplified timing requirements for intake gesture labels, and (iii) improved detection performance compared to existing approaches. Across two separate datasets, we achieve relative F1 score improvements between 1.9% and 6.2% over the two-stage approach for intake detection and eating/drinking detection tasks, for both video and inertial sensors.
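Stage (ii) of the two-stage approach, detecting sparse intake events from the frame-level probabilities, can be sketched with a standard peak finder. The probability frame rate, threshold, and minimum peak distance below are illustrative assumptions, not the exact values used in the cited work.

```python
# Turn frame-level intake probabilities into sparse detections by locating
# local maxima above a threshold, with a minimum spacing between detections.
import numpy as np
from scipy.signal import find_peaks

fs = 8                                    # assumed probability frame rate (Hz)
probs = np.random.rand(600)               # placeholder frame-level probabilities

# Keep maxima above the threshold that are at least 2 s apart.
peak_idx, _ = find_peaks(probs, height=0.9, distance=2 * fs)
detection_times = peak_idx / fs           # detected intake times in seconds
```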
... The two-stage approach introduced by Kyritsis et al. [9] is currently the most advanced approach benchmarked on publicly available datasets for both inertial [9] and video data. Fig. 1 caption: F1 scores for our two-stage and single-stage models in comparison with the current state of the art (SOTA). ...
... F1 scores for our two-stage and single-stage models in comparison with the current state of the art (SOTA). Our single-stage models see relative improvements of 10.2% and 2.6% over the SOTA for inertial [10] and video-based intake detection [6] on the OREBA dataset, and relative improvements between 2.0% and 6.2% over comparable two-stage models for intake detection and eating vs. drinking detection tasks across the OREBA and Clemson datasets [6]. ...
... In the experiments, we compare the proposed single-stage approach to the thresholding approach [4] and the two-stage approach [9] [10]. We consider two datasets of annotated intake gestures: The OREBA dataset [6] and the Clemson Cafeteria dataset [28]. ...
Preprint
Accurate detection of individual intake gestures is a key step towards automatic dietary monitoring. Both inertial sensor data of wrist movements and video data depicting the upper body have been used for this purpose. The most advanced approaches to date use a two-stage approach, in which (i) frame-level intake probabilities are learned from the sensor data using a deep neural network, and then (ii) sparse intake events are detected by finding the maxima of the frame-level probabilities. In this study, we propose a single-stage approach which directly decodes the probabilities learned from sensor data into sparse intake detections. This is achieved by weakly supervised training using Connectionist Temporal Classification (CTC) loss, and decoding using a novel extended prefix beam search decoding algorithm. Benefits of this approach include (i) end-to-end training for detections, (ii) consistency with the fuzzy nature of intake gestures, and (iii) avoidance of hard-coded rules. Across two separate datasets, we quantify these benefits by showing relative $F_1$ score improvements between 2.0% and 6.2% over the two-stage approach for intake detection and eating vs. drinking recognition tasks, for both video and inertial sensors.
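The weakly supervised CTC training referred to above can be sketched in PyTorch as follows. The tensor shapes, the two-class vocabulary (blank plus intake), and the target sequences are assumptions; the paper's extended prefix beam search decoder is its own contribution and is not reproduced here.

```python
# Minimal sketch of CTC training on frame-level model outputs: only the
# ordered sequence of intake events per example is supervised, not their
# precise timing. Shapes and labels are placeholders.
import torch
import torch.nn as nn

batch, frames, num_classes = 4, 512, 2                # classes: blank + intake
logits = torch.randn(frames, batch, num_classes, requires_grad=True)
log_probs = logits.log_softmax(dim=-1)                # (T, N, C) as CTCLoss expects

targets = torch.tensor([1, 1, 1, 1, 1, 1])            # concatenated event sequences
target_lengths = torch.tensor([2, 1, 2, 1])           # number of events per example
input_lengths = torch.full((batch,), frames, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```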
... The OREBA dataset includes (i) raw sensor data without any processing for left and right hand (e.g., <id>_inertial_raw.csv), and (ii) processed sensor data for dominant and non-dominant eating hand (e.g., <id>_inertial_processed.csv). Raw data is included since a recent study on OREBA indicates that data preprocessing only marginally improves results when combined with deep learning [20]. Processed data is generated from the raw data according to the following steps: ...
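For orientation, a minimal way to open the files named above with pandas is sketched below. Only the file-name pattern comes from the quoted text; the participant id is hypothetical and the column names are left to inspection rather than assumed.

```python
# Load one participant's OREBA inertial files; inspect columns rather than
# assuming a schema, since the exact channel names are not given here.
import pandas as pd

participant_id = "1001"                               # hypothetical id
raw = pd.read_csv(f"{participant_id}_inertial_raw.csv")
processed = pd.read_csv(f"{participant_id}_inertial_processed.csv")

print(raw.columns.tolist())                           # actual channel names
print(processed.head())
```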
... For each modality, we use one simple CNN and one more complex model proposed in previous studies [20], [24]. As listed in Table 4, this results in a total of eight baseline models, considering the different scenarios, modalities, and models. ...
... The inertial models are taken from a recent study on OREBA by Heydarian et al. [20]. We compare the simple CNN with the more complex CNN-LSTM proposed in the aforementioned work. ...
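As a rough idea of what a simple CNN baseline for inertial windows can look like, a small 1D-CNN is sketched below. Layer sizes, kernel widths, and the assumed input of 6 channels by 512 frames are illustrative and do not reproduce the exact architectures from the cited studies.

```python
# Illustrative 1D-CNN for classifying an inertial window as intake/non-intake.
import torch
import torch.nn as nn

class SimpleInertialCNN(nn.Module):
    def __init__(self, in_channels: int = 6, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).squeeze(-1))

model = SimpleInertialCNN()
windows = torch.randn(8, 6, 512)          # batch of 8 s windows at 64 Hz
logits = model(windows)                   # shape: (8, 2)
```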
Preprint
Automatic detection of intake gestures is a key element of automatic dietary monitoring. Several types of sensors, including inertial measurement units (IMU) and video cameras, have been used for this purpose. The common machine learning approaches make use of the labelled sensor data to automatically learn how to make detections. One characteristic, especially for deep learning models, is the need for large datasets. To meet this need, we collected the Objectively Recognizing Eating Behavior and Associated Intake (OREBA) dataset. The OREBA dataset aims to provide a comprehensive multi-sensor recording of communal intake occasions for researchers interested in automatic detection of intake gestures. Two scenarios are included, with 100 participants for a discrete dish and 102 participants for a shared dish, totalling 9069 intake gestures. Available sensor data consists of synchronized frontal video and IMU with accelerometer and gyroscope for both hands. We report the details of data collection and annotation, as well as technical details of sensor processing. The results of studies on IMU and video data involving deep learning models are reported to provide a baseline for future research.
... The OREBA dataset includes (i) raw sensor data without any processing for left and right hand (e.g., <id>_inertial_raw.csv), and (ii) processed sensor data for dominant and non-dominant eating hand (e.g., <id>_inertial_processed.csv). Raw data is included since a recent study on OREBA indicates that data preprocessing only marginally improves results when combined with deep learning [20]. Processed data is generated from the raw data according to the following steps: ...
... For each modality, we use one simple CNN and one more complex model proposed in previous studies [20], [24]. As listed in Table 4, this results in a total of eight baseline models, considering the different scenarios, modalities, and models. ...
... The inertial models are taken from a recent study on OREBA by Heydarian et al. [20]. We compare the simple CNN with the more complex CNN-LSTM proposed in the aforementioned work. ...
Article
Automatic detection of intake gestures is a key element of automatic dietary monitoring. Several types of sensors, including inertial measurement units (IMU) and video cameras, have been used for this purpose. The common machine learning approaches make use of labeled sensor data to automatically learn how to make detections. One characteristic, especially for deep learning models, is the need for large datasets. To meet this need, we collected the Objectively Recognizing Eating Behavior and Associated Intake (OREBA) dataset. The OREBA dataset aims to provide comprehensive multi-sensor data recorded during the course of communal meals for researchers interested in intake gesture detection. Two scenarios are included, with 100 participants for a discrete dish and 102 participants for a shared dish, totalling 9069 intake gestures. Available sensor data consist of synchronized frontal video and IMU with accelerometer and gyroscope for both hands. We report the details of data collection and annotation, as well as details of sensor processing. The results of studies on IMU and video data involving deep learning models are reported to provide a baseline for future research. Specifically, the best baseline models achieve performances of F1 = 0.853 for the discrete dish using video and F1 = 0.852 for the shared dish using inertial data.
Chapter
Automatic detection of food intake (eating episodes) is the very first step in dietary assessment. Traditional methods such as food diaries are being replaced by reliable, more accurate technology-driven sensor-based methods. This article presents a systematic review of the use of sensors for automatic food intake detection. The review was conducted using the PRISMA guidelines; the full text of 111 scientific articles was reviewed. The contributions of this paper are twofold: (i) a comprehensive review of state-of-the-art passive (requiring no user input) food intake detection methods was conducted, five types (chewing, swallowing, motion, physiological, and environment) of eating proxies and seven sensor modalities (distance, physiological, strain, acoustic, motion, imaging, and others) were identified, and a taxonomy was developed; (ii) the accuracy of food intake detection and the applicability in free-living applications were assessed. The paper concludes with a discussion of the challenges and future directions of automatic food intake detection.