Figure 2. Axes and rotations of accelerometer and gyroscope sensors on the left and right wrists.

Source publication
Article
Wrist-worn inertial measurement units have emerged as a promising technology to passively capture dietary intake data. State-of-the-art approaches use deep neural networks to process the collected inertial data and detect characteristic hand movements associated with intake gestures. In order to clarify the effects of data preprocessing, sensor mod...

Contexts in source publication

Context 1
... data corresponds to the sensor's internal coordinate system. To mirror acceleration data horizontally, we flipped the sign of the x axis, which corresponds to the horizontal direction (see Fig. 2). We also flipped the signs of the x and y axes to compensate for the difference in sensor orientation between the left and right wrist in our experiments² (see [22] for a similar approach). Combined, this yields the ...
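As an illustration of the two sign flips described in this excerpt, here is a minimal sketch assuming the left-wrist accelerometer samples are stored as a NumPy array with columns ordered (x, y, z); the function name and array layout are illustrative assumptions, not details taken from the source paper.

```python
import numpy as np


def mirror_accel_left_to_right(accel_xyz: np.ndarray) -> np.ndarray:
    """Hypothetical helper: mirror left-wrist accelerometer data so it is
    comparable to right-wrist data, assuming columns are ordered (x, y, z)."""
    mirrored = accel_xyz.copy()
    # Horizontal mirroring: flip the sign of the x axis.
    mirrored[:, 0] *= -1
    # Orientation compensation between left and right wrist: flip x and y.
    mirrored[:, [0, 1]] *= -1
    # Net effect of both steps: the x flips cancel, leaving only y inverted.
    return mirrored
```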
Context 2
... the gyroscope data, we flipped the signs of the y and z axes to mirror rotations horizontally; as before, we also flipped the signs of the x and y axes to compensate for different ...
² We deliberately chose the sensor orientation shown in Fig. 2 to ensure that all participants wear the sensors uniformly. Specifically, participants were instructed to wear the sensor such that they were able to read the label on it. Another approach would have been to wear the sensors in the same direction, which changes the mirroring formula for the accelerometer to a_x, a_y, a_z = −a_x, ...
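A corresponding sketch for the gyroscope, under the same assumptions as the accelerometer example (NumPy array with columns (x, y, z); names are illustrative):

```python
import numpy as np


def mirror_gyro_left_to_right(gyro_xyz: np.ndarray) -> np.ndarray:
    """Hypothetical helper: mirror left-wrist gyroscope data so it is
    comparable to right-wrist data, assuming columns are ordered (x, y, z)."""
    mirrored = gyro_xyz.copy()
    # Horizontal mirroring of rotations: flip the signs of the y and z axes.
    mirrored[:, [1, 2]] *= -1
    # Orientation compensation between left and right wrist: flip x and y.
    mirrored[:, [0, 1]] *= -1
    # Net effect of both steps: the y flips cancel, leaving x and z inverted.
    return mirrored
```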
Context 3
... was collected from both hands using wrist-worn tri-axial accelerometers and tri-axial gyroscopes at a sampling frequency of 64 Hz (Movisens Move 3 G). Fig. 2 shows the axes and rotation directions of the inertial sensors used on the left and right hands. The data collection setup was a group setting of four participants who each individually consumed a standardized meal of lasagna, bread, yogurt, and water (no shared dishes). However, some sessions were conducted with two or three ...

Citations

... They established a two-stage detection scheme [4] to first identify intake frames and then detect intake gestures. Heydarian et al. [6] adopted this approach and proposed an inertial model that outperformed existing intake gesture detection models on the publicly available, multimodal OREBA-DIS dataset [7]. Rouast et al. [1] also adopted the two-stage detection scheme and compared different deep learning approaches for intake gesture detection using video data from the same dataset. ...
... In this paper, we address this gap by exploring the potential of data fusion in this context. We build on the two-stage approach by Kyritsis et al. [4] and consider the outputs of Heydarian et al.'s [6] inertial model and Rouast et al.'s [1] best video model. With respect to fusion, we compare score-level fusion (i.e., fusion of the probability outputs) and decision-level fusion (i.e., fusion of the decision outputs). ...
... The term intake gesture refers to a hand-to-mouth gesture associated with dietary intake (e.g., raising a spoon, fork, or cup) [6]. Automatic intake gesture detection attempts to classify hand gestures into intake vs. non-intake gestures (e.g., touching hair, scratching face) during an eating activity using classification techniques [6]. ...
Article
Recent research has employed deep learning to detect intake gestures from inertial sensor and video camera data. However, the fusion of these modalities has not been attempted. The present research explores the potential of fusing the outputs of two individual deep learning inertial and video intake gesture detection models (i.e., score-level and decision-level fusion) using the test sets from two publicly available multimodal datasets: (1) OREBA-DIS, recorded from 100 participants while consuming food served in discrete portions, and (2) OREBA-SHA, recorded from 102 participants while consuming a communal dish. We first assess the potential of fusion by contrasting the performance of the individual models in intake gesture detection. The assessment shows that fusing the outputs of individual models is more promising on the OREBA-DIS dataset. Subsequently, we conduct experiments using different score-level and decision-level fusion approaches. Our results show that the score-level max score fusion approach performs best of all considered fusion approaches. On the OREBA-DIS dataset, the max score fusion approach (F1 = 0.871) outperforms both the individual video (F1 = 0.855) and inertial (F1 = 0.806) models. However, on the OREBA-SHA dataset, the max score fusion approach (F1 = 0.873) fails to outperform the individual inertial model (F1 = 0.895). Pairwise comparisons using bootstrapped samples confirm the statistical significance of these differences in model performance (p < .001).
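To make the distinction between the two fusion levels concrete, here is a minimal sketch assuming both the inertial and the video model output frame-level intake probabilities aligned to the same frames. The max rule for score-level fusion follows the abstract above; the OR rule and the 0.5 threshold for decision-level fusion are illustrative assumptions rather than the exact rules used in the cited paper.

```python
import numpy as np


def score_level_max_fusion(p_inertial: np.ndarray, p_video: np.ndarray) -> np.ndarray:
    """Score-level fusion: combine the frame-level probabilities of the two
    models by taking their elementwise maximum."""
    return np.maximum(p_inertial, p_video)


def decision_level_fusion(p_inertial: np.ndarray, p_video: np.ndarray,
                          threshold: float = 0.5) -> np.ndarray:
    """Decision-level fusion: threshold each model's probabilities first,
    then combine the binary decisions (here with a logical OR)."""
    return (p_inertial >= threshold) | (p_video >= threshold)
```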
... The two-stage approach introduced by Kyritsis et al. [9] is currently the most advanced approach benchmarked on publicly available datasets for both inertial [9] and video data [6]. ...
Fig. 1 caption: F1 scores for our two-stage and single-stage models in comparison with the state of the art (SOTA). Our single-stage models see relative improvements between 3.3% and 17.7% over our implementations of the SOTA for inertial [10] and video modalities [6], and relative improvements between 1.9% and 6.2% over our own two-stage models for intake detection and eating/drinking detection across the OREBA and Clemson datasets.
... Hence, to facilitate a fair comparison, we also train several two-stage models based on 8-second time windows. In particular, we use cross-entropy loss to train two-stage versions of our own architectures outlined in Table I, as well as the architectures proposed by Heydarian et al. [10], Rouast et al. [6], and the adapted version of Kyritsis et al. [9] used in [10]. Note that the latter was originally designed to be trained with additional sub-gesture labels, which are not available for the Clemson and OREBA datasets. ...
Article
Accurate detection of individual intake gestures is a key step towards automatic dietary monitoring. Both inertial sensor data of wrist movements and video data depicting the upper body have been used for this purpose. The most advanced approaches to date use a two-stage approach, in which (i) frame-level intake probabilities are learned from the sensor data using a deep neural network, and then (ii) sparse intake events are detected by finding the maxima of the frame-level probabilities. In this study, we propose a single-stage approach which directly decodes the probabilities learned from sensor data into sparse intake detections. This is achieved by weakly supervised training using Connectionist Temporal Classification (CTC) loss, and decoding using a novel extended prefix beam search decoding algorithm. Benefits of this approach include (i) end-to-end training for detections, (ii) simplified timing requirements for intake gesture labels, and (iii) improved detection performance compared to existing approaches. Across two separate datasets, we achieve relative F1 score improvements between 1.9% and 6.2% over the two-stage approach for intake detection and eating/drinking detection tasks, for both video and inertial sensors.
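As a point of reference, stage (ii) of the two-stage baseline described above, detecting sparse intake events as maxima of the frame-level probabilities, could be sketched as follows; the threshold and minimum peak distance are illustrative values, not the ones used in the cited papers.

```python
import numpy as np
from scipy.signal import find_peaks


def detect_intake_events(frame_probs: np.ndarray,
                         threshold: float = 0.9,
                         min_distance: int = 64) -> np.ndarray:
    """Sketch of stage (ii): return frame indices of local maxima of the
    frame-level intake probabilities that exceed a threshold and are at
    least `min_distance` frames apart."""
    peaks, _ = find_peaks(frame_probs, height=threshold, distance=min_distance)
    return peaks
```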
... The two-stage approach introduced by Kyritsis et al. [9] is currently the most advanced approach benchmarked on publicly available datasets for both inertial [9] and video data [6]. ...
Fig. 1 caption (legend: [10], Two-stage (ours), Single-stage (ours)): F1 scores for our two-stage and single-stage models in comparison with the current state of the art (SOTA). Our single-stage models see relative improvements of 10.2% and 2.6% over the SOTA for inertial [10] and video-based intake detection [6] on the OREBA dataset, and relative improvements between 2.0% and 6.2% over comparable two-stage models for intake detection and eating vs. drinking detection tasks across the OREBA and Clemson datasets.
... In the experiments, we compare the proposed single-stage approach to the thresholding approach [4] and the two-stage approach [9], [10]. We consider two datasets of annotated intake gestures: the OREBA dataset [6] and the Clemson Cafeteria dataset [28]. ...
Preprint
Accurate detection of individual intake gestures is a key step towards automatic dietary monitoring. Both inertial sensor data of wrist movements and video data depicting the upper body have been used for this purpose. The most advanced approaches to date use a two-stage approach, in which (i) frame-level intake probabilities are learned from the sensor data using a deep neural network, and then (ii) sparse intake events are detected by finding the maxima of the frame-level probabilities. In this study, we propose a single-stage approach which directly decodes the probabilities learned from sensor data into sparse intake detections. This is achieved by weakly supervised training using Connectionist Temporal Classification (CTC) loss, and decoding using a novel extended prefix beam search decoding algorithm. Benefits of this approach include (i) end-to-end training for detections, (ii) consistency with the fuzzy nature of intake gestures, and (iii) avoidance of hard-coded rules. Across two separate datasets, we quantify these benefits by showing relative F1 score improvements between 2.0% and 6.2% over the two-stage approach for intake detection and eating vs. drinking recognition tasks, for both video and inertial sensors.
... The OREBA dataset includes (i) raw sensor data without any processing for left and right hand (e.g., <id>_inertial_raw.csv), and (ii) processed sensor data for dominant and non-dominant eating hand (e.g., <id>_inertial_processed.csv). Raw data is included since a recent study on OREBA indicates that data preprocessing only marginally improves results when combined with deep learning [20]. Processed data is generated from the raw data according to the following steps: ...
... For each modality, we use one simple CNN and one more complex model proposed in previous studies [20], [24]. As listed in Table 4, this results in a total of eight baseline models, considering the different scenarios, modalities, and models. ...
... The inertial models are taken from a recent study on OREBA by Heydarian et al. [20]. We compare the simple CNN with the more complex CNN-LSTM proposed in the aforementioned work. ...
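As a small usage sketch for the file layout described in the first excerpt above: the participant id below is a placeholder following the <id>_inertial_*.csv naming pattern, and no column names are assumed.

```python
import pandas as pd

participant_id = "<id>"  # placeholder; substitute an actual OREBA participant id

# Raw sensor data (left/right hand) and processed sensor data
# (dominant/non-dominant eating hand), as described above.
raw = pd.read_csv(f"{participant_id}_inertial_raw.csv")
processed = pd.read_csv(f"{participant_id}_inertial_processed.csv")

print(raw.columns.tolist())
print(processed.columns.tolist())
```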
Preprint
Automatic detection of intake gestures is a key element of automatic dietary monitoring. Several types of sensors, including inertial measurement units (IMU) and video cameras, have been used for this purpose. The common machine learning approaches make use of the labelled sensor data to automatically learn how to make detections. One characteristic, especially for deep learning models, is the need for large datasets. To meet this need, we collected the Objectively Recognizing Eating Behavior and Associated Intake (OREBA) dataset. The OREBA dataset aims to provide a comprehensive multi-sensor recording of communal intake occasions for researchers interested in automatic detection of intake gestures. Two scenarios are included, with 100 participants for a discrete dish and 102 participants for a shared dish, totalling 9069 intake gestures. Available sensor data consists of synchronized frontal video and IMU with accelerometer and gyroscope for both hands. We report the details of data collection and annotation, as well as technical details of sensor processing. The results of studies on IMU and video data involving deep learning models are reported to provide a baseline for future research.
... The OREBA dataset includes (i) raw sensor data without any processing for left and right hand (e.g., <id>_inertial_raw.csv), and (ii) processed sensor data for dominant and non-dominant eating hand (e.g., <id>_inertial_processed.csv). Raw data is included since a recent study on OREBA indicates that data preprocessing only marginally improves results when combined with deep learning [20]. Processed data is generated from the raw data according to the following steps: ...
... For each modality, we use one simple CNN and one more complex model proposed in previous studies [20], [24]. As listed in Table 4, this results in a total of eight baseline models, considering the different scenarios, modalities, and models. ...
... The inertial models are taken from a recent study on OREBA by Heydarian et al. [20]. We compare the simple CNN with the more complex CNN-LSTM proposed in the aforementioned work. ...
Article
Automatic detection of intake gestures is a key element of automatic dietary monitoring. Several types of sensors, including inertial measurement units (IMU) and video cameras, have been used for this purpose. The common machine learning approaches make use of labeled sensor data to automatically learn how to make detections. One characteristic, especially for deep learning models, is the need for large datasets. To meet this need, we collected the Objectively Recognizing Eating Behavior and Associated Intake (OREBA) dataset. The OREBA dataset aims to provide comprehensive multi-sensor data recorded during the course of communal meals for researchers interested in intake gesture detection. Two scenarios are included, with 100 participants for a discrete dish and 102 participants for a shared dish, totalling 9069 intake gestures. Available sensor data consist of synchronized frontal video and IMU with accelerometer and gyroscope for both hands. We report the details of data collection and annotation, as well as details of sensor processing. The results of studies on IMU and video data involving deep learning models are reported to provide a baseline for future research. Specifically, the best baseline models achieve performances of F1 = 0.853 for the discrete dish using video and F1 = 0.852 for the shared dish using inertial data.
Chapter
Automatic detection of food intake (eating episodes) is the very first step in dietary assessment. Traditional methods such as food diaries are being replaced by reliable, more accurate technology-driven sensor-based methods. This article presents a systematic review of the use of sensors for automatic food intake detection. The review was conducted using the PRISMA guidelines; the full text of 111 scientific articles was reviewed. The contributions of this paper are twofold: (i) a comprehensive review of state-of-the-art passive (requiring no user input) food intake detection methods was conducted, and five types of eating proxies (chewing, swallowing, motion, physiological, and environment) and seven sensor modalities (distance, physiological, strain, acoustic, motion, imaging, and others) were identified, from which a taxonomy was developed; (ii) the accuracy of food intake detection and the applicability in free-living applications were assessed. The paper concludes with a discussion of the challenges and future directions of automatic food intake detection.