The spherical video is remapped to an equirectangular representation, cropped, and resized to a square shape.

Source publication
Article
Full-text available
Automatic detection of intake gestures is a key element of automatic dietary monitoring. Several types of sensors, including inertial measurement units (IMU) and video cameras, have been used for this purpose. The common machine learning approaches make use of labeled sensor data to automatically learn how to make detections. One characteristic, es...

Contexts in source publication

Context 1
... OREBA-DIS: In the first scenario, foods were served in discrete portions to each participant. The meal consisted of lasagna (choice between vegetarian and beef), bread, and yogurt. Additionally, there was water available to drink, and butter to spread on the bread. The study setup for OREBA-DIS is shown in Fig. 1. 2) OREBA-SHA: In the second scenario, participants consumed a communal dish of vegetable korma or butter chicken with rice and mixed vegetables. Additionally, there was water available to drink. The study setup for OREBA-SHA is shown in Fig. 2. Lasagna and rice-based dishes were chosen since they are amongst the most common dishes in ...
Context 2
... shown in Fig. 1, we first mapped the spherical video from the 360-degree camera to an equirectangular representation. Then, we separated the equirectangular representation into individual participant videos by cropping the areas of interest. We further resized each participant video to a square shape. The two spatial resolutions 140 × 140 (e.g., ...
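As a rough illustration of the cropping and resizing step described in this context, the sketch below assumes the equirectangular frame is already available as a NumPy array and uses OpenCV; the region-of-interest coordinates, the 140 × 140 target size, and the interpolation choice are placeholders, not values confirmed by the paper.

```python
# Minimal sketch of the crop-and-resize step, assuming the equirectangular
# frame is already available as an (H x W x 3) NumPy array. ROI coordinates
# and the 140x140 target size are illustrative assumptions.
import cv2
import numpy as np

def extract_participant_video(equirect_frame, roi, size=140):
    """Crop one participant's area of interest and resize it to a square."""
    x, y, w, h = roi  # hypothetical region of interest in pixels
    crop = equirect_frame[y:y + h, x:x + w]
    return cv2.resize(crop, (size, size), interpolation=cv2.INTER_AREA)

# Example with a placeholder equirectangular frame and a hypothetical seat ROI.
frame = np.zeros((960, 1920, 3), dtype=np.uint8)
participant_clip = extract_participant_video(frame, roi=(300, 200, 400, 400))
print(participant_clip.shape)  # (140, 140, 3)
```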

Citations

... They established a two-stage detection scheme [4] to first identify intake frames and then detect intake gestures. Heydarian et al. [6] adopted this approach and proposed an inertial model that outperformed existing intake gesture detection models on the publicly available, multimodal OREBA-DIS dataset [7]. Rouast et al. [1] also adopted the two-stage detection scheme and compared different deep learning approaches for intake gesture detection using video data from the same dataset. ...
... With respect to fusion, we compare score-level fusion (i.e., fusion of the probability outputs) and decision-level fusion (i.e., fusion of the decision outputs). We conduct our experiments on the publicly available multimodal OREBA-DIS (discrete dish, 100 participants) and OREBA-SHA (shared dish, 102 participants) datasets [7], with 180 unique participants in total. The contributions of this study are as follows: ...
... Intake gesture detection from data recorded by wearable inertial sensors has been explored since 2005 [13] using different machine learning algorithms [2]. Since 2017 [14], deep learning (e.g., [3], [5], [6]) has been used to improve intake gesture detection, particularly with the availability of annotated datasets (e.g., [4], [7], [15]). In particular, recurrent neural networks (RNNs), with their ability to take the previous states of data into account [16] (e.g., [17]-[19]), have recently been used to model the temporal context of inertial and video data (e.g., [1], [4], [6]). ...
Article
Full-text available
Recent research has employed deep learning to detect intake gestures from inertial sensor and video camera data. However, the fusion of these modalities has not been attempted. The present research explores the potential of fusing the outputs of two individual deep learning inertial and video intake gesture detection models (i.e., score-level and decision-level fusion) using the test sets from two publicly available multimodal datasets: (1) OREBA-DIS recorded from 100 participants while consuming food served in discrete portions and (2) OREBA-SHA recorded from 102 participants while consuming a communal dish. We first assess the potential of fusion by contrasting the performance of the individual models in intake gesture detection. The assessment shows that fusing the outputs of individual models is more promising on the OREBA-DIS dataset. Subsequently, we conduct experiments using different score-level and decision-level fusion approaches. Our results show that the score-level max score fusion approach performs best of all considered fusion approaches. On the OREBA-DIS dataset, the max score fusion approach (F1 = 0.871) outperforms both individual video (F1 = 0.855) and inertial (F1 = 0.806) models. However, on the OREBA-SHA dataset, the max score fusion approach (F1 = 0.873) fails to outperform the individual inertial model (F1 = 0.895). Pairwise comparisons using bootstrapped samples confirm the statistical significance of these differences in model performance (p < .001).
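To make the fusion terminology above concrete, here is a minimal sketch contrasting score-level max fusion (taking the larger of the two frame-level probabilities before thresholding) with one possible decision-level rule (a logical OR of the individual decisions). Function names, the 0.5 threshold, and the OR rule are illustrative assumptions; the paper evaluates several fusion variants.

```python
# Hedged sketch of score-level "max score" fusion vs. one decision-level rule.
# Thresholds and the OR rule are assumptions, not the paper's exact settings.
import numpy as np

def max_score_fusion(p_video, p_inertial, threshold=0.5):
    """Frame-level score fusion: keep the higher of the two model probabilities."""
    fused = np.maximum(p_video, p_inertial)   # score-level fusion
    return (fused >= threshold).astype(int)   # binary intake decision per frame

def decision_level_fusion(p_video, p_inertial, threshold=0.5):
    """Fuse after thresholding each model separately (logical OR as one option)."""
    d_video = p_video >= threshold
    d_inertial = p_inertial >= threshold
    return (d_video | d_inertial).astype(int)

p_v = np.array([0.2, 0.7, 0.9, 0.4])
p_i = np.array([0.6, 0.3, 0.8, 0.2])
print(max_score_fusion(p_v, p_i))        # [1 1 1 0]
print(decision_level_fusion(p_v, p_i))   # [1 1 1 0]
```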
... Doctors or their assistants in a clinic can use inertial sensors to track patients' motions to assist in rehabilitation or disease diagnosis [9]-[11]. Some studies have collected data from many participants wearing IMUs during everyday life for action recognition tasks [12]-[14]. ...
Article
Full-text available
Due to the recent technological advances in inertial measurement units (IMUs), many applications for the measurement of human motion using multiple body-worn IMUs have been developed. In these applications, each IMU has to be attached to a predefined body segment. A technique to identify the body segment on which each IMU is mounted allows users to attach inertial sensors to arbitrary body segments, which avoids having to remeasure due to incorrect attachment of the sensors. We address this IMU-to-segment assignment problem and propose a novel end-to-end learning model that incorporates a global feature generation module and an attention-based mechanism. The former extracts a feature representing the motion of all attached IMUs, and the latter enables the model to learn the dependency relationships between the IMUs. The proposed model thus identifies the IMU placement based on the features from global motion and relevant IMUs. We quantitatively evaluated the proposed method using synthetic and real public datasets with three sensor configurations, including a full-body configuration mounting 15 sensors. The results demonstrated that our approach significantly outperformed the conventional and baseline methods for all datasets and sensor configurations.
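A very rough PyTorch sketch of the architectural idea summarized in this abstract (per-IMU feature extraction, a pooled global motion feature, and attention across IMUs feeding a segment classifier) is shown below. All layer choices, sizes, and names are assumptions made for illustration and do not reproduce the authors' model.

```python
# Illustrative sketch only: per-IMU encoder, global pooled feature, and
# inter-IMU attention for segment assignment. Sizes are assumptions.
import torch
import torch.nn as nn

class ImuToSegmentModel(nn.Module):
    def __init__(self, in_channels=6, feat_dim=64, n_segments=15):
        super().__init__()
        self.encoder = nn.Sequential(                      # per-IMU feature extractor
            nn.Conv1d(in_channels, feat_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * feat_dim, n_segments)

    def forward(self, x):                                  # x: (batch, n_imus, channels, time)
        b, n, c, t = x.shape
        feats = self.encoder(x.reshape(b * n, c, t)).reshape(b, n, -1)
        global_feat = feats.mean(dim=1, keepdim=True).expand(-1, n, -1)  # global motion feature
        attended, _ = self.attn(feats, feats, feats)                     # inter-IMU dependencies
        return self.classifier(torch.cat([attended, global_feat], dim=-1))  # (b, n, n_segments)

logits = ImuToSegmentModel()(torch.randn(2, 15, 6, 128))
print(logits.shape)  # torch.Size([2, 15, 15])
```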
... The intake monitoring data are annotated with hand, utensil, container, and food. The dataset for Objectively Recognizing Eating Behaviour and Associated Intake (OREBA) is dedicated to IMU data from both hands, synchronized with a frontal camera, for 180 participants during food intake [186]. All of these datasets include videos for ground truth, although the videos are not publicly available. ...
Article
This comprehensive review mainly analyzes and summarizes recently published works on IEEE Xplore in sensor-driven smart living contexts. We have gathered over 150 research papers, especially from the past five years. We categorize them into four major research directions: activity tracker, affective computing, sleep monitoring, and ingestive behavior. We summarize each research direction according to our defined sensor types: biomedical sensors, mechanical sensors, non-contact sensors, and others. Furthermore, the review serves as one-stop literature for novices who intend to research sensor-driven applications towards smart living. In conclusion, the state-of-the-art works, the publicly available data sources, and the future challenges (sensor selection, algorithms, and privacy) are the major contributions of this article.
... More recent developments include the use of machine learning to learn features automatically [5] and learning from video, which has become more practical with emerging spherical camera technology [6] [7]. Research on the OREBA dataset showed that frontal video data can exhibit even higher accuracies in detecting eating gestures than inertial data [8]. ...
... C. Datasets 1) OREBA: The OREBA dataset [8] includes inertial and video data. This dataset was approved by the IRB at The University of Newcastle on 10 September 2017 (H-2017-0208). ...
... Specifically, we use the OREBA-DIS scenario with data for 100 participants (69 male, 31 female) and 4790 annotated intake gestures. The split suggested by the dataset authors [8] includes training, validation, and test sets of 61, 20, and 19 participants. For the inertial models, we use the processed accelerometer and gyroscope data from both wrists at 64 Hz (8 seconds correspond to 512 frames). ...
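The sketch below merely illustrates the stated 64 Hz / 8 s / 512-frame relationship by slicing a multichannel IMU recording into fixed-length windows; the 12-channel layout (both wrists × accelerometer + gyroscope × 3 axes) and the non-overlapping stride are assumptions, not details from the dataset paper.

```python
# Small sketch of slicing 64 Hz IMU streams into 8 s (512-frame) windows.
# Channel count and stride are illustrative assumptions.
import numpy as np

SAMPLE_RATE = 64                               # Hz, as stated for the processed inertial data
WINDOW_FRAMES = SAMPLE_RATE * 8                # 8 seconds -> 512 frames

def sliding_windows(signal, stride=WINDOW_FRAMES):
    """Slice a (time, channels) recording into (n_windows, 512, channels) blocks."""
    n = (signal.shape[0] - WINDOW_FRAMES) // stride + 1
    return np.stack([signal[i * stride:i * stride + WINDOW_FRAMES] for i in range(n)])

recording = np.random.randn(64 * 60, 12)       # one hypothetical minute of 12-channel data
print(sliding_windows(recording).shape)        # (7, 512, 12) with non-overlapping windows
```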
Article
Full-text available
Accurate detection of individual intake gestures is a key step towards automatic dietary monitoring. Both inertial sensor data of wrist movements and video data depicting the upper body have been used for this purpose. The most advanced approaches to date use a two-stage approach, in which (i) frame-level intake probabilities are learned from the sensor data using a deep neural network, and then (ii) sparse intake events are detected by finding the maxima of the frame-level probabilities. In this study, we propose a single-stage approach which directly decodes the probabilities learned from sensor data into sparse intake detections. This is achieved by weakly supervised training using Connectionist Temporal Classification (CTC) loss, and decoding using a novel extended prefix beam search decoding algorithm. Benefits of this approach include (i) end-to-end training for detections, (ii) simplified timing requirements for intake gesture labels, and (iii) improved detection performance compared to existing approaches. Across two separate datasets, we achieve relative F1 score improvements between 1.9% and 6.2% over the two-stage approach for intake detection and eating/drinking detection tasks, for both video and inertial sensors.
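For intuition, the second stage of the two-stage scheme described here (detecting sparse intake events at the maxima of the frame-level probabilities) could look roughly like the following; the threshold and minimum peak distance are placeholder values, not the parameters used in the paper.

```python
# Illustrative sketch of stage (ii): turning frame-level intake probabilities
# into sparse detections via local maxima above a threshold. Parameter values
# are assumptions for demonstration.
import numpy as np
from scipy.signal import find_peaks

def detect_intake_events(frame_probs, threshold=0.5, min_distance_frames=16):
    """Return frame indices of detected intake gestures."""
    peaks, _ = find_peaks(frame_probs, height=threshold, distance=min_distance_frames)
    return peaks

probs = np.array([0.1, 0.2, 0.8, 0.6, 0.1, 0.05, 0.4, 0.9, 0.3, 0.1])
print(detect_intake_events(probs, min_distance_frames=2))  # [2 7]
```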
... More recent developments include the use of machine learning to learn features automatically [5] and learning from video, which has become more practical with emerging spherical camera technology [6] [7]. Research on the OREBA dataset showed that frontal video data can exhibit even higher accuracies in detecting eating gestures than inertial data [8]. ...
... C. Datasets 1) OREBA: The OREBA dataset [8] includes both inertial and video data. Specifically, we are using the scenario OREBA-DIS with data for 100 participants (69 male, 31 female) and 4790 annotated intake gestures. ...
... Specifically, we are using the scenario OREBA-DIS with data for 100 participants (69 male, 31 female) and 4790 annotated intake gestures. Data are split into training, validation, and test sets of 61, 20, and 19 participants according to the split suggested by the dataset authors [8]. For our inertial models, we use the processed data from accelerometer and gyroscope readings for both wrists at 64 Hz. ...
Preprint
Full-text available
Accurate detection of individual intake gestures is a key step towards automatic dietary monitoring. Both inertial sensor data of wrist movements and video data depicting the upper body have been used for this purpose. The most advanced approaches to date use a two-stage approach, in which (i) frame-level intake probabilities are learned from the sensor data using a deep neural network, and then (ii) sparse intake events are detected by finding the maxima of the frame-level probabilities. In this study, we propose a single-stage approach which directly decodes the probabilities learned from sensor data into sparse intake detections. This is achieved by weakly supervised training using Connectionist Temporal Classification (CTC) loss, and decoding using a novel extended prefix beam search decoding algorithm. Benefits of this approach include (i) end-to-end training for detections, (ii) consistency with the fuzzy nature of intake gestures, and (iii) avoidance of hard-coded rules. Across two separate datasets, we quantify these benefits by showing relative $F_1$ score improvements between 2.0% and 6.2% over the two-stage approach for intake detection and eating vs. drinking recognition tasks, for both video and inertial sensors.
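As a loose illustration of the weakly supervised training idea in this abstract, the snippet below applies PyTorch's CTC loss to frame-level log-probabilities and a sparse label sequence, so only the order of intake events is needed rather than exact frame timings. The two-class setup (blank vs. intake), the shapes, and the random inputs are assumptions for demonstration, and the paper's extended prefix beam search decoder is not shown.

```python
# Hedged sketch of CTC-based weakly supervised training for sparse intake labels.
# Shapes, class setup, and inputs are illustrative assumptions.
import torch
import torch.nn as nn

T, N, C = 512, 4, 2                                  # frames, batch size, classes (0 = blank, 1 = intake)
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)  # stand-in for network outputs

targets = torch.tensor([[1, 1, 1], [1, 1, 0], [1, 0, 0], [1, 1, 1]])      # intake events per sequence (padded)
target_lengths = torch.tensor([3, 2, 1, 3])
input_lengths = torch.full((N,), T, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                      # gradients flow end-to-end into the network
print(float(loss))
```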
Article
Full-text available
Nowadays, individuals have very stressful lifestyles, affecting their nutritional habits. In the early stages of life, teenagers begin to exhibit bad habits and inadequate nutrition. Likewise, people with dementia, Alzheimer’s disease, or other conditions may not take food or medicine regularly. Therefore, the ability to monitor could be beneficial for them and for the doctors who can analyze the patterns of eating habits and their correlation with overall health. Many sensors help accurately detect food intake episodes, including electrogastrography, cameras, microphones, and inertial sensors. Accurate detection may provide better control to enable healthy nutrition habits. This paper presents a systematic review of the use of technology for food intake detection, focusing on the different sensors and methodologies used. The search was performed with a Natural Language Processing (NLP) framework that helps screen irrelevant studies while following the PRISMA methodology. It automatically searched and filtered the research studies in different databases, including PubMed, Springer, ACM, IEEE Xplore, MDPI, and Elsevier. Then, manual analysis selected 30 papers based on the results of the framework, which support the interest in using sensors for food intake detection and nutrition assessment. The most commonly used sensors are cameras, inertial sensors, and acoustic sensors, which handle the recognition of food intake episodes with artificial intelligence techniques. This research identifies the most used sensors and data processing methodologies to detect food intake.