Received September 10, 2020, accepted September 23, 2020, date of publication September 28, 2020,
date of current version October 15, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.3026965
OREBA: A Dataset for Objectively Recognizing
Eating Behavior and Associated Intake
PHILIPP V. ROUAST 1, (Member, IEEE), HAMID HEYDARIAN 1,
MARC T. P. ADAM 1,3, AND MEGAN E. ROLLO2,3
1School of Electrical Engineering and Computing, The University of Newcastle, Callaghan, NSW 2308, Australia
2School of Health Sciences, The University of Newcastle, Callaghan, NSW 2308, Australia
3Priority Research Centre for Physical Activity and Nutrition, The University of Newcastle, Callaghan, NSW 2308, Australia
Corresponding author: Marc T. P. Adam (marc.adam@newcastle.edu.au)
This work was supported by the Bill & Melinda Gates Foundation under Grant OPP1171389. The work of Philipp V. Rouast and Hamid
Heydarian was supported by the Australian Government Research Training Program (RTP) Scholarship.
ABSTRACT Automatic detection of intake gestures is a key element of automatic dietary monitoring.
Several types of sensors, including inertial measurement units (IMU) and video cameras, have been used for
this purpose. The common machine learning approaches make use of labeled sensor data to automatically
learn how to make detections. One characteristic, especially for deep learning models, is the need for large
datasets. To meet this need, we collected the Objectively Recognizing Eating Behavior and Associated
Intake (OREBA) dataset. The OREBA dataset aims to provide comprehensive multi-sensor data recorded
during the course of communal meals for researchers interested in intake gesture detection. Two scenarios
are included, with 100 participants for a discrete dish and 102 participants for a shared dish, totalling 9069
intake gestures. Available sensor data consist of synchronized frontal video and IMU with accelerometer
and gyroscope for both hands. We report the details of data collection and annotation, as well as details of
sensor processing. The results of studies on IMU and video data involving deep learning models are reported
to provide a baseline for future research. Specifically, the best baseline models achieve performances of
F1=0.853 for the discrete dish using video and F1=0.852 for the shared dish using inertial data.
INDEX TERMS Dietary monitoring, eating behavior assessment, accelerometer, communal eating,
gyroscope, 360-degree video camera.
I. INTRODUCTION
Traditional dietary assessment methods are reliant on
self-report data. While data captured with active methods
such as self-report and 24-hr recall are widely used in prac-
tice, they are not without limitations (e.g., human error,
time-consuming manual process) [1]. Automatic dietary
monitoring, where data is collected and processed indepen-
dent of the individual, has the potential to complement data
from traditional methods and reduce associated biases [2].
In addition, such systems have the potential to support per-
sonal self-monitoring solutions by providing individuals with
targeted eating behavior recommendations.
A key element of automatic dietary monitoring is the
detection of intake gestures (i.e., the process of moving food
or drink towards the mouth). Recent research on this task
focuses mainly on machine learning approaches which are
characterized by a need for large amounts of labeled data.
This is especially true in conjunction with deep learning,
which has been applied in this context since 2017 [3]. How-
ever, collecting, synchronizing, and labeling data of eating
occasions is a work-intensive process. Hence, there is a need
for more public datasets to reduce barriers for researchers
to create new machine learning models, and to objectively
compare the performance of existing approaches [4], [5].
At the same time, current research on dietary monitoring
identified a gap in research on shared plate eating [6]. Com-
munal eating (i.e., eating occasions involving more than one
person) is not yet well understood, let alone the impact it has
on accuracy of automatic dietary monitoring. Hence, existing
research on capturing dietary intake from discrete dishes
needs to be complemented and contrasted with research on
the detection of intake from shared dishes.
In order to address these gaps, the present paper introduces
the Objectively Recognizing Eating Behavior and Associated
Intake (OREBA) dataset. The goal of OREBA is to facilitate
TABLE 1. Public datasets of intake gestures with synchronized sensor data and annotations available.
the automatic detection of intake gestures in communal eating
across two scenarios (discrete dish and shared dish). By creat-
ing this dataset and making it available to the wider research
community, this paper makes four key contributions:
1) Large-scale dataset: We conducted a total of 202
meal recordings, with 180 unique individuals partic-
ipating who consented to their data being used by
other research institutions. In total, we captured 9069
intake gestures from discrete and shared dishes. Two
independent annotators labeled and cross-checked the
intake gestures.
2) Public availability: Progress in the research of
machine learning methods is tightly linked to the pub-
lic availability of labeled datasets (e.g. [10]–[12]). By
making the dataset publicly available to researchers,
OREBA can be used to objectively benchmark existing
and emerging machine learning approaches. At the
same time, it reduces the time-consuming burden for
researchers to collect, annotate and cross-check their
own data.
3) Communal eating: While existing research has pro-
vided important insights into automatically detecting
human intake gestures in individual settings, research
on communal eating is scant [6], [13]. To the best of our
knowledge, this is the first dataset capturing communal
eating from both discrete and shared dishes.
4) Multiple modalities: The dataset includes syn-
chronized frontal video and inertial measurement
unit (IMU) sensor data from both hands, along with
labels for each intake gesture. A single spherical
camera positioned in the center of the table made it
possible to capture the entire communal eating scene
of up to four participants, offering a full view of all
relevant gestures. While existing inertial datasets on
intake gesture detection often use video as ground
truth, none of the existing datasets currently include
video data as part of the synchronized sensor data
for analysis.
In the following, Section II gives an overview of the related
work and existing datasets, Section III introduces the data
collection and annotation process of the OREBA dataset in
detail, Section IV provides results from our initial studies
as baselines. Finally, we provide a discussion in Section V
and conclusions in Section VI.
II. RELATED WORK
A. AUTOMATIC DIETARY MONITORING
Automatic dietary monitoring encompasses three major
goals: (i) detecting the timing of intake events, (ii) recog-
nizing the type of food or drink, and (iii) estimating the
weight consumed. Detection of intake behavior, which is
associated with intake gestures, chews, and swallows, can be
considered as part of the first goal. Researchers have lever-
aged various sensor types for this purpose. While
chews and swallows can be detected using audio signals [14],
intake gestures are typically handled using an IMU includ-
ing accelerometer and gyroscope sensors [13], [15]. Before
the application of deep learning architectures, the traditional
approach in this field reduced the dimensionality of the raw
sensor data by extracting handcrafted features based on expert
knowledge. Deep learning methods have been explored to
detect individual intake gestures with inertial sensor data
since 2017 [3] and with video data since 2018 [2], [4], [5],
whereby large amounts of labeled examples are leveraged
to let algorithms learn the features automatically. The most
widely used approach in this space builds on convolutional
neural networks (CNN) and long short-term memory (LSTM)
models [16]; however, gated recurrent unit (GRU) models
have also been applied, especially in the context of activity
recognition in daily living [17].
B. EXISTING DATASETS
To date, most published studies on recognition of intake
behavior rely on dedicated, private datasets collected for a
specific purpose. Considering the shift towards adoption of
deep learning techniques, we expect an increasing need for
large, public datasets that existing and emerging machine
learning approaches can objectively be benchmarked on.
Similar developments can be observed across several related
fields such as action recognition [10], affect recogni-
tion [12], [18], and object recognition [11].
Table 1 provides an overview of publicly available datasets
on intake gestures which feature synchronized sensor data of
FIGURE 1. The spherical video is remapped to equirectangular representation, cropped, and reshaped to square shape.
eating occasions with labels for individual gestures (intake
gestures or other eating related gestures)1.
The accelerometer and audio-based calorie estima-
tion (ACE) dataset2[7] contains seven participants with
audio and IMU data for both hands and the head. Anno-
tations of type and amount of food and drink are avail-
able for chews and swallows.
The Clemson Cafeteria dataset3[8] contains 264 partici-
pants and 488 recordings. IMU data is available at 15 Hz
for the dominant hand, along with scale measurements
for the tray. Each intake gesture is annotated with hand,
utensil, container, and food.
The Food Intake Cycle (FIC) dataset4[9], which con-
sists of 12 participants and 21 recordings, includes IMU
data for the dominant hand. The focus is on the micro-
movements during intake gestures.
While video is commonly used as ground truth, none of
the existing datasets currently include video data as part of
the synchronized sensor data for analysis. In terms of IMU
data and quantity of recorded intake events, we find that the
existing datasets are restricted either to data from only one
hand, a relatively low recording frequency (15 Hz), or few
participants. We aim to further the field by establishing the
OREBA dataset, which includes video and IMU from both
hands, at a quantity of intake events sufficient to train deep
learning models for both video and inertial modalities.
III. THE OREBA DATASET
The OREBA dataset aims to provide a comprehensive
multi-sensor recording of communal intake occasions for
researchers interested in automatic detection of intake ges-
tures and other behaviors associated with intake (e.g., serving
food onto a plate). Available sensor data consists of synchro-
nized frontal video and accelerometer and gyroscope for both
hands in two different scenarios (i.e., discrete dish and shared
dish). IRB approval was given (H-2017-0208), and the data
was recorded between Mar 2018 and Oct 2019.
1A related dataset is iHEARu-EAT [19], however we did not include it
here since it does not focus on intake events.
2See http://www.skleinberg.org/data.html
3See http://cecas.clemson.edu/~ahoover/cafeteria/. Recordings with missing
annotations are excluded here.
4See https://mug.ee.auth.gr/intake-cycle-detection/
A. SCENARIOS
The OREBA dataset consists of two separate communal eat-
ing scenarios. In each scenario, groups of up to four partic-
ipants were simultaneously recorded consuming a meal at a
communal table:
1) OREBA-DIS: In the first scenario, foods were served
in discrete portions to each participant. The meal con-
sisted of lasagna (choice between vegetarian and beef),
bread, and yogurt. Additionally, there was water avail-
able to drink, and butter to spread on the bread. The
study setup for OREBA-DIS is shown in Fig. 1.
2) OREBA-SHA: In the second scenario, participants
consumed a communal dish of vegetable korma or
butter chicken with rice and mixed vegetables. Addi-
tionally, there was water available to drink. The study
setup for OREBA-SHA is shown in Fig. 2.
FIGURE 2. Study setup for OREBA-SHA. One camera in the center of the
table, IMU on each wrist, and four scales.
Lasagna and rice-based dishes were chosen since they are
amongst the most common dishes in similar studies [13].
All participants within each scenario are unique; however, 22
participants took part in both scenarios.
B. SENSORS
For each group, video was recorded using a spherical cam-
era placed in the center of the shared table (360fly-4K5).
5See https://www.360fly.com/
This allowed video recording to occur in a simultaneous and
unobtrusive way for all participants engaging in the commu-
nal eating occasion around the table. The sampling rates are
24 fps for OREBA-DIS, and 30 fps for OREBA-SHA. Each
participant wore two IMUs, one on each wrist (Movisens
Move 3+6). The IMU included an accelerometer and a
gyroscope with a sampling rate of 64 Hz. For OREBA-SHA,
four scales additionally recorded the weight of the communal
dishes (two rice dishes, one wet dish, one vegetable dish) at
1 Hz (Adam Equipment CBK 4).
C. SENSOR PROCESSING
1) VIDEO
As shown in Fig. 1, we first mapped the spherical video
from the 360-degree camera to equirectangular represen-
tation7. Then, we separated the equirectangular represen-
tation into individual participant videos by cropping the
areas of interest. We further resized each participant video
to a square shape. Two spatial resolutions, 140 × 140 pixels
(e.g., <id>_video_140p.mp4) and 250 × 250 pixels
(e.g., <id>_video_250p.mp4), are included. All
videos are encoded using the H.264 standard and stored in
mp4 containers.
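To illustrate the per-participant cropping and resizing step described above, the following Python/OpenCV sketch extracts one region of interest from an equirectangular frame and writes a 140 × 140 video. This is not the authors' processing pipeline; the file names, crop coordinates, and codec choice are placeholder assumptions.

```python
# Minimal sketch of the per-participant crop-and-resize step.
# File names and crop coordinates are hypothetical; actual regions depend on seating positions.
import cv2

def crop_participant(equirect_frame, x, y, w, h, out_size=140):
    """Crop one participant's region of interest and resize it to a square."""
    roi = equirect_frame[y:y + h, x:x + w]
    return cv2.resize(roi, (out_size, out_size), interpolation=cv2.INTER_AREA)

cap = cv2.VideoCapture("group_equirectangular.mp4")     # hypothetical input file
out = cv2.VideoWriter("participant_140p.mp4",
                      cv2.VideoWriter_fourcc(*"avc1"),   # H.264, if available in the build
                      24.0, (140, 140))
while True:
    ok, frame = cap.read()
    if not ok:
        break
    out.write(crop_participant(frame, x=800, y=200, w=600, h=600))
cap.release()
out.release()
```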
2) INERTIAL MEASUREMENT UNIT
Raw accelerometer data is measured in g, while gyroscope
data is measured in deg/s. The OREBA dataset includes
(i) raw sensor data without any processing for left and right
hand (e.g., <id>_inertial_raw.csv), and (ii) pro-
cessed sensor data for dominant and non-dominant eating
hand (e.g., <id>_inertial_processed.csv). Raw
data is included since a recent study on OREBA indicates that
data preprocessing only marginally improves results when
combined with deep learning [20]. Processed data is gener-
ated from the raw data according to the following steps:
1: Removal of gravity effect. The raw accelerometer read-
ing is subject to acceleration from participants’ wrist move-
ments as well as the earth’s gravitational field. We remove this
gravity effect by estimating sensor orientation using sensor
fusion with Madgwick’s filter [21], rotation of the accelera-
tion vector with the resulting quaternion, and deduction of the
gravity vector (see [3] for a similar approach).
2: Standardization. Each column (i.e. each axis for each
modality and hand) is standardized by subtracting its mean
and dividing by its standard deviation (see [9] for a similar
approach). Processed data can hence be regarded as unitless.
3: Transforming from left and right hand to dominant and
non-dominant hand. To achieve data uniformity, we report
hands in the processed data as dominant and non-dominant.
A similar approach was chosen for the FIC dataset [16].
All data reported as dominant hands correspond to right
hands, and non-dominant hands to left hands; for left-handed
participants data for both hands has been transformed to
6See https://www.movisens.com/en/products/activity-sensor-move-3/
7See https://github.com/prouast/equirectangular-remap
FIGURE 3. The wrist-worn sensors with their internal coordinate frames.
Additionally, the direction of positive rotation is indicated for each axis.
achieve this. Specifically, we mirrored the data of left-handed
participants to transform the data from the left wrist as if
it had been recorded on the right wrist, and vice versa.
Due to the way the sensors are mounted on the wrist (see
Figure 3), the horizontal direction corresponds to the x axis.
For accelerometer data, we estimate mirroring by flipping the
sign of the x axis, and for gyroscope by flipping the signs of
the y and z axis. Further, we also flip the signs of the x and y
axis to compensate for the different sensor orientations on
the wrists, yielding transformation (1) for accelerometer and
(2) for gyroscope.
[a'_x, a'_y, a'_z] = [-(-a_x), -a_y, a_z] = [a_x, -a_y, a_z] (1)
[g'_x, g'_y, g'_z] = [-g_x, -(-g_y), -g_z] = [-g_x, g_y, -g_z] (2)
Note that the mirroring technique proposed here could
also be of use for data augmentation pipelines.
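As a minimal illustration of transformations (1) and (2), the following Python sketch applies the sign flips to raw accelerometer and gyroscope arrays of a left-handed participant. The column order (x, y, z) is an assumption; this is a sketch, not the dataset's processing code.

```python
# Minimal sketch of transformations (1) and (2): mirroring plus wrist-orientation
# compensation for left-handed participants. Column order (x, y, z) is assumed.
import numpy as np

def transform_left_handed(acc, gyro):
    """acc, gyro: arrays of shape (n_samples, 3)."""
    acc = acc * np.array([1.0, -1.0, 1.0])     # (1): a' = [a_x, -a_y, a_z]
    gyro = gyro * np.array([-1.0, 1.0, -1.0])  # (2): g' = [-g_x, g_y, -g_z]
    return acc, gyro
```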
3) SYNCHRONIZATION
Ground truth for sensor synchronization was acquired by
asking participants to clap their hands before starting, and
after finishing their meal (see [3] for a similar approach).
The clapping creates a distinct signature in both the video
recording and the accelerometer. All sensors were trimmed
in time and synchronized for each participant based on these
two reference points.
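A simple way to locate the clapping signature in the inertial stream is to search for the largest acceleration magnitude within a window around the start or end of the recording. The sketch below illustrates this idea; the search window is a hypothetical choice and the actual synchronization procedure may differ.

```python
# Minimal sketch of locating a clap as the sample with the largest acceleration
# magnitude within a search window (a simplification of the synchronization step;
# the window bounds are hypothetical).
import numpy as np

def find_clap(acc, start_idx, end_idx):
    """acc: (n_samples, 3) raw accelerometer data in g. Returns the sample index."""
    magnitude = np.linalg.norm(acc[start_idx:end_idx], axis=1)
    return start_idx + int(np.argmax(magnitude))

# Example: the clap before the meal is expected within the first 60 s at 64 Hz.
# clap_idx = find_clap(acc, 0, 60 * 64)
```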
4) SCALES
In addition to detecting individual intake gestures, there is
also a growing body of research on determining the amounts
of food consumed based on continuous weight measurement
using scales [8], [22]. By detecting changes in the amounts of
food on a plate, the scale data can complement other modali-
ties in detecting intake and/or serving gestures as well as eval-
uating the amounts of food consumed from specific plates.
The shared plate setting in OREBA-SHA included four scales
that measured the weight of the two rice dishes at two corners
of the table as well as the wet dish and the vegetable dish in
the centre of the table (see Figure 2). These scales recorded
the weight of the four dishes in grams at a sampling rate
FIGURE 4. Example of a labeled intake gesture with video, accelerometer, and gyroscope sensor data. For easier display, the video framerate has
been reduced.
TABLE 2. The labeling scheme.
of 1 Hz. The scale recordings were time-synchronized by
means of a time-lapse camera and a 200g calibration weight.
At the start of the recording, a research assistant removed the
calibration weight from the scale. This was captured by the
scale recordings as well as the time-lapse camera. Further,
the time-lapse camera also captured the clapping at the start of
the recording. Based on this, each scale recording includes a
reference in seconds to the clapping at the start of a recording.
Further, the dataset provides a mapping of each participant
number to the closest rice dish.
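As an illustration of how the 1 Hz scale recordings could be used, the sketch below flags candidate serving events as drops in the recorded weight. The column names and the 5 g threshold are assumptions, not part of the dataset specification.

```python
# Minimal sketch of flagging candidate serving events as drops in the 1 Hz scale
# series. The 5 g threshold and the column names are assumptions.
import pandas as pd

def detect_weight_drops(scale_csv, threshold_g=5.0):
    df = pd.read_csv(scale_csv)                  # assumed columns 'seconds', 'weight'
    drops = df["weight"].diff()                  # change between consecutive seconds
    events = df.loc[drops < -threshold_g, "seconds"]
    return events.tolist()                       # seconds relative to the start clap
```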
D. ANNOTATION
All relevant gestures were labeled and cross-checked by two
independent annotators using ChronoViz 8. Each gesture
includes a start and an end timestamp:
The start timestamp is the point where the final uninter-
rupted movement to execute the gesture starts;
the end timestamp is the point when the participant has
finished returning their hand(s) from the movement or
started a different gesture.
Additionally, each gesture is assigned four labels accord-
ing to our labeling scheme as listed in Table 2. Besides
the Main identification as an Intake or Serve gesture,
this scheme also allows a Sub category to be specified for
each gesture, as well as the Hand and Utensil used
(e.g., <id>_annotations.csv).
8See http://chronoviz.com
The scheme is designed to be extendable with more cate-
gories in future extensions of the dataset. The discrete dish
scenario OREBA-DIS includes only Intake labels, whereas
the shared dish scenario OREBA-SHA includes both Intake
and Serve labels. Figure 4 depicts an example of a
labeled intake gesture and associated IMU sensor data.
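For orientation, a minimal sketch of reading an annotation file and filtering intake gestures is given below; the column names are assumed to mirror the labeling scheme in Table 2 and should be checked against the released files.

```python
# Minimal sketch of reading an annotation file and keeping intake gestures only.
# The file name and column names are assumptions based on Table 2.
import pandas as pd

ann = pd.read_csv("1001_annotations.csv")            # hypothetical participant id
intake = ann[ann["Main"] == "Intake"]                # drop Serve gestures
print(len(intake), intake["Utensil"].value_counts())
```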
E. SPLITS
For machine learning problems with time-intensive training
and evaluation, the best practice is to train, validate, and test
using three separate sets of data [23]. Models are trained
with the training set, hyperparameters are tuned using the
validation set, and reported results are based on the test set.
We choose a split of approximately 3:1:1, such that each
participant only appears in one of the three subsets; this is to
ensure that we are measuring the model’s ability to generalize
and avoid data leakage. The recommended split is included in
the dataset download. Table 3 summarizes high-level statistics
on these splits.
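The sketch below illustrates the principle of a participant-level split, i.e., assigning whole participants rather than individual samples to the subsets. For comparability with published results, the split shipped with the dataset should be used; this is only an illustration.

```python
# Minimal sketch of a participant-level ~3:1:1 split, ensuring no participant
# appears in more than one subset. The seed and ratios are illustrative only.
import random

def split_participants(ids, seed=42):
    ids = sorted(ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```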
F. DEMOGRAPHICS
Out of 180 participants in total, 161 agreed to complete a
demographics questionnaire. Across the dataset, 67% iden-
tified as male and 33% as female. The median age is 24,
with the minimum and maximum age being 18 and 54 years
respectively. Reported ethnicities in the dataset include
White Australian (52.2%), White other European (9.9%),
Chinese (8.7%), Other Asian (8.7%), Persian (5.6%), Arabic
(3.1%), White British (3.1%), African (2.5%), and South
East Asian (1.8%). About 10% reported being left-, and 90%
right-handed.
G. AVAILABILITY
The OREBA dataset is available on request to research groups
at academic institutions. Please visit http://www.newcastle.edu.au/oreba to
download the data sharing agreement and get access.
TABLE 3. Summary statistics for our dataset and the training/validation/test split.
IV. BASELINE FOR INTAKE GESTURE DETECTION
Intake gesture detection refers to the task of detecting the
times of individual intake gestures from sensor data. Sim-
ilar to dataset papers in other areas [11], [12], we pro-
vide baseline results for this task on OREBA-DIS and
OREBA-SHA. We apply the two-stage approach proposed by
Kyritsis et al. [9] to estimate frame-level intake probabilities
and detect intake gestures. For this purpose, we train separate
baseline models on inertial and video sensor data, introduced
in Section IV-A. To ensure comparability with future stud-
ies, we use the publicly available data splits introduced in
Section III-E for training, validation, and test; we addition-
ally report details on training and used evaluation metric in
Section IV-B.
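As an illustration of the second stage of this approach, the sketch below converts frame-level intake probabilities into sparse detections by thresholding and keeping local maxima separated by a minimum distance. The threshold and minimum distance are hypothetical hyperparameters, not the values used in the baseline studies.

```python
# Minimal sketch of the second stage: turning frame-level intake probabilities into
# sparse detections via thresholding and local maxima with a minimum separation.
import numpy as np
from scipy.signal import find_peaks

def detect_events(probs, threshold=0.9, min_dist_frames=128):
    """probs: (n_frames,) frame-level intake probabilities from stage 1."""
    peaks, _ = find_peaks(probs, height=threshold, distance=min_dist_frames)
    return peaks  # frame indices of detected intake gestures
```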
A. BASELINE MODELS
For each modality, we use one simple CNN and one more
complex model proposed in previous studies [20], [24].
As listed in Table 4, this results in a total of eight baseline
models, considering the different scenarios, modalities, and
models.
1) INERTIAL
The inertial models are taken from a recent study on OREBA
by Heydarian et al. [20]. We compare the simple CNN with
the more complex CNN-LSTM proposed in the aforemen-
tioned work. The simple CNN consists of seven CNN layers
and one fully-connected layer, with one max pooling layer
following each CNN layer. The CNN-LSTM consists of four
CNN layers with 128 kernels each, two LSTM layers with
64 units each, and one fully-connected layer. Full details on
these models are available in the Supplemental Material S1.
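For readers who want a starting point, the following PyTorch sketch outlines a CNN-LSTM in the spirit of the description above (four convolutional layers with 128 kernels, two LSTM layers with 64 units, and a fully-connected layer). Kernel sizes, pooling, and the input layout are assumptions; the actual architecture is specified in the Supplemental Material.

```python
# Minimal PyTorch sketch in the spirit of the inertial CNN-LSTM baseline.
# Kernel sizes and the input layout (12 channels = 2 hands x 2 modalities x 3 axes)
# are assumptions, not the published configuration.
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    def __init__(self, in_channels=12, n_classes=2):
        super().__init__()
        convs = []
        ch = in_channels
        for _ in range(4):
            convs += [nn.Conv1d(ch, 128, kernel_size=5, padding=2), nn.ReLU()]
            ch = 128
        self.cnn = nn.Sequential(*convs)
        self.lstm = nn.LSTM(input_size=128, hidden_size=64, num_layers=2, batch_first=True)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):                 # x: (batch, time, channels)
        x = self.cnn(x.transpose(1, 2))   # convolve over time: (batch, 128, time)
        x, _ = self.lstm(x.transpose(1, 2))
        return self.fc(x)                 # frame-level logits: (batch, time, n_classes)

logits = CNNLSTM()(torch.randn(4, 512, 12))   # e.g., 8 s windows at 64 Hz
```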
2) VIDEO
The video models are taken from a recent study on OREBA
by Rouast and Adam [24]. For our comparison we use the
simple CNN and the more complex ResNet-50 SlowFast
proposed by Rouast et al. While the simple CNN only uses
one frame at a time, the ResNet-50 SlowFast model uses
16 frames. The ResNet-50 SlowFast model consists of two
50-layer 3D CNNs which are fused with lateral connections
and spatially aligned 2D conv fusion. Full details on these
models are available in the Supplemental Material S2.
B. TRAINING AND EVALUATION METRICS
1) TRAINING
We train a total of eight baseline models. All baseline models
are trained using the Adam optimizer with an exponentially
decaying learning rate on the respective training dataset.
We use batch size 256 for inertial data, and batch sizes 8
(ResNet-50 SlowFast) / 64 (CNN) for video data. Model
selection is done using the validation set.
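A minimal sketch of the described training setup (Adam with an exponentially decaying learning rate) is given below; the initial learning rate, decay factor, and number of epochs are assumptions.

```python
# Minimal sketch of the training setup: Adam with an exponentially decaying
# learning rate. Hyperparameter values here are assumptions.
import torch

model = CNNLSTM()                                  # from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(50):
    # ... iterate over mini-batches (e.g., size 256 for inertial data), compute the
    # loss, then call loss.backward(); optimizer.step(); optimizer.zero_grad() ...
    scheduler.step()                               # decay the learning rate each epoch
```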
FIGURE 5. The evaluation scheme (proposed by [9]; figure from [24]
extended here). (1) A true positive is the first detection within each
ground truth event; (2) False positives of type 1 are further detections
within the same ground truth event; (3) False positives of type 2 are
detections outside ground truth events; (4) False positives of type 3 are
detections made for the wrong class; (5) False negatives are non-detected
ground truth events.
2) EVALUATION METRICS
We extend the evaluation scheme proposed by
Kyritsis et al. [9] as depicted in Figure 5. The scheme uses
the ground truth to translate sparse detections into measurable
metrics for a given label category. As Rouast and Adam [24]
report, one correct detection per ground truth event counts
as a true positive (TP), while further detections within the
same ground truth event are false positives of type 1 (FP1).
Detections outside ground truth events are false positives of
type 2 (FP2) and non-detected ground truth events count as
false negatives (FN). The scheme has been extended here to
support the multi-class case, where detections for a wrong
class are false positives of type 3. Based on the aggregate
counts, precision = TP / (TP + FP1 + FP2 + FP3), recall = TP / (TP + FN),
and the F1 score = 2 * (precision * recall) / (precision + recall) can be
calculated.
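The evaluation scheme can be implemented directly from these definitions. The sketch below covers the single-class case (FP3 omitted); detections are frame indices and ground truth events are (start, end) frame intervals.

```python
# Minimal sketch of the single-class evaluation scheme: detections (frame indices)
# are matched against ground truth events given as (start, end) frame intervals.
def evaluate(detections, gt_events):
    tp = fp1 = fp2 = 0
    matched = [False] * len(gt_events)
    for d in detections:
        hit = None
        for i, (start, end) in enumerate(gt_events):
            if start <= d <= end:
                hit = i
                break
        if hit is None:
            fp2 += 1                      # detection outside any ground truth event
        elif matched[hit]:
            fp1 += 1                      # further detection within the same event
        else:
            matched[hit] = True
            tp += 1                       # first detection within this event
    fn = matched.count(False)             # non-detected ground truth events
    precision = tp / (tp + fp1 + fp2) if tp + fp1 + fp2 else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```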
C. BASELINE RESULTS
Table 4 reports the test set results for the aforementioned
models on both OREBA-DIS and OREBA-SHA.
1) INERTIAL
Heydarian et al. [20] ran multiple experiments benchmarking
different deep learning models and pre-processing pipelines.
The top model performance was achieved by a CNN-LSTM
with earliest fusion through a dedicated CNN layer and
TABLE 4. Baseline test set results for intake gesture detection. On OREBA-DIS, the video model performs better than the inertial model, while the
opposite is true on OREBA-SHA. This indicates that the OREBA-DIS test set is more challenging when using inertial data, while the OREBA-SHA test set is
more challenging when using video data.
target matching. Concerning preprocessing, their results
show that applying a consecutive combination of mirroring,
removing the gravity effect, and standardization was bene-
ficial for model performance, while smoothing had adverse
effects.
From the results in Table 4, it appears that the test set for
OREBA-DIS (F1=0.778) is more challenging for inertial
data than the test set for OREBA-SHA (F1=0.852). Com-
paring the simple CNN with the more advanced CNN-LSTM
approach, we find that the more advanced CNN-LSTM adds
relative improvements of 2.6% (OREBA-DIS) and 1.5%
(OREBA-SHA) over the simple CNN.
2) VIDEO
Rouast and Adam [24] applied several deep learning archi-
tectures established in the literature on video action recog-
nition on the task of detecting intake gestures directly from
the video data in OREBA-DIS. The best test set result was
achieved using a SlowFast [25] network with ResNet-50 [26]
as backbone. Further conclusions from the experiments are
that appearance features are more useful than motion features,
and that temporal context in the form of multiple video frames is
essential for top model performance.
The results in Table 4 indicate that the test set for
OREBA-SHA (F1=0.808) is more challenging when
working with video data than the test set for OREBA-DIS
(F1=0.853). Comparing results between the models,
we find that the more advanced ResNet-50 SlowFast adds
relative improvements of 19.6% (OREBA-DIS) and 16.1%
(OREBA-SHA) over the simple CNN.
V. DISCUSSION
In this paper, we have introduced the OREBA dataset,
which provides a comprehensive multi-sensor recording with
labeled gestures of communal intake occasions from dis-
crete and shared meals. Building on a summary of related
work on automatic dietary monitoring and an overview of
existing public datasets in the field, we provided details
on the data collection, sensor processing, and annota-
tion methods employed in the creation of the OREBA
dataset. Additionally, we reported baseline results on the
task of intake gesture detection based on video and inertial
sensor data.
Sensor-based, passive methods of dietary monitoring have
the potential of complementing existing active methods such
as food records and 24-hr recall. As seen in other fields
such as object recognition [11] and action recognition [10],
progress in the research of machine learning methods is
tightly linked to the availability and ongoing development of
datasets with labeled examples. In this light, we hope that the
OREBA dataset will be able to support future developments
in automatic dietary monitoring. Compared to existing public
datasets of labeled intake gestures, the OREBA dataset is
unique as it is (i) multimodal with synchronized frontal video
data based on spherical video recordings, and inertial data
from both hands at 64 Hz, and (ii) includes a total of
202 recordings in two different communal eating scenarios.
As the first intake gesture detection dataset that also makes
the video recordings available, OREBA enables researchers
to independently verify, extend, and update the provided
data annotations. This further increases the transparency and
reliability of the dataset, and the machine learning models
building on it.
While we have reported results from our initial studies on
detecting intake gestures with either video or inertial sensor
data, there are also several other directions that research on
this dataset could go into. For instance, the OREBA dataset
can be used to compare and contrast CNN-LSTM [9], [24]
and CNN-GRU [17] models. Improvements could also be
made by using transfer learning between the two different
scenarios, making it possible to contrast the difficulties of monitoring
discrete versus shared dishes. Thanks to the availability of
inertial data for both hands, studies could also explore how
much information is retained on the dominant versus the
non-dominant hand, which has implications for automatic
dietary monitoring using commercial smartwatches. Further,
while our initial studies focused on detecting individual
intake gestures, future studies could explore the other label
categories – for example, how well the video or inertial
modalities perform at distinguishing between different uten-
sils. Finally, sensor fusion of video and IMU is a further
possibility that could be explored in the future. As such, and
beyond researchers specifically interested in dietary monitor-
ing, the OREBA dataset could also be a valuable resource for
researchers interested in advancing machine learning models
for sensor fusion more broadly.
VI. CONCLUSION
Publicly available datasets are an important resource to
fuel advances in machine learning [11], [12]. In this paper,
we have introduced a comprehensive multi-sensor recording
dataset with labeled gestures of communal intake occasions
from discrete and shared meals. To the best of our knowledge,
this is the first dataset for intake gesture detection that pro-
vides synchronized data in the form of both frontal video and
inertial sensor data from both hands. By making this dataset
publicly available to the research community, OREBA has the
potential to advance research and foster innovation in this area
as it allows researchers to objectively benchmark existing
and emerging machine learning approaches, and reduces the
burden for researchers to collect and annotate their own data.
ACKNOWLEDGMENT
The authors thank Clare Cummings, Grace Manning, Alice Melton, Kaylee Slater,
Felicity Steel, and Sam Stewart for their untiring support in collecting and
annotating the data.
REFERENCES
[1] S. W. Lichtman, K. Pisarska, E. R. Berman, M. Pestone, E. Offenbacher,
H. Weisel, S. Heshka, D. E. Matthews, H. Dowling, and S. B. Heymsfield,
‘‘Discrepancy between self-reported and actual caloric intake and exercise
in obese subjects,’New England J. Med., vol. 327, no. 27, pp. 1893–1898,
Dec. 1992.
[2] P. V. Rouast, M. T. P. Adam, T. Burrows, R. Chiong, and M. E. Rollo,
‘‘Using deep learning and 360 video to detect eating behavior for user
assistance systems,’’ in Proc. Eur. Conf. Inf. Syst., 2018, pp. 1–11.
[3] K. Kyritsis, C. Diou, and A. Delopoulos, ‘‘Food intake detection from
inertial sensors using LSTM networks,’’ in Proc. Int. Conf. Image Anal.
Process., 2017, pp. 411–418.
[4] J. Qiu, F. P.-W. Lo, and B. Lo, ‘‘Assessing individual dietary intake in food
sharing scenarios with a 360 camera and deep learning,’’ in Proc. IEEE
16th Int. Conf. Wearable Implant. Body Sensor Netw. (BSN), May 2019,
pp. 1–4.
[5] D. Konstantinidis, K. Dimitropoulos, B. Langlet, P. Daras, and
I. Ioakimidis, ‘‘Validation of a deep learning system for the full
automation of bite and meal duration analysis of experimental meal
videos,’Nutrients, vol. 12, no. 209, pp. 1–16, 2020.
[6] T. Burrows, C. Collins, M. T. P. Adam, K. Duncanson, and M. Rollo,
‘‘Dietary assessment of shared plate eating: A missing link,’’ Nutrients,
vol. 11, no. 4, pp. 1–14, 2019.
[7] C. Merck, C. Maher, M. Mirtchouk, M. Zheng, Y. Huang, and S. Kleinberg,
‘‘Multimodality sensing for eating recognition,’’ in Proc. 10th EAI Int.
Conf. Pervasive Comput. Technol. Healthcare, 2016, pp. 130–137.
[8] Y. Shen, J. Salley, E. Muth, and A. Hoover, ‘‘Assessing the accuracy of
a wrist motion tracking method for counting bites across demographic
and food variables,’IEEE J. Biomed. Health Inform., vol. 21, no. 3,
pp. 599–606, May 2017.
[9] K. Kyritsis, C. Diou, and A. Delopoulos, ‘‘Modeling wrist micromove-
ments to measure in-meal eating behavior from inertial sensor data,’IEEE
J. Biomed. Health Inform., vol. 23, no. 6, pp. 2325–2334, Nov. 2019.
[10] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier,
S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman,
and A. Zisserman, ‘‘The kinetics human action video dataset,’’ 2017,
arXiv:1705.06950. [Online]. Available: http://arxiv.org/abs/1705.06950
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ‘‘ImageNet:
A large-scale hierarchical image database,’’ in Proc. IEEE Conf. Comput.
Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
[12] A. Mollahosseini, B. Hasani, and M. H. Mahoor, ‘‘Affectnet: A database
for facial expression, valence, and arousal computing in the wild,’’ IEEE
Trans. Affect. Comput., vol. 10, no. 1, pp. 18–31, Jan./Mar. 2019.
[13] H. Heydarian, M. Adam, T. Burrows, C. Collins, and M. E. Rollo, ‘‘Assess-
ing eating behaviour using upper limb mounted motion sensors: A system-
atic review,’Nutrients, vol. 11, no. 1168, pp. 1–25, 2019.
[14] S. Zhang, D. T. Nguyen, G. Zhang, R. Xu, N. Maglaveras, and
N. Alshurafa, ‘‘Estimating caloric intake in bedridden hospital patients
with audio and neck-worn sensors,’’ in Proc. IEEE/ACM Int. Conf. Con-
nected Health, Appl., Syst. Eng. Technol., Sep. 2018, pp. 1–2.
[15] S. Zhang, W. Stogin, and N. Alshurafa, ‘‘I sense overeating: Motif-based
machine learning framework to detect overeating using wrist-worn sens-
ing,’Inf. Fusion, vol. 41, pp. 37–47, May 2018.
[16] K. Kyritsis, C. Diou, and A. Delopoulos, ‘‘A data driven end-to-end
approach for in-the-wild monitoring of eating behavior using smart-
watches,’IEEE J. Biomed. Health Inform., early access, Apr. 3, 2020,
doi: 10.1109/JBHI.2020.2984907.
[17] H. Zhu, H. Chen, and R. Brown, ‘‘A sequence-to-sequence model-based
deep learning approach for recognizing activity of daily living for senior
care,’J. Biomed. Informat., vol. 84, pp. 148–158, Aug. 2018.
[18] P. V. Rouast, M. Adam, and R. Chiong, ‘‘Deep learning for human affect
recognition: Insights and new developments,’’ IEEE Trans. Affect. Com-
put., early access, Jan. 1, 2019, doi: 10.1109/TAFFC.2018.2890471.
[19] S. Hantke, F. Weninger, R. Kurle, F. Ringeval, A. Batliner, A. E.-D. Mousa,
and B. Schuller, ‘‘I hear you eat and speak: Automatic recognition of eating
condition and food type, use-cases, and impact on ASR performance,’
PLoS ONE, vol. 11, no. 5, pp. 1–24, 2016.
[20] H. Heydarian, P. V. Rouast, M. T. P. Adam, T. Burrows, and M. E. Rollo,
‘‘Deep learning for intake gesture detection from wrist-worn inertial sen-
sors: The effects of data preprocessing, sensor modalities, and sensor
positions,’IEEE Access, vol. 8, pp. 164936–164949, 2020.
[21] S. Madgwick, ‘‘An efficient orientation filter for inertial and iner-
tial/magnetic sensor arrays,’’ Univ. Bristol, Bristol, U.K., Tech. Rep. 25,
2010.
[22] V. Papapanagiotou, C. Diou, I. Ioakimidis, P. Sodersten, and
A. Delopoulos, ‘‘Automatic analysis of food intake and meal
microstructure based on continuous weight measurements,’IEEE
J. Biomed. Health Informat., vol. 23, no. 2, pp. 893–902, Mar. 2019.
[23] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge,
MA, USA: MIT Press, 2016.
[24] P. V. Rouast and M. T. P. Adam, ‘‘Learning deep representations for video-
based intake gesture detection,’IEEE J. Biomed. Health Inform., vol. 24,
no. 6, pp. 1727–1737, Jun. 2020.
[25] C. Feichtenhofer, H. Fan, J. Malik, and K. He, ‘‘SlowFast networks
for video recognition,’’ 2018, arXiv:1812.03982. [Online]. Available:
http://arxiv.org/abs/1812.03982
[26] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image
recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR),
Jun. 2016, pp. 770–778.
PHILIPP V. ROUAST (Member, IEEE) received
the B.Sc. and M.Sc. degrees in industrial engi-
neering from the Karlsruhe Institute of Technol-
ogy, Germany, in 2013 and 2016, respectively.
He is currently pursuing the Ph.D. degree in infor-
mation systems with The University of Newcas-
tle, Australia. He is also a Graduate Research
Assistant with The University of Newcastle. His
research interests include deep learning, affec-
tive computing, HCI, and related applications of
computer vision.
HAMID HEYDARIAN received the B.Sc.
degree in computer engineering (software) from
Kharazmi University, Iran, in 2002. He is cur-
rently pursuing the Ph.D. degree in information
technology with The University of Newcastle,
Australia. He is a Senior Software Developer. He is
also a Casual Academic with The University of
Newcastle. His research interests include iner-
tial signal processing using deep learning and its
related applications in dietary intake assessment
and passive dietary monitoring.
MARC T. P. ADAM received the undergradu-
ate degree in computer science from the Uni-
versity of Applied Sciences Würzburg, Germany,
and the Ph.D. degree in information systems from
the Karlsruhe Institute of Technology, Germany.
He is currently an Associate Professor of com-
puting and information technology with The
University of Newcastle, Australia. His research
interests include users’ cognition and affect in
human–computer interaction. He is a Founding
Member of the Society for NeuroIS.
MEGAN E. ROLLO received the B.App.Sci.,
B.Hlth.Sci. (Nutr&Diet), and Ph.D. degrees
from the Queensland University of Technology,
Australia. She is currently a Research Fellow in
nutrition and dietetics with the Priority Research
Centre for Physical Activity and Nutrition, School
of Health Sciences, The University of New-
castle, Australia. Her research interests include
technology-assisted dietary assessment and per-
sonalized behavioral nutrition interventions.