OREBA: A Dataset for Objectively
Recognizing Eating Behaviour and
Associated Intake
PHILIPP V. ROUAST1, (Student Member, IEEE), HAMID HEYDARIAN1, MARC T. P. ADAM1,3,
and MEGAN E. ROLLO2,3
1School of Electrical Engineering and Computing, The University of Newcastle, Callaghan, NSW 2308, Australia
2School of Health Sciences, The University of Newcastle, Callaghan, NSW 2308, Australia
3Priority Research Centre for Physical Activity and Nutrition, The University of Newcastle, Callaghan, NSW 2308, Australia
Corresponding author: Marc T. P. Adam (e-mail: marc.adam@newcastle.edu.au).
We gratefully acknowledge the support by the Bill & Melinda Gates Foundation [OPP1171389]. Philipp Rouast and Hamid Heydarian
were supported by an Australian Government Research Training (RTP) Scholarship.
ABSTRACT Automatic detection of intake gestures is a key element of automatic dietary monitoring.
Several types of sensors, including inertial measurement units (IMU) and video cameras, have been used for
this purpose. The common machine learning approaches make use of the labeled sensor data to automatically
learn how to make detections. One characteristic, especially for deep learning models, is the need for large
datasets. To meet this need, we collected the Objectively Recognizing Eating Behavior and Associated
Intake (OREBA) dataset. The OREBA dataset aims to provide comprehensive multi-sensor data recorded
during the course of communal meals for researchers interested in intake gesture detection. Two scenarios
are included, with 100 participants for a discrete dish and 102 participants for a shared dish, totalling 9069
intake gestures. Available sensor data consists of synchronized frontal video and IMU with accelerometer
and gyroscope for both hands. We report the details of data collection and annotation, as well as details of
sensor processing. The results of studies on IMU and video data involving deep learning models are reported
to provide a baseline for future research. Specifically, the best baseline models achieve performances of
F1 = 0.853 for the discrete dish using video and F1 = 0.852 for the shared dish using inertial data.
INDEX TERMS Dietary monitoring, eating behaviour assessment, accelerometer, communal eating,
gyroscope, 360-degree video camera
I. INTRODUCTION
Traditional dietary assessment methods are reliant
on self-report data. While data captured with active
methods such as self-report and 24-hr recall are widely used
in practice, they are not without limitations (e.g., human
error, time-consuming manual process) [1]. Automatic di-
etary monitoring, where data is collected and processed in-
dependent of the individual, has the potential to complement
data from traditional methods and reduce associated biases
[2]. In addition, such systems have the potential to support
personal self-monitoring solutions by providing individuals
with targeted eating behaviour recommendations.
A key element of automatic dietary monitoring is the
detection of intake gestures (i.e., the process of moving food
or drink towards the mouth). Recent research on this task
focuses mainly on machine learning approaches which are
characterized by a need for large amounts of labeled data.
This is especially true in conjunction with deep learning,
which has been applied in this context since 2017 [3]. How-
ever, collecting, synchronizing, and labeling data of eating
occasions is a work-intensive process. Hence, there is a need
for more public datasets to reduce barriers for researchers
to create new machine learning models, and to objectively
compare the performance of existing approaches [4], [5].
At the same time, current research on dietary monitoring
identified a gap in research on shared plate eating [6]. Com-
munal eating (i.e., eating occasions involving more than one
person) is not yet well understood, let alone the impact it has
on accuracy of automatic dietary monitoring. Hence, existing
research on capturing dietary intake from discrete dishes
needs to be complemented and contrasted with research on
the detection of intake from shared dishes.
In order to address these gaps, the present paper introduces
the Objectively Recognizing Eating Behavior and Associated
TABLE 1. Public datasets of intake gestures with synchronized sensor data and annotations available.

Dataset      Video            Audio   IMU                    Scale           Ground truth     Participants (Recordings)a  Intake events  Annotations
ACE [7]      -                Earbud  Both hands, 15 Hz      -               Video, multib    7 (13)                      1492           Chews, swallows with type and amount of food and drink.
Clemson [8]  -                -       Dominant hand, 15 Hz   Tray, 15 Hz     Video, ceilingb  264 (488)                   20644          Intake gestures, utensiling. Intake annotated with hand, utensil, container, and food.
FIC [9]      -                -       Dominant hand, 100 Hz  -               Video, frontalb  12 (21)                     1332           Plain intake gestures; further annotation of micromovements.
OREBA-DIS    Frontal, 24 fps  -       Both hands, 64 Hz      -               Video, frontal   100 (100)                   4790           Intake gestures annotated with eat/drink, hand and utensil.
OREBA-SHA    Frontal, 30 fps  -       Both hands, 64 Hz      Communal, 1 Hz  Video, multi     102 (102)                   4279           Intake gestures annotated with eat/drink, hand and utensil.

a One recording equals one person consuming one meal.
b Used for ground truth annotation, but not available for download as of early 2020.
Intake (OREBA) dataset. The goal of OREBA is to facili-
tate the automatic detection of intake gestures in communal
eating across two scenarios (discrete dish and shared dish).
By creating this dataset and making it available to the wider
research community, this paper makes four key contributions:
1) Large-scale dataset: We conducted a total of 202
meal recordings, with 180 unique individuals partic-
ipating who consented to their data being used by
other research institutions. In total, we captured 9069
intake gestures from discrete and shared dishes. Two
independent annotators labeled and cross-checked the
intake gestures.
2) Public availability: Progress in the research of ma-
chine learning methods is tightly linked to the public
availability of labeled datasets (e.g., [10], [11], [12]). By
making the dataset publicly available to researchers, OREBA
can be used to objectively benchmark existing and emerging
machine learning approaches. At the same time, it reduces
the time-consuming burden for researchers to collect,
annotate, and cross-check their own data.
3) Communal eating: While existing research has pro-
vided important insights into automatically detecting
human intake gestures in individual settings, research
on communal eating is scant [13] [6]. To the best of our
knowledge, this is the first dataset capturing communal
eating from both discrete and shared dishes.
4) Multiple modalities: The dataset includes synchro-
nized frontal video and inertial measurement unit
(IMU) sensor data from both hands, along with labels
for each intake gesture. A single spherical camera
positioned in the center of the table made it possible to
capture the entire communal eating scene of up to four
participants, offering a full view of all relevant ges-
tures. While existing inertial datasets on intake gesture
detection often use video as ground truth, none of the
existing datasets currently include video data as part of
the synchronized sensor data for analysis.
In the following, Section II gives an overview of related
work and existing datasets, Section III introduces the data
collection and annotation process of the OREBA dataset in
detail, and Section IV provides results from our initial studies
as baselines. Finally, we provide a discussion in Section V and
conclusions in Section VI.
II. RELATED WORK
A. AUTOMATIC DIETARY MONITORING
Automatic dietary monitoring encompasses three major
goals: (i) detecting the timing of intake events, (ii) recog-
nizing the type of food or drink, and (iii) estimating the
weight consumed. Detection of intake behavior, which is
associated with intake gestures, chews, and swallows, can be
considered as part of the first goal. Researchers have
leveraged various sensor types for this purpose. While
chews and swallows can be detected using audio signals
[14], intake gestures are typically handled using an IMU
including accelerometer and gyroscope sensors [15] [13].
Before the application of deep learning architectures, the
traditional approach in this field reduced the dimensionality
of the raw sensor data by extracting handcrafted features
based on expert knowledge. Deep learning methods have
been explored to detect individual intake gestures with iner-
tial sensor data since 2017 [3] and with video data since 2018
[2], [4], [5], whereby large amounts of labeled examples are
leveraged to let algorithms learn the features automatically.
The most widely used approach in this space builds on convolutional
neural networks (CNN) and long short-term memory
(LSTM) models [16]; however, gated recurrent unit (GRU)
models have also been applied, especially in the context of
activity recognition in daily living [17].
B. EXISTING DATASETS
To date, most published studies on recognition of intake
behaviour rely on dedicated, private datasets collected for
a specific purpose. Considering the shift towards adoption
of deep learning techniques, we expect an increasing need
for large, public datasets that existing and emerging machine
learning approaches can objectively be benchmarked on.
Similar developments can be observed across several related
fields such as action recognition [10], affect recognition [18]
[12], and object recognition [11].
Table 1 provides an overview of publicly available datasets
on intake gestures which feature synchronized sensor data of
eating occasions with labels for individual gestures (intake
gestures or other eating related gestures)1.

FIGURE 1. The spherical video is remapped to equirectangular representation, cropped, and reshaped to square shape.

FIGURE 2. Study setup for OREBA-SHA. One camera in the center of the
table, IMU on each wrist, and four scales.
The accelerometer and audio-based calorie estimation
(ACE) dataset2[7] contains seven participants with au-
dio and IMU data for both hands and the head. Annota-
tions of type and amount of food and drink are available
for chews and swallows.
The Clemson Cafeteria dataset3[8] contains 264 par-
ticipants and 488 recordings. IMU data is available at
15 Hz for the dominant hand, along with scale measure-
ments for the tray. Each intake gesture is annotated with
hand, utensil, container, and food.
The Food Intake Cycle (FIC) dataset4[9], which con-
sists of 12 participants and 21 recordings, includes
IMU data for the dominant hand. The focus is on the
micromovements during intake gestures.
While video is commonly used as ground truth, none of
the existing datasets currently include video data as part of
the synchronized sensor data for analysis. In terms of IMU
data and quantity of recorded intake events, we find that the
1A related dataset is iHEARu-EAT [19], however we did not include it
here since it does not focus on intake events.
2See http://www.skleinberg.org/data.html
3See http://cecas.clemson.edu/~ahoover/cafeteria/. Recordings with miss-
ing annotations are excluded here.
4See https://mug.ee.auth.gr/intake-cycle-detection/
existing datasets are restricted either to data from only one
hand, a relatively low recording frequency (15 Hz), or few
participants. We aim to further the field by establishing the
OREBA dataset, which includes video and IMU from both
hands, at a quantity of intake events sufficient to train deep
learning models for both video and inertial modalities.
III. THE OREBA DATASET
The OREBA dataset aims to provide a comprehensive multi-
sensor recording of communal intake occasions for re-
searchers interested in automatic detection of intake gestures
and other behaviours associated with intake (e.g., serving
food onto a plate). Available sensor data consists of synchro-
nized frontal video and accelerometer and gyroscope for both
hands in two different scenarios (i.e., discrete dish and shared
dish). IRB approval was given (H-2017-0208), and the data
was recorded between Mar 2018 and Oct 2019.
A. SCENARIOS
The OREBA dataset consists of two separate communal
eating scenarios. In each scenario, groups of up to four
participants were simultaneously recorded consuming a meal
at a communal table:
1) OREBA-DIS: In the first scenario, foods were served
in discrete portions to each participant. The meal
consisted of lasagna (choice between vegetarian and
beef), bread, and yogurt. Additionally, there was water
available to drink, and butter to spread on the bread.
The study setup for OREBA-DIS is shown in Fig. 1.
2) OREBA-SHA: In the second scenario, participants
consumed a communal dish of vegetable korma or
butter chicken with rice and mixed vegetables. Addi-
tionally, there was water available to drink. The study
setup for OREBA-SHA is shown in Fig. 2.
Lasagna and rice-based dishes were chosen since they are
amongst the most common dishes in similar studies [13].
Within each scenario, all participants are unique; however,
22 participants took part in both scenarios.
B. SENSORS
For each group, video was recorded using a spherical camera
placed in the center of the shared table (360fly-4K5). This
allowed video recording to occur in a simultaneous and un-
obtrusive way for all participants engaging in the communal
eating occasion around the table. The sampling rates are 24
fps for OREBA-DIS, and 30 fps for OREBA-SHA. Each
participant wore two IMUs, one on each wrist (Movisens
Move 3+ 6). The IMU included an accelerometer and a
gyroscope with a sampling rate of 64 Hz. For OREBA-SHA,
four scales additionally recorded the weight of the communal
dishes (two rice dishes, one wet dish, one vegetable dish) at
1 Hz (Adam Equipment CBK 4).
C. SENSOR PROCESSING
1) Video
As shown in Fig. 1, we first mapped the spherical video
from the 360-degree camera to equirectangular represen-
tation7. Then, we separated the equirectangular represen-
tation into individual participant videos by cropping the
areas of interest. We further resized each participant video
to a square shape. The two spatial resolutions 140x140
(e.g., <id>_video_140p.mp4) and 250x250 pixels (e.g.,
<id>_video_250p.mp4) are included. All videos are en-
coded using the H.264 standard and stored in mp4 containers.
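To illustrate the cropping and resizing step, the following sketch shows how square participant regions could be extracted from an equirectangular frame using OpenCV. The crop coordinates and function name are hypothetical; the actual remapping from the spherical recording was performed with the tool referenced above.

```python
import cv2

# Hypothetical crop regions (x, y, width, height) for up to four participants
# in the equirectangular frame; actual coordinates depend on seating positions.
CROP_REGIONS = [(0, 200, 480, 480), (480, 200, 480, 480),
                (960, 200, 480, 480), (1440, 200, 480, 480)]

def extract_participant_crops(equirect_frame, out_size=140):
    """Crop each participant region and resize it to a square frame."""
    crops = []
    for (x, y, w, h) in CROP_REGIONS:
        region = equirect_frame[y:y + h, x:x + w]
        crops.append(cv2.resize(region, (out_size, out_size)))
    return crops
```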
2) Inertial Measurement Unit
Raw accelerometer data is measured in g, while gyroscope
data is measured in deg/s. The OREBA dataset includes
(i) raw sensor data without any processing for left and
right hand (e.g., <id>_inertial_raw.csv), and (ii)
processed sensor data for dominant and non-dominant eating
hand (e.g., <id>_inertial_processed.csv). Raw
data is included since a recent study on OREBA indicates
that data preprocessing only marginally improves results
when combined with deep learning [20]. Processed data is
generated from the raw data according to the following steps:
1: Removal of gravity effect. The raw accelerometer read-
ing is subject to acceleration from participants’ wrist move-
ments as well as the earth’s gravitational field. We remove
this gravity effect by estimating sensor orientation using
sensor fusion with Madgwick’s filter [21], rotation of the ac-
celeration vector with the resulting quaternion, and deduction
of the gravity vector (see [3] for a similar approach).
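As an illustration of this step, here is a minimal numpy sketch that rotates each accelerometer sample into the earth frame using an orientation quaternion (e.g., as estimated by a Madgwick filter) and subtracts the gravity vector. The quaternion convention (w, x, y, z; sensor-to-earth) and function names are assumptions.

```python
import numpy as np

def quat_rotate(q, v):
    """Rotate vector v by unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    u = np.array([x, y, z])
    return 2.0 * np.dot(u, v) * u + (w * w - np.dot(u, u)) * v + 2.0 * w * np.cross(u, v)

def remove_gravity(acc, quats):
    """Subtract gravity from raw accelerometer readings (in g).

    acc:   (N, 3) accelerometer samples in the sensor frame.
    quats: (N, 4) sensor-to-earth orientation estimates per sample.
    Returns linear acceleration in the earth frame, in g.
    """
    gravity = np.array([0.0, 0.0, 1.0])  # 1 g along the earth z axis
    linear = np.empty_like(acc, dtype=float)
    for i, (a, q) in enumerate(zip(acc, quats)):
        linear[i] = quat_rotate(q, np.asarray(a, dtype=float)) - gravity
    return linear
```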
2: Standardization. Each column (i.e. each axis for each
modality and hand) is standardized by subtracting its mean
and dividing by its standard deviation (see [9] for a similar
approach). Processed data can hence be regarded as unitless.
3: Transforming from left and right hand to dominant and
non-dominant hand. To achieve data uniformity, we report
hands in the processed data as dominant and non-dominant.
A similar approach was chosen for the FIC dataset [16].
5See https://www.360fly.com/
6See https://www.movisens.com/en/products/activity-sensor-move-3/
7See https://github.com/prouast/equirectangular-remap
FIGURE 3. The wrist-worn sensors with their internal coordinate frames.
Additionally, the directions of positive rotation are indicated for each axis.
All data reported as dominant hands correspond to right
hands, and non-dominant hands to left hands; for left-handed
participants data for both hands has been transformed to
achieve this. Specifically, we mirrored the data of left-handed
participants to transform the data from the left wrist as if it
has been recorded on the right wrist, and vice versa. Due
to the way the sensors are mounted on the wrist (see Fig.
3), the horizontal direction corresponds to the x axis. For
accelerometer data, mirroring is achieved by flipping the
sign of the x axis, and for gyroscope data by flipping the signs
of the y and z axes. Further, we also flip the signs of the x and
y axes to compensate for the different sensor orientations on
the wrists, yielding transformation (1) for the accelerometer and
(2) for the gyroscope.
[a'_x, a'_y, a'_z] = [-(-a_x), -a_y, a_z] = [a_x, -a_y, a_z]    (1)

[g'_x, g'_y, g'_z] = [-g_x, -(-g_y), -g_z] = [-g_x, g_y, -g_z]    (2)
Note that the mirroring technique proposed here could
also be of use for data augmentation pipelines.
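A minimal numpy sketch of transformations (1) and (2) is given below; the function name is ours, and the inputs are assumed to be arrays with columns ordered (x, y, z).

```python
import numpy as np

def mirror_left_to_right(acc, gyr):
    """Transform left-wrist IMU data as if it had been recorded on the right wrist.

    Combines mirroring with the orientation compensation described above:
    the accelerometer x axis is flipped twice (net unchanged) and y is flipped,
    while the gyroscope x and z axes are flipped and y is left unchanged.
    acc, gyr: (N, 3) arrays with columns (x, y, z).
    """
    acc_out = acc * np.array([1.0, -1.0, 1.0])   # transformation (1)
    gyr_out = gyr * np.array([-1.0, 1.0, -1.0])  # transformation (2)
    return acc_out, gyr_out
```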
3) Synchronization
Ground truth for sensor synchronization was acquired by
asking participants to clap their hands before starting, and
after finishing their meal (see [3] for a similar approach).
The clapping creates a distinct signature in both the video
recording and the accelerometer. All sensors were trimmed
in time and synchronized for each participant based on these
two reference points.
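For illustration, candidate clap events can be located as sharp peaks in the accelerometer magnitude, as in the sketch below; the threshold and minimum peak distance are placeholders, and in practice the detected peaks would be matched against the clap visible in the video.

```python
import numpy as np
from scipy.signal import find_peaks

def find_clap_candidates(acc, fs=64.0, threshold_g=3.0):
    """Return candidate clap times (in seconds) from raw accelerometer data.

    acc: (N, 3) raw accelerometer readings in g; fs: sampling rate in Hz.
    """
    magnitude = np.linalg.norm(acc, axis=1)
    peaks, _ = find_peaks(magnitude, height=threshold_g, distance=int(0.5 * fs))
    return peaks / fs
```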
4) Scales
In addition to detecting individual intake gestures, there is
also a growing body of research on determining the amounts
of food consumed based on continuous weight measurement
using scales [8] [22]. By detecting changes in the amounts
of food on a plate, the scale data can complement other
modalities in detecting intake and/or serving gestures as well
as evaluating the amounts of food consumed from specific
plates. The shared plate setting in OREBA-SHA included
TABLE 2. The labeling scheme.
Category Possible values
Main Intake, Serve
Sub Intake-Eat, Intake-Drink,
Serve-Self, Serve-Other
Hand Left, Right, Both
Utensil Fork, Spoon, Hand, Knife, Finger, Cup, Bottle
four scales that measured the weight of the two rice dishes
at two corners of the table as well as the wet dish and the
vegetable dish in the centre of the table (see Figure 2). These
scales recorded the weight of the four dishes in grams at
a sampling rate of 1 Hz. The scale recordings were time-
synchronized by means of a time-lapse camera and a 200g
calibration weight. At the start of the recording, a research
assistant removed the calibration weight from the scale. This
was captured by the scale recordings as well as the time-
lapse camera. Further, the time-lapse camera also captured
the clapping at the start of the recording. Based on this,
each scale recording includes a reference in seconds to the
clapping at the start of a recording. Further, the dataset
provides a mapping of each participant number to the closest
rice dish.
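As a simple illustration of how the 1 Hz scale streams could be used, the sketch below flags sustained weight drops as candidate serving events; the threshold and the plain sample-to-sample differencing are assumptions, and a robust implementation would also filter out transient spikes caused by utensils touching the dish.

```python
import numpy as np

def detect_weight_drops(weights, min_drop_g=5.0):
    """Flag candidate serving events in a 1 Hz dish weight recording.

    weights: 1-D array of dish weight in grams.
    Returns a list of (second_index, grams_removed) tuples.
    """
    drops = []
    for t, delta in enumerate(np.diff(weights), start=1):
        if delta <= -min_drop_g:
            drops.append((t, float(-delta)))
    return drops
```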
D. ANNOTATION
All relevant gestures were labeled and cross-checked by
two independent annotators using ChronoViz 8. Each gesture
includes a start and an end timestamp:
- The start timestamp is the point where the final uninterrupted
movement to execute the gesture starts;
- the end timestamp is the point when the participant has
finished returning their hand(s) from the movement or
started a different gesture.
Additionally, each gesture is assigned four labels according
to our labeling scheme as listed in Table 2. Besides
the Main identification as an Intake or Serve gesture, the
scheme further specifies a Sub category for each
gesture, as well as the Hand and Utensil used (e.g.,
<id>_annotations.csv).
The scheme is designed to be extendable with more categories
in possible extensions of the dataset. The discrete dish
scenario OREBA-DIS includes Intake labels only, while
the shared dish scenario OREBA-SHA includes both
Intake and Serve labels. Figure 4 depicts an example of a
labeled intake gesture and associated IMU sensor data.
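For example, the annotation files can be loaded with pandas along the lines of the sketch below; the column names used here are assumptions for illustration, and the actual schema is documented in the dataset download.

```python
import pandas as pd

def load_intake_annotations(participant_id):
    """Load one participant's annotations and keep the intake gestures.

    Assumed columns: start, end, main, sub, hand, utensil (hypothetical names).
    """
    annotations = pd.read_csv(f"{participant_id}_annotations.csv")
    return annotations[annotations["main"] == "Intake"]
```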
E. SPLITS
For machine learning problems with time-intensive training
and evaluation, the best practice is to train, validate, and test
using three separate sets of data [23]. Models are trained
with the training set, hyperparameters are tuned using the
validation set, and reported results are based on the test set.
We choose a split of approximately 3:1:1, such that each
8See http://chronoviz.com
participant only appears in one of the three subsets; this
is to ensure that we are measuring the model’s ability to
generalise and avoid data leakage. The recommended split is
included in the dataset download. Table 3 summarises high-
level statistics on these splits.
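The recommended split shipped with the dataset should be used for benchmarking; purely for illustration, a participant-level split that avoids leakage could be derived as follows.

```python
import random

def participant_split(participant_ids, seed=42):
    """Illustrative 3:1:1 split at the participant level (no participant overlap)."""
    ids = sorted(participant_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(0.6 * len(ids))
    n_val = int(0.2 * len(ids))
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])
```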
F. DEMOGRAPHICS
Out of 180 participants in total, 161 agreed to complete a
demographics questionnaire. Across the dataset, 67% iden-
tified as male and 33% as female. The median age is 24,
with the minimum and maximum age being 18 and 54 years
respectively. Reported ethnicities in the dataset include White
Australian (52.2%), White other European (9.9%), Chinese
(8.7%), Other Asian (8.7%), Persian (5.6%), Arabic (3.1%),
White British (3.1%), African (2.5%), and South East Asian
(1.8%). About 10% reported being left-, and 90% right-
handed.
G. AVAILABILITY
The OREBA dataset is available on request to research
groups at academic institutions. Please visit
http://www.newcastle.edu.au/oreba to download the data
sharing agreement and get access.
IV. BASELINE FOR INTAKE GESTURE DETECTION
Intake gesture detection refers to the task of detecting the
times of individual intake gestures from sensor data. Similar
to dataset papers in other areas [11] [12], we provide baseline
results for this task on OREBA-DIS and OREBA-SHA. We
apply the two-stage approach proposed by Kyritsis et al. [9]
to estimate frame-level intake probabilities and detect intake
gestures. For this purpose, we train separate baseline models
on inertial and video sensor data, introduced in Section IV-A.
To ensure comparability with future studies, we use the
publicly available data splits introduced in Section III-E for
training, validation, and testing; details on training and
the evaluation metrics used are reported in Section IV-B.
A. BASELINE MODELS
For each modality, we use one simple CNN and one more
complex model proposed in previous studies [20] [24]. As
listed in Table 4, this results in a total of eight baseline
models, considering the different scenarios, modalities, and
models.
1) Inertial
The inertial models are taken from a recent study on OREBA
by Heydarian et al. [20]. We compare the simple CNN with
the more complex CNN-LSTM proposed in the aforemen-
tioned work. The simple CNN consists of seven CNN layers
and one fully-connected layer, with one max pooling layer
following each CNN layer. The CNN-LSTM consists of four
CNN layers with 128 kernels each, two LSTM layers with
64 units each, and one fully-connected layer. Full details on
these models are available in the Supplemental Material S1.
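For orientation, the sketch below outlines a CNN-LSTM of this general shape in Keras; kernel sizes, window length, input channels, and the output layer are assumptions, and the exact architecture is described in the Supplemental Material.

```python
import tensorflow as tf

def build_cnn_lstm(window_len=128, num_channels=12, num_classes=2):
    """Sketch of a CNN-LSTM: four 1-D conv layers with 128 kernels,
    two LSTM layers with 64 units, and a fully-connected output layer."""
    inputs = tf.keras.Input(shape=(window_len, num_channels))
    x = inputs
    for _ in range(4):
        x = tf.keras.layers.Conv1D(128, kernel_size=5, padding="same",
                                   activation="relu")(x)
    x = tf.keras.layers.LSTM(64, return_sequences=True)(x)
    x = tf.keras.layers.LSTM(64)(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```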
FIGURE 4. Example of a labeled intake gesture with video, accelerometer, and gyroscope sensor data (label shown: Intake, Intake-Eat, Right, Spoon). For easier display, the video framerate has been reduced.
TABLE 3. Summary statistics for our dataset and the training/validation/test split. Each cell gives count / mean duration [s] / std [s].

Scenario    Type          Training              Validation            Test                  Total
OREBA-DIS   Participants  61 / 804.98 / 238.64  20 / 793.14 / 254.36  19 / 875.77 / 217.40  100 / 816.07 / 237.47
OREBA-DIS   Intake Gest.  2907 / 2.36 / 1.04    943 / 2.24 / 0.98     940 / 2.29 / 1.01     4790 / 2.32 / 1.02
OREBA-SHA   Participants  63 / 838.55 / 259.76  20 / 811.31 / 221.04  19 / 824.60 / 201.68  102 / 830.61 / 240.80
OREBA-SHA   Intake Gest.  2574 / 2.44 / 1.16    896 / 2.23 / 1.17     809 / 2.24 / 1.02     4279 / 2.36 / 1.14
OREBA-SHA   Serve Gest.   337 / 9.87 / 4.52     107 / 10.99 / 6.56    112 / 9.05 / 4.34     556 / 9.92 / 4.97
2) Video
The video models are taken from a recent study on OREBA
by Rouast et al. [24]. For our comparison we use the simple
CNN and the more complex ResNet-50 SlowFast proposed
by Rouast et al. While the simple CNN only uses one frame
at a time, the ResNet-50 SlowFast model uses 16 frames.
The ResNet-50 SlowFast model consists of two 50-layer 3D
CNNs which are fused with lateral connections and spatially
aligned 2D conv fusion. Full details on these models are
available in the Supplemental Material S2.
B. TRAINING AND EVALUATION METRICS
1) Training
We train a total of eight baseline models. All baseline models
are trained using the Adam optimizer with an exponentially
decaying learning rate on the respective training dataset. We
use batch size 256 for inertial data, and batch sizes 8 (ResNet-
50 SlowFast) / 64 (CNN) for video data. Model selection is
done using the validation set.
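An optimizer configuration of this kind can be set up as in the sketch below; the initial learning rate and decay schedule shown are placeholders, not the values used for the reported baselines.

```python
import tensorflow as tf

# Adam with an exponentially decaying learning rate (illustrative values only).
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=10000, decay_rate=0.9)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```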
2) Evaluation metrics
We extend the evaluation scheme proposed by Kyritsis et
al. [9] as depicted in Fig. 5. The scheme uses the ground
truth to translate sparse detections into measurable metrics
for a given label category. As Rouast and Adam [24] re-
port, one correct detection per ground truth event counts
as a true positive (TP), while further detections within the
same ground truth event are false positives of type 1 (FP 1).
Detections outside ground truth events are false positives of
type 2 (FP2), and non-detected ground truth events count as
false negatives (FN). The scheme has been extended here to
support the multi-class case, where detections for a wrong
class are false positives of type 3 (FP3). Based on the aggregate
counts, precision (TP / (TP + FP1 + FP2 + FP3)), recall
(TP / (TP + FN)), and the F1 score (2 · precision · recall /
(precision + recall)) can be calculated.

FIGURE 5. The evaluation scheme (proposed by [9]; figure from [24],
extended here). (1) A true positive is the first detection within each ground
truth event; (2) false positives of type 1 are further detections within the same
ground truth event; (3) false positives of type 2 are detections outside ground
truth events; (4) false positives of type 3 are detections made for the wrong
class; (5) false negatives are non-detected ground truth events.
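A minimal single-class sketch of this counting scheme is given below; it omits the multi-class FP3 case and any tolerance handling, and the function name is ours.

```python
def evaluate_detections(detections, events):
    """Count TP, FP1, FP2, FN for sparse detections against ground truth intervals.

    detections: iterable of detection times in seconds (single class).
    events: list of (start, end) ground truth intervals in seconds.
    """
    tp = fp1 = fp2 = 0
    matched = [False] * len(events)
    for t in sorted(detections):
        hit = next((i for i, (s, e) in enumerate(events) if s <= t <= e), None)
        if hit is None:
            fp2 += 1            # detection outside any ground truth event
        elif matched[hit]:
            fp1 += 1            # further detection within the same event
        else:
            matched[hit] = True
            tp += 1             # first detection within the event
    fn = matched.count(False)   # non-detected ground truth events
    precision = tp / (tp + fp1 + fp2) if (tp + fp1 + fp2) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"TP": tp, "FP1": fp1, "FP2": fp2, "FN": fn, "F1": f1}
```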
C. BASELINE RESULTS
Table 4 reports the test set results for the aforementioned
models on both OREBA-DIS and OREBA-SHA.
1) Inertial
Heydarian et al. [20] ran multiple experiments benchmarking
different deep learning models and pre-processing pipelines.
The top model performance was achieved by a CNN-LSTM
with earliest fusion through a dedicated CNN layer and target
matching. Concerning preprocessing, their results show that
applying a consecutive combination of mirroring, removing
TABLE 4. Baseline test set results for intake gesture detection. On OREBA-DIS, the video model performs better than the inertial model, while the opposite is true
on OREBA-SHA. This indicates that the test set for OREBA-DIS is more challenging when using inertial data, while the test set for OREBA-SHA is more challenging
when using video data.
Dataset     Modality  Method                   TP   FP1  FP2  FN   F1
OREBA-DIS   Video     CNN [24]                 668  28   242  267  0.713
OREBA-DIS   Video     ResNet-50 SlowFast [24]  749  20   52   186  0.853
OREBA-DIS   Inertial  CNN [20]                 702  42   171  235  0.758
OREBA-DIS   Inertial  CNN-LSTM [20]            743  41   188  194  0.778
OREBA-SHA   Video     CNN [24]                 564  26   228  239  0.696
OREBA-SHA   Video     ResNet-50 SlowFast [24]  670  16   170  133  0.808
OREBA-SHA   Inertial  CNN [20]                 719  46   146  84   0.839
OREBA-SHA   Inertial  CNN-LSTM [20]            732  34   149  71   0.852
Note: The total numbers of intake gestures may slightly differ from Table 3. This is a technical
implication of sampling with different frequencies (e.g., 8 fps for video), which can cause temporally
close intake gestures to merge.
the gravity effect, and standardization was beneficial for
model performance, while smoothing had adverse effects.
From the results in Table 4, it appears that the test set
for OREBA-DIS (F1 = 0.778) is more challenging for
inertial data than the test set for OREBA-SHA (F1 = 0.852).
Comparing the simple CNN with the more advanced CNN-LSTM
approach, we find that the CNN-LSTM adds relative
improvements of 2.6% (OREBA-DIS) and 1.5% (OREBA-SHA)
over the simple CNN.
2) Video
Rouast and Adam [24] applied several deep learning architectures
established in the literature on video action recognition
to the task of detecting intake gestures directly from the
video data in OREBA-DIS. The best test set result was
achieved using a SlowFast [25] network with ResNet-50
[26] as backbone. Further conclusions from the experiments
are that appearance features are more useful than motion
features, and that temporal context in form of multiple video
frames is essential for top model performance.
The results in Table 4 indicate that the test set for OREBA-SHA
(F1 = 0.808) is more challenging when working with
video data than the test set for OREBA-DIS (F1 = 0.853).
Comparing results between the models, we find that the more
advanced ResNet-50 SlowFast adds relative improvements of
19.6% (OREBA-DIS) and 16.1% (OREBA-SHA) over the
simple CNN.
V. DISCUSSION
In this paper, we have introduced the OREBA dataset, which
provides a comprehensive multi-sensor recording with la-
beled gestures of communal intake occasions from discrete
and shared meals. Building on a summary of related work
on automatic dietary monitoring and an overview of existing
public datasets in the field, we provided details on the data
collection, sensor processing, and annotation methods em-
ployed in the creation of the OREBA dataset. Additionally,
we reported baseline results on the task of intake gesture
detection based on video and inertial sensor data.
Sensor-based, passive methods of dietary monitoring have
the potential of complementing existing active methods such
as food records and 24-hr recall. As seen in other fields
such as object recognition [11] and action recognition [10],
progress in the research of machine learning methods is
tightly linked to the availability and ongoing development of
datasets with labeled examples. In this light, we hope that the
OREBA dataset will be able to support future developments
in automatic dietary monitoring. Compared to existing public
datasets of labeled intake gestures, the OREBA dataset is
unique as it (i) is multimodal, with synchronised frontal video
data based on spherical video recordings and inertial data
from both hands at 64 Hz, and (ii) includes a total of
202 recordings across two different communal eating scenarios.
As the first intake gesture detection dataset that also makes
the video recordings available, OREBA enables researchers
to independently verify, extend, and update the provided
data annotations. This further increases the transparency and
reliability of the dataset, and the machine learning models
building on it.
While we have reported results from our initial studies on
detecting intake gestures with either video or inertial sensor
data, there are also several other directions that research on
this dataset could go into. Sensor fusion of video and IMU for
detecting intake gestures could be explored to combine the
strengths of both approaches. Further, the OREBA dataset
can be used to compare and contrast CNN-LSTM [9] [24]
and CNN-GRU [17] models. Improvements could also be
made by using transfer learning between the two different
scenarios, making it possible to contrast the difficulties of
monitoring discrete versus shared dishes. Thanks to the availability of
inertial data for both hands, studies could also explore how
much information is retained on the dominant versus the non-
dominant hand, which has implications for automatic dietary
monitoring using commercial smartwatches. Further, while
our initial studies focused on detecting individual intake gestures,
future studies could explore the other label categories,
for example, how well the video or inertial modalities
perform at distinguishing between different utensils. As such,
and beyond researchers specifically interested in dietary
monitoring, the OREBA dataset could also be a valuable resource for re-
searchers interested in advancing machine learning models
for sensor fusion more broadly.
VI. CONCLUSIONS
Publicly available datasets are an important resource to
fuel advances in machine learning [11] [12]. In this paper,
we have introduced a comprehensive multi-sensor recording
dataset with labeled gestures of communal intake occasions
from discrete and shared meals. To the best of our knowledge,
this is the first dataset for intake gesture detection that pro-
vides synchronized data in the form of both frontal video and
inertial sensor data from both hands. By making this dataset
publicly available to the research community, OREBA has
the potential to advance research and foster innovation in
this area as it allows researchers to objectively benchmark
existing and emerging machine learning approaches and
reduces the burden for researchers to collect and annotate
their own data.
ACKNOWLEDGMENTS
We acknowledge the untiring support of Clare Cummings,
Grace Manning, Alice Melton, Kaylee Slater, Felicity Steel
and Sam Stewart in collecting and annotating the data.
REFERENCES
[1] S. W. Lichtman, K. Pisarska, E. R. Berman, M. Pestone, H. Dowling,
E. Offenbacher, H. Weisel, S. Heshka, D. E. Matthews, and S. B. Heyms-
field, “Discrepancy between self-reported and actual caloric intake and
exercise in obese subjects,” New England J. Medicine, vol. 327, no. 27,
pp. 1893–1898, 1992.
[2] P. V. Rouast, M. T. P. Adam, T. Burrows, R. Chiong, and M. E. Rollo,
“Using deep learning and 360 video to detect eating behavior for user
assistance systems,” in Proc. Europ. Conf. Information Systems, 2018, pp.
1–11.
[3] K. Kyritsis, C. Diou, and A. Delopoulos, “Food intake detection from
inertial sensors using lstm networks,” in Proc. Int. Conf. Image Analysis
and Processing, 2017, pp. 411–418.
[4] J. Qiu, F. P.-W. Lo, and B. Lo, “Assessing individual dietary intake in food
sharing scenarios with a 360 camera and deep learning,” in Proc. Int. Conf.
Wearable and Implantable Body Sensor Networks, 2019, pp. 1–4.
[5] D. Konstantinidis, K. Dimitropoulos, B. Langlet, P. Daras, and
I. Ioakimidis, “Validation of a deep learning system for the full automation
of bite and meal duration analysis of experimental meal videos,” Nutrients,
vol. 12, no. 209, pp. 1–16, 2020.
[6] T. Burrows, C. Collins, M. T. P. Adam, K. Duncanson, and M. Rollo,
“Dietary assessment of shared plate eating: A missing link,” Nutrients,
vol. 11, no. 4, pp. 1–14, 2019.
[7] C. Merck, C. Maher, M. Mirtchouk, M. Zheng, Y. Huang, and S. Klein-
berg, “Multimodality sensing for eating recognition,” in Proc. Int. Conf.
Pervasive Computing Technologies for Healthcare, 2016, pp. 130–137.
[8] Y. Shen, J. Salley, E. Muth, and A. Hoover, “Assessing the accuracy of a
wrist motion tracking method for counting bites across demographic and
food variables,” IEEE J. Biomedical and Health Informatics, vol. 21, no. 3,
pp. 599–606, 2017.
[9] K. Kyritsis, C. Diou, and A. Delopoulos, “Modeling wrist micromove-
ments to measure in-meal eating behavior from inertial sensor data,” IEEE
J. Biomedical and Health Informatics, pp. 1–11, 2019.
[10] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijaya-
narasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and
A. Zisserman, “The kinetics human action video dataset,” arXiv preprint
arXiv:1705.06950, 2017.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, “Imagenet: A
large-scale hierarchical image database,” in Proc. CVPR, 2009, pp. 248–
255.
[12] A. Mollahosseini, B. Hasani, and M. H. Mahoor, “Affectnet: A database
for facial expression, valence, and arousal computing in the wild,” IEEE
Trans. Affect. Comput., pp. 1–17, 2017.
[13] H. Heydarian, M. Adam, T. Burrows, C. Collins, and M. E. Rollo,
“Assessing eating behaviour using upper limb mounted motion sensors:
A systematic review,” Nutrients, vol. 11, no. 1168, pp. 1–25, 2019.
[14] S. Zhang, D. T. Nguyen, G. Zhang, R. Xu, N. Maglaveras, and N. Al-
shurafa, “Estimating caloric intake in bedridden hospital patients with
audio and neck-worn sensors,” in Proc. Int. Conf. Connected Health:
Applications, Systems and Engineering Technologies. ACM, 2018, pp.
1–2.
[15] S. Zhang, W. Stogin, and N. Alshurafa, “I sense overeating: Motif-
based machine learning framework to detect overeating using wrist-worn
sensing,” Inf. Fusion, vol. 41, pp. 37–47, 2018.
[16] K. Kyritsis, C. Diou, and A. Delopoulos, “A data driven end-to-end ap-
proach for in-the-wild monitoring of eating behavior using smartwatches,”
IEEE J. Biomedical and Health Informatics, pp. 1–13, 2020.
[17] “A sequence-to-sequence model-based deep learning approach for rec-
ognizing activity of daily living for senior care,” Journal of Biomedical
Informatics, vol. 84, pp. 148–158, 2018.
[18] P. V. Rouast, M. T. P. Adam, and R. Chiong, “Deep learning for human
affect recognition: Insights and new developments,” IEEE Trans. Affect.
Comput., pp. 1–20, 2019.
[19] S. Hantke, F. Weninger, R. Kurle, F. Ringeval, A. Batliner, A. E.-D. Mousa,
and B. Schuller, “I hear you eat and speak: Automatic recognition of eating
condition and food type, use-cases, and impact on asr performance,” PLoS
One, vol. 11, no. 5, pp. 1–24, 2016.
[20] H. Heydarian, P. V. Rouast, M. T. P. Adam, T. Burrows, and M. E.
Rollo, “Deep learning for intake gesture detection from wrist-worn iner-
tial sensors: The effects of preprocessing, sensor modalities, and sensor
positions,” IEEE Access, vol. 8, pp. 1–14, 2020.
[21] S. Madgwick, “An efficient orientation filter for inertial and iner-
tial/magnetic sensor arrays,” University of Bristol (UK), Tech. Rep., 2010.
[22] V. Papapanagiotou, C. Diou, I. Ioakimidis, P. Sodersten, and A. Delopou-
los, “Automatic analysis of food intake and meal microstructure based
on continuous weight measurements,” IEEE J. Biomedical and Health
Informatics, vol. 23, no. 2, pp. 893–902, 2019.
[23] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press,
2016.
[24] P. V. Rouast and M. T. P. Adam, “Learning deep representations for
video-based intake gesture detection,” IEEE J. Biomedical and Health
Informatics, vol. 24, no. 6, pp. 1727–1737, 2019.
[25] C. Feichtenhofer, H. Fan, J. Malik, and K. He, “Slowfast networks for
video recognition,” arXiv preprint arXiv:1812.03982, 2018.
[26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proc. CVPR, 2016, pp. 770–778.
PHILIPP V. ROUAST received the B.Sc. and
M.Sc. degrees in Industrial Engineering from
Karlsruhe Institute of Technology, Germany, in
2013 and 2016 respectively. He is currently work-
ing towards the Ph.D. degree in Information Sys-
tems and is a graduate research assistant at The
University of Newcastle, Australia. His research
interests include deep learning, affective comput-
ing, HCI, and related applications of computer
vision. Find him at https://www.rouast.com.
HAMID HEYDARIAN received the B.Sc. in Com-
puter Engineering (software) from Kharazmi Uni-
versity, Iran, in 2002. He is a senior software
developer currently working towards the Ph.D. in
Information Technology and is a casual academic
at The University of Newcastle, Australia. His re-
search interests include inertial signal processing
using deep learning and its related applications
in dietary intake assessment and passive dietary
monitoring.
MARC T. P. ADAM is an Associate Profes-
sor in Computing and Information Technology
at the University of Newcastle, Australia. In his
research, he investigates the interplay of human
users’ cognition and affect in human-computer
interaction. He is a founding member of the So-
ciety for NeuroIS. He received an undergraduate
degree in Computer Science from the University
of Applied Sciences Würzburg, Germany, and a
PhD in Economics of Information Systems from
Karlsruhe Institute of Technology, Germany.
MEGAN E. ROLLO is a Research Fellow in Nu-
trition and Dietetics within the School of Health
Sciences and Priority Research Centre for Phys-
ical Activity and Nutrition at The University of
Newcastle, Australia. She has received BAppSci,
BHlthSci(Nutr&Diet), and PhD degrees from the
Queensland University of Technology, Australia.
She has research interests in technology-assisted
dietary assessment and personalized behavioral
nutrition interventions.