
Learning Deep Representations for Video-Based Intake Gesture Detection



This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2019.2942845, IEEE Journal of
Biomedical and Health Informatics
Philipp V. Rouast, Student Member, IEEE, Marc T. P. Adam
Abstract—Automatic detection of individual intake gestures
during eating occasions has the potential to improve dietary
monitoring and support dietary recommendations. Existing stud-
ies typically make use of on-body solutions such as inertial and
audio sensors, while video is used as ground truth. Intake gesture
detection directly based on video has rarely been attempted. In
this study, we address this gap and show that deep learning
architectures can successfully be applied to the problem of video-
based detection of intake gestures. For this purpose, we collect
and label video data of eating occasions using 360-degree video
of 102 participants. Applying state-of-the-art approaches from
video action recognition, our results show that (1) the best model
achieves an F1 score of 0.858, (2) appearance features contribute
more than motion features, and (3) temporal context in the form of
multiple video frames is essential for top model performance.
Index Terms—Deep learning, intake gesture detection, dietary
monitoring, video-based
DIETARY monitoring plays an important role in assessing
an individual’s overall dietary intake and, based on this,
providing targeted dietary recommendations. Dietitians [1] and
personal monitoring solutions [2] rely on accurate dietary
information to support individuals in meeting their health
goals. For instance, research has shown that the global risk and
burden of non-communicable disease is associated with poor
diet and hence requires targeted interventions [3]. However,
manually assessing dietary intake often involves considerable
processing time and is subject to human error [4].
Automatic dietary monitoring aims to detect (i) when, (ii)
what, and (iii) how much is consumed [5]. This is a complex
and multi-faceted problem involving tasks such as action
detection to identify intake gestures (when), object recognition
and segmentation to identify individual foods (what), as well
as volume and density estimation to derive food quantity
(how much). A variety of sensors have been explored in the
literature, including inertial, audio, visual, and piezoelectric
sensors [5], [6], [7].
Detection of individual intake gestures can improve detec-
tion of entire eating occasions [8] and amounts consumed [9].
It also provides access to measures such as intake speed, as
well as meta-information for easier review of videos. Although
video is often used as ground truth for studies focused on
detecting chews, swallows, and intake gestures, it has rarely
been used as the basis for automatic detection. However, there
are several indications that video could be a suitable data
source to monitor such events: (i) increasing exploration of
video monitoring in residential and hospital settings [10], [11],
(ii) the rich amount of information embedded in the visual
modality, and (iii) recent advances in machine learning, and
in particular deep learning [12], for video action recognition
that have largely been left unexplored in dietary monitoring.

The authors are with the School of Electrical Engineering and Computing,
The University of Newcastle, Callaghan, NSW 2308, Australia.

In this paper, we address this gap by demonstrating the
feasibility of using deep neural networks (DNNs) for auto-
matic detection of intake gestures from raw video frames. For
this purpose, we investigate the 3D CNN [13], CNN-LSTM
[14], Two-Stream [15], and SlowFast [16] architectures which
have been applied in the field of video action recognition,
but not for dietary monitoring. These architectures make it possible to
consider temporal context in the form of multiple frames.
Further, instead of relying on handcrafted models and features,
deep learning leverages a large number of examples to learn
feature representations on multiple levels of abstraction. In
dietary monitoring, deep learning has mainly been used for
image-based food recognition (what) [17], and recently in
intake gesture detection based on inertial sensors (when) [6].
However, it has yet to be applied to video-based intake gesture
detection. Our main contributions are the following:
1) We fill the gap between dietary monitoring and video
action recognition by demonstrating the feasibility of
using deep learning architectures to detect individual
intake gestures from raw video frames. We conduct a
laboratory study with 102 participants and 4891 intake
gestures, by sourcing video from a 360-degree camera
placed in the center of the table. A ResNet-50 SlowFast
model achieved the best F1 score of 0.858.
2) Video action recognition can build on both appearance
and motion features. It is in general not clear which are
more important for a given action [18]. Using a 2D CNN
without temporal context, we show that appearance (indi-
vidual frames) performs better than motion (optical flow
between adjacent frames) for detecting intake gestures.
3) Similarly, it is not clear to what extent temporal context
improves model accuracy in detecting a given action [19].
Comparing the best model with temporal context (ResNet-50
SlowFast) and the best 2D CNN without temporal context,
we find a relative F1 improvement of 8%.
The remainder of the paper is organized as follows: In
Section II, we discuss the related literature, including dietary
monitoring and video action recognition. Our proposed models
are introduced in Section III, and the dataset in Section IV.
We present our experiments and results in Section V, and draw
conclusions in Section VI.
Fig. 1. The four investigated approaches from video action recognition based on temporal context t (adapted from Carreira and Zisserman [20]): 1) 3D CNN, 2) CNN-LSTM, 3) Two-Stream, and 4) SlowFast, each operating on frames 1 to K.
A. Dietary monitoring
At the conceptual level, dietary monitoring broadly captures
three components of recognition, namely what (e.g., identi-
fication of specific foods), how much (i.e., quantification of
consumed food), and when (i.e., timing of eating occasions).
Traditional paper-based methods such as recalls and special-
ized questionnaires [21] are still commonly used by dietitians.
Amongst end-users, mobile applications that allow manual
logging of individual meals are also popular. These active
methods are characterized by a considerable amount of effort,
and known to be affected by biases and human error [4].
Realizing the requirement for objective measurements of
a person’s diet, several sensor-based approaches for passively
collecting information associated with diet have been proposed
in the literature. With the emergence of labeled databases of
food images [22], [17], food recognition from still images has
become a popular task in computer vision research. The state
of the art uses features learned by deep convolutional neural
networks (CNNs) to distinguish between food classes [23].
CNNs are DNNs especially designed for visual inputs.
Image-based estimation of food volume and associated calo-
ries typically extends food recognition by volume estimation of
different foods, and linking with nutrient databases [24], [25].
Estimation of food volume from audio and inertial sensors
based on individual bite sizes has also been proposed [9].
In detecting intake behavior, we distinguish between detec-
tion of events describing meal microstructure (e.g., individual
intake gestures), and detecting intake occasions as a whole
(e.g., a meal), which can be seen as clusters of detected
events [26]. Besides aiding in the estimation of food volume
[9], information about meal microstructure can be leveraged
to improve active methods [27]. It also allows dietitians to
quantify measures of interest such as the rate of eating [28].
In general, detection of chews and swallows is typically
attempted using on-body audio or piezoelectric sensors, whilst
detection of intake gestures is the domain of wrist-mounted
inertial sensors [29]. Chews and swallows generate characteristic
audio signatures, which were exploited for automatic detection
of meal microstructure as early as 2005 [30], [31]. Swallows
can also be registered using piezoelectric sensors measuring
strain on the jaw [32]. Inertial sensors can be used to measure
the acceleration and spatial movements of the wrist to identify
intake gestures [33], [34], [35]. Recently, DNNs were applied
for this purpose [6].
B. Video-based intake gesture recognition
Despite the importance of visual sensors for recording
ground truth, video data of eating occasions is rarely consid-
ered as the basis for automatic detection of meal microstruc-
ture. This is surprising, as the visual modality contains a broad
range of information about intake behavior. In fact, in 2004,
one of the earliest works in this field considered surveillance
type video recorded in a nursing home to detect intake ges-
tures [36]. This approach relied on optical flow-based motion
features, which were used to train a Hidden Markov Model.
A further approach used object detection of face, mouth, and
eating utensils, realised with Haar-like appearance
features [37]. We also see skeleton-based approaches with
additional depth information [38], [39]. Deep learning, which
is the state of the art for video action recognition, has not been
explored to the best of our knowledge.
C. Video action recognition
The task of action recognition from video extends 2D image
input by the dimension of time. While temporal context can
carry useful information, it also complicates the search for
good feature representations given the typically much larger
dimensionality of the raw data. Before the proliferation of deep
learning, approaches in video action recognition would follow
the traditional paradigm of pattern recognition: Computing
complex hand-crafted features from raw video frames, based
on which shallow classifiers could be learned. Such features
were either video-level aggregation of local spatio-temporal
features such as HOG3D [40], or point trajectories of dense
points computed, e.g., using optical flow [41]. The following
four deep learning architectures emerged from the literature
on video action recognition, as shown in Fig. 1:
1) 3D CNN – Spatio-temporal convolutions: The 3D CNN
approach features 3D convolutions instead of the 2D con-
volutions found in 2D CNNs. Videos are treated as spatio-
temporal volumes, where the third dimension represents tem-
poral context. 3D CNNs can thus automatically learn low-
level features that take into account both spatial and temporal
information. This approach was first proposed in 2010 by Ji
et al. [13], who integrated 3D convolutions with handcrafted
features. Running experiments with end-to-end training on
larger datasets, Karpathy et al. [19] reported that it works
best to slowly fuse the temporal information throughout the
network. However, they found that temporal context only
improves model accuracy for some classes such as juggling;
furthermore, it reduced accuracy for some classes [19]. Other
experiments regarding architecture choices concluded that 3D
CNNs can model appearance and motion simultaneously [42].
2) CNN-LSTM – Incorporating recurrent neural networks:
In the CNN-LSTM approach, the temporal context is modelled
by a recurrent neural network (RNN). RNNs are DNNs that
take the previous model state as an additional input. In 2015,
Donahue et al. [14] proposed to use the sequence of high-
level spatial features learned by a CNN from individual video
frames as input into a long short-term memory (LSTM) RNN.
Such LSTM networks are known to be easier to train for longer
sequences [43]. The CNN-LSTM model has the advantage of
being more flexible with regard to the number of input frames,
but has a relatively large number of parameters and appears to be
more data hungry than other approaches [20].
3) Two-Stream – Decoupling appearance and motion: In
2014, Simonyan and Zisserman [15] observed that 2D CNN
models without temporal context achieved accuracy close to
the 3D CNN approach [19], and that state-of-the-art accuracies
involved handcrafted trajectory-based representations based on
motion features. They proposed the two-stream architecture,
which decouples appearance and motion by using a single
still frame (appearance) and temporal context in the form of
stacked optical flow (motion). Both are fed into separate
CNNs, where the appearance CNN is pre-trained on the large
ImageNet database. While the original design employed score-
level fusion [15], later variants used feature-level fusion of the
last CNN layers [44].
4) SlowFast – Joint learning at different temporal resolu-
tions: The SlowFast architecture proposed by Feichtenhofer et
al. [16] in late 2018 learns from temporal context at multiple
temporal resolutions. As of mid 2019, it represents the state
of the art in video action recognition with 79% accuracy on
the large Kinetics dataset without any pre-training. The idea of
decoupling slow and fast motion is integrated into the network
design. Two pathways make up the SlowFast architecture,
consisting of a 3D CNN each: The slow pathway has more
capacity to learn about appearance than motion, while the
fast pathway works the other way around. This is realized by
setting a factor α as the difference in sequence downsampling,
and a factor β as the difference in learned channels. A
number of lateral connections allow information from both
pathways to be fused.
Detecting individual intake gestures from video requires
prediction of sparse points in time. We adopt the approach
of Kyritsis et al. [6] and split this problem into two stages, as
illustrated in Fig. 2:
Stage I: Estimation of state probability at the frame level,
i.e., estimating the probability $p_{\text{intake}}$ for each frame, and
Stage II: Detection of intake gestures, by selecting sparse
points in time based on the estimated probabilities.
A. Stage I: Models for frame-level probability estimation
In Stage I, our models estimate $p_{\text{intake}}$, which is the
probability that the label of the target frame is “intake”. The
four models identified from the literature on video action
recognition represent our main models (3D CNN, CNN-LSTM,
Two-Stream, SlowFast; see Fig. 1). In addition to the
target frame, each of these four models considers a temporal
context of further frames preceding the target frame. As a
baseline and for experiments, we additionally employ a 2D
CNN. Because the 2D CNN does not have a temporal context,
this enables us to (1) discern to what extent the temporal
context improves model performance and (2) directly compare
the importance of appearance and motion features.

Fig. 2. Illustration of sample outputs at the two stages. In Stage I, the model
estimates the probability $p_{\text{intake}}$ for a target frame. For models with temporal
context, input consists of multiple frames (16 in our experiments), of which
the last frame is the target. In Stage II, detections of intake events are realized
using a local maximum search on the $p_t$-thresholded series of probabilities,
where the detections have to be at least $d$ apart (in our experiments, $d = 2$ s).

For each model, we propose a small instantiation with rela-
tively few parameters, and a larger instantiation using ResNet-
50 [45] as backbone. In the following, we present each of
the proposed models adapted for food intake gesture detection
(see Tables I and II for details). Source code for all models is
available at
0) 2D CNN: The 2D CNN functions as a baseline for our
study, indicating what is possible without temporal context.
This allows us to discern the importance of the temporal
context for intake gesture detection. Further, the 2D CNN also
allows us to directly compare a model based solely on motion
to one solely based on visual appearance. This assessment is
not possible for the other four models.
Motion information can be of importance for classes with
fast movement such as juggling [19]. For detection of intake
gestures, it seems intuitive that appearance may be the more
important modality, which is what we are seeking to confirm
here. For appearance, the input is the single target frame, and for
motion, it is the optical flow between the target frame and the frame
directly preceding it. We use Dual TV-L1 optical flow [46],
which produces two channels of optical flow corresponding to
the horizontal and vertical components, as opposed to three
RGB channels for frames.
0a) Small instantiation A five-layer CNN of the architecture
type popularised by AlexNet [47].
0b) ResNet-50 instantiation We adopt the architecture given
by [45], which allows us to use pre-trained models.

Table I. Small instantiations of the proposed models. Cells list kernel size and number of filters; stride (str.); output dimensions.

Layer    0a) 2D CNN (a)             1a) 3D CNN                        2a) CNN-LSTM                  3a) Two-Stream (a) 4a) SlowFast (b)
data     128²×3|2                   16×128²×3                         16×128²×3                     128²×3             16×128²×3; str. 1²
conv1    3², 32; str. 1²; 128²×32   3×3², 32; str. 1×1²; 16×128²×32   3², 32; str. 1²; 16×128²×32   3², 32; str. 1²    str. 1×1²; 16×128²×8
pool1    2²; str. 2²; 64²×32        2×2²; str. 2×2²; 8×64²×32         2²; str. 2²; 16×64²×32        2²; str. 2²        str. 1×2²; 16×64²×8
conv2    3², 32; str. 1²; 64²×32    3×3², 32; str. 1×1²; 8×64²×32     3², 32; str. 1²; 16×64²×32    3², 32; str. 1²    str. 1×1²; 16×64²×8
pool2    2²; str. 2²; 32²×32        2×2²; str. 2×2²; 4×32²×32         2²; str. 2²; 16×32²×32        2²; str. 2²        str. 1×2²; 16×32²×8
conv3    3², 64; str. 1²; 32²×64    3×3², 64; str. 1×1²; 4×32²×64     3², 64; str. 1²; 16×32²×64    3², 64; str. 1²    str. 1×1²; 16×32²×16
pool3    2²; str. 2²; 16²×64        2×2²; str. 2×2²; 2×16²×64         2²; str. 2²; 16×16²×64        2²; str. 2²        str. 1×2²; 16×16²×16
conv4    3², 64; str. 1²; 16²×64    3×3², 64; str. 1×1²; 2×16²×64     3², 64; str. 1²; 16×16²×64    3², 64; str. 1²    str. 1×1²; 16×16²×16
pool4    2²; str. 2²; 8²×64         2×2²; str. 2×2²; 1×8²×64          2²; str. 2²; 16×8²×64         2²; str. 2²        str. 1×2²; 16×8²×16
fusion                                                                                              8²×64              8²×64
flatten  4096                       4096                              16×4096                       4096               4096
dense    1024                       1024                              16×1024                       1024               1024
lstm                                                                  16×128
dense    2                          2                                 16×2                          2                  2

(a) For 2D CNN and Two-Stream, colors red and blue highlight how dimensions differ between frames and flows.
(b) For SlowFast, colors orange|cyan highlight the differences in model parameters and dimensions between the slow and fast pathways.
(c) Only for flow input; serves the purpose of producing 3 channels for transfer learning.

1) 3D CNN: This model has the ability to learn spatio-
temporal features. We extend the 2D CNN introduced in the
previous section by using 3D instead of 2D convolutions. The
third dimension corresponds to the temporal context. We use
temporal pooling following the slow fusion approach [19].
1a) Small instantiation Extending the small 2D CNN to 3D,
we use temporal convolution kernels of size 3 as recom-
mended by [42]; temporal pooling is realized in the max
pooling layers.
1b) ResNet-50 instantiation We extend ResNet-50 [45] to
3D, but modify the dimensions to fit our input, since we
do not use transfer learning for the 3D CNN. Within each
block, the first convolutional layer has a temporal kernel
size of 3, a choice adopted from [16]. Temporal fusion
is facilitated by using temporal stride 2 in the second
convolutional layer of the first block in each block layer.
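
As an illustration of the slow fusion idea in the small 3D CNN (1a), the following minimal sketch tracks how the temporal dimension shrinks through the network. It assumes 16 input frames and four max pooling layers with temporal stride 2, as listed for model 1a) in Table I; the function name `temporal_lengths` is hypothetical and not taken from the authors' source code.

```python
def temporal_lengths(n_frames=16, n_pools=4, temporal_stride=2):
    """Return the temporal length of the feature volume after each pooling layer."""
    lengths = [n_frames]
    for _ in range(n_pools):
        # Each max pooling layer halves the temporal dimension (stride 2 in time).
        lengths.append(lengths[-1] // temporal_stride)
    return lengths


# 16 frames are fused step by step until a single time step remains.
print(temporal_lengths())
```

Under these assumptions, temporal information is merged gradually rather than in a single layer, which is the behaviour that the slow fusion approach [19] aims for.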
2) CNN-LSTM: The CNN-LSTM adds an LSTM layer to
model a sequence of high-level features learned from raw
frames. Note that this does not allow the model to learn low-
level spatio-temporal features (as opposed to 3D CNN). Given
the clear temporal structure of intake gestures (movement
towards the mouth and back), it does however seem intuitive
that knowledge of the development of high-level features from
temporal context could help predict the current frame.
2a) Small instantiation We use the features from the first
dense layer of the small 2D CNN described previously
as input into one LSTM layer with 128 units.
2b) ResNet-50 instantiation The spatially pooled output of a
ResNet-50’s [45] last block is used as input into one
LSTM layer with 128 units.
3) Two-Stream: For our instantiations of the Two-Stream
approach, we follow the original work by Simonyan and Zis-
serman [15] to select the model input: The appearance stream
takes the target frame as input; meanwhile, the motion stream
is based on the stacked horizontal and vertical components
of optical flow calculated using Dual TV-L1 from pairs of
consecutive frames in the temporal context.
3a) Small instantiation Motion and appearance stream both
follow the small 2D CNN architecture; after the last
pooling layer, the streams are pooled using spatially
aligned conv fusion as proposed by [44].
3b) ResNet-50 instantiation Motion and appearance stream
both follow the ResNet-50 [45] architecture; after the last
block layer, the streams are pooled using spatially aligned
conv fusion [44].
Table II. ResNet-50 instantiations of the proposed models. Cells list kernel size and number of filters; stride; output dimensions.

Layer          0b) 2D CNN (a)      1b) 3D CNN              2b) CNN-LSTM          3b) Two-Stream (a) 4b) SlowFast (b)
data           224²                16×128²                 16×224²                                  16×128²×3; stride 1²
conv1          7², 64; stride 2²   stride 1×1²; 16×128²    stride 2²; 16×112²    stride 2²          stride 1×1²; 16×128²×8
pool1          3²; stride 2²       stride 2×2²             stride 2²; 16×56²     stride 2²          stride 1×2²; 16×64²×8
block layer 1                                              16×56²                                   16×64²×32
block layer 2                                              16×28²                                   16×32²×64
block layer 3                                              16×14²                                   16×16²×128
block layer 4                                              16×7²                                    16×8²×256
fusion                                                                           7²×2048            1×1²×2560
flatten        2048                2048                    16×2048               2048               2560
lstm                                                       16×128
dense          2                   2                       16×2                  2                  2

(a) For 2D CNN and Two-Stream, colors red and blue highlight how dimensions differ between frames and flows.
(b) For SlowFast, colors orange|cyan highlight the differences in model parameters and dimensions between the slow and fast pathways.
(c) Only for flow input; serves the purpose of producing 3 channels for transfer learning.

4) SlowFast: The SlowFast model processes the temporal
context at two different temporal resolutions. Since our dataset
has fewer frames than in the original work [16], we choose the
factors α = 4 and β = 0.25 for our SlowFast instantiations.
4a) Small instantiation Both pathways are based on the small
2D CNN; we extend the convolutional layers to 3D and
set the temporal kernel size to 1 for the slow pathway
and to 3 for the fast pathway. Following [16], we choose
time-strided convolutions of kernel size 3×1² for a lateral
connection after each of the four convolutional layers.
Fusion consists of temporal average pooling and spatially
aligned 2D conv fusion [44].
4b) ResNet-50 instantiation We directly follow [16] who
themselves used ResNet-50 as backbone for SlowFast,
only using the same dimension tweaks as in our ResNet-
50 2D CNN. Fusion consists of global average pooling
and concatenation.
B. Loss calculation
We use cross-entropy loss for all our models. At evaluation
time, we only consider the target frame for prediction, which
corresponds to the last frame of the input (see Fig. 2). The
same applies to loss calculation during training for all models
except CNN-LSTM: Following [20], we train the CNN-LSTM
using the labels of all input frames, but evaluate only using
the label of the target frame.
Due to the nature of our data, the classes are very imbalanced,
with many more “non-intake” frames than “intake”
frames. When computing mini-batch loss, we correct for this
imbalance by using weights calculated as
$$w_i = \frac{m}{n \cdot C(i)}$$
to scale the loss for the $m$ labels $y = \{y_1, \dots, y_m\}$ in each
minibatch, where $n$ is the number of classes and $C(i)$ is the
number of elements of $y$ which equal $y_i$.
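
The weighting can be sketched in a few lines of plain Python. The sketch assumes the weights take the common class-balanced form $w_i = m / (n \cdot C(i))$, which is consistent with the symbols defined above; the function names `balanced_weights` and `weighted_cross_entropy` are hypothetical, and an actual training pipeline would compute the same quantities inside the deep learning framework.

```python
import math
from collections import Counter


def balanced_weights(y, n_classes):
    """Return one weight per label: w_i = m / (n * C(i))."""
    m = len(y)
    counts = Counter(y)  # C(i): how often each label value occurs in the mini-batch
    return [m / (n_classes * counts[label]) for label in y]


def weighted_cross_entropy(probs, y, n_classes=2):
    """Mean weighted cross-entropy; probs[i][k] is the predicted probability of class k."""
    weights = balanced_weights(y, n_classes)
    losses = [-w * math.log(p[label]) for w, p, label in zip(weights, probs, y)]
    return sum(losses) / len(losses)


# Mini-batch of 8 frames: 6 "non-intake" (0) and 2 "intake" (1).
y = [0, 0, 0, 0, 0, 0, 1, 1]
w = balanced_weights(y, n_classes=2)
print(w)
```

In this example each of the rare "intake" frames receives three times the weight of a "non-intake" frame, so both classes contribute equally to the mini-batch loss, and the weights sum to the batch size.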
C. Stage II: Detecting intake gestures
Fig. 3. Recording of one session. The spherical video is remapped to equirectangular representation, cropped, and reshaped to square shape.

Table III. Dataset split into training, validation, and test sets.

                 Training                 Validation               Test                     Total
Type             #     Mean [s]  Std [s]  #     Mean [s]  Std [s]  #     Mean [s]  Std [s]  #     Mean [s]  Std [s]
Participants     62    802.25    243.31   20    785.97    243.74   20    891.00    222.27   102   816.46    240.07
Intake Gestures  2924  2.35      1.03     952   2.28      1.00     997   2.29      1.01     4891  2.32      1.02

We follow a maximum search approach [6] to determine
the sparse individual intake gesture times. Based on estimated
frame-level probabilities $p$, we first derive $p'$ by setting all
probabilities below a threshold $p_t$ to zero. This leaves all
frames $\{f : p_{\text{intake},f} \geq p_t\}$ as candidates for detections,
as seen at the bottom of Fig. 2. Subsequently, we perform
a search for local maxima in $p'$ with a minimum distance $d$
between maxima. The intake gesture times are then inferred
from the indices of the maxima.
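
The Stage II maximum search can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `detect_gestures` is hypothetical, the minimum distance is given in frames (2 s at 8 fps would correspond to 16 frames), and the rule that higher peaks take precedence when two maxima are closer than the minimum distance is an assumption.

```python
def detect_gestures(p, p_t, d):
    """Return the frame indices of detected intake gestures.

    p   : list of frame-level probabilities p_intake
    p_t : probability threshold
    d   : minimum distance between detections, in frames
    """
    # p': probabilities below the threshold are set to zero.
    p_thr = [x if x >= p_t else 0.0 for x in p]
    n = len(p_thr)

    # Local maxima of p'. A tie with the right neighbour is resolved
    # in favour of the earlier frame.
    peaks = []
    for i, x in enumerate(p_thr):
        if x <= 0.0:
            continue
        left = p_thr[i - 1] if i > 0 else 0.0
        right = p_thr[i + 1] if i < n - 1 else 0.0
        if x > left and x >= right:
            peaks.append(i)

    # Enforce the minimum distance, visiting higher peaks first.
    kept = []
    for i in sorted(peaks, key=lambda i: -p_thr[i]):
        if all(abs(i - k) >= d for k in kept):
            kept.append(i)
    return sorted(kept)


p = [0.1, 0.6, 0.9, 0.7, 0.2, 0.1, 0.8, 0.95, 0.3, 0.99, 0.2]
print(detect_gestures(p, p_t=0.5, d=4))
```

Dividing the returned indices by the frame rate gives the detection times in seconds.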
We are not aware of any publicly available dataset including
labeled video data of intake gestures. Related studies that
involved collection of video data as ground truth typically do
not make the video data available, and instead focus on the
inertial [6] and audio sensor data [48].
For this research, we collected and labeled video data of 102
participants consuming a standardized meal of lasagna, bread,
yogurt, and water in a group setting (ethics approval H-2017-
0208). The data was collected in sessions of four participants
at a time seated around a round table in a closed room without
outside interference. Participants were invited to consume their
meal in a natural way¹ and encouraged to have a conversation
in the process. A 360fly-4K camera was placed in the center
of the table, recording all four participants simultaneously. As
illustrated in Fig. 3, raw spherical video was first remapped to
equirectangular representation. We then cropped out a separate
video for each individual participant such that the dimensions
include typical intake gestures. Each video was trimmed in
time to only include the duration of the meal, and spatially
scaled to a square shape.
Two independent annotators labeled and cross-checked
the intake gestures in all 102 videos as durations using
ChronoViz. Each gesture is assigned as start timestamp the
point where the final uninterrupted movement towards the
mouth starts; as end timestamp, it is assigned the point when
the participant has finished returning their hand(s) from the
movement or started a different action. Based on the start
and end timestamps, we derive a label for each video frame
according to the following procedure: If a video frame is
taken between start and end of an annotated gesture, it is
assigned the label “intake”. If a video frame is taken outside of
any annotated gestures, it is assigned the label “non-intake”.
The dataset is available from the authors on request.

¹ After the meal, 64 of the 102 participants (63%) responded to the statement
“The presence of the video camera changed my eating behavior” (5-point
Likert scale, ranging from (1) strongly disagree to (5) strongly agree). With
an average score of 2.11, we conclude that participants did not feel that the
presence of the camera considerably affected their eating behavior.

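
The frame labeling procedure described above can be sketched as follows. The function name `label_frames`, the representation of annotations as (start, end) pairs in seconds, and the choice to treat both boundaries as inclusive are assumptions made for illustration.

```python
def label_frames(gestures, n_frames, fps):
    """Derive one label per video frame from annotated gesture durations.

    gestures : list of (start_s, end_s) timestamps in seconds
    n_frames : number of frames in the video
    fps      : frame rate of the video
    Returns a list with 1 for "intake" and 0 for "non-intake".
    """
    labels = [0] * n_frames
    for frame in range(n_frames):
        t = frame / fps  # timestamp of this frame in seconds
        if any(start <= t <= end for start, end in gestures):
            labels[frame] = 1
    return labels


# One gesture from 0.5 s to 1.0 s in a 12-frame clip at 8 fps.
labels = label_frames([(0.5, 1.0)], n_frames=12, fps=8)
print(labels)
```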
We use a global split of our dataset into 62 participants
for training, 20 participants for validation, and 20 participants
for test as summarized in Table III. To reduce computational
burden, we downsample the video from 24 fps to 8 fps, and
resize to dimensions 140×140 (128×128 after augmentation).
A. Stage I: Estimating frame-level intake probability
We apply the models introduced in Section III to classify
frames according to the two labels “intake” and “non-intake”.
For our experiments, we distinguish between models without
and with temporal context:
Models without temporal context (0a-0b) are of interest
as a baseline, and to experimentally compare appearance
and motion features. For appearance, input is the single
target frame, and for motion, optical flow between the
target frame and the one preceding it.
For the models with temporal context (1a-4b), input
consists of 16 frames, which corresponds to 2 seconds at
8 fps. The last of these frames is the prediction target. To
take maximum advantage of the available training data,
we generate input using a window shifting by one frame.
The use of temporal context implies that the first 15 labels
are not predicted.
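
The window generation for models with temporal context can be sketched as below. The generator name `window_indices` is hypothetical; it only produces frame indices, so loading and batching the actual frames is left out.

```python
def window_indices(n_frames, seq_len=16):
    """Yield (input frame indices, target index) for windows shifted by one frame."""
    for target in range(seq_len - 1, n_frames):
        start = target - seq_len + 1
        # The window ends at the target frame, which is the frame being predicted.
        yield list(range(start, target + 1)), target


windows = list(window_indices(n_frames=20))
print(len(windows), windows[0][1], windows[-1][1])
```

For a video with 20 frames this yields 5 windows with targets 15 to 19, which shows why the first 15 frames of each video never appear as prediction targets.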
1) Training: We use the Adam optimizer to train each
model on the training set. Training runs for 60 epochs with
a learning rate starting at 3e-4 and exponentially decaying
at a rate of 0.9 per epoch. Models without temporal context
are trained using batch size 64, while models with temporal
context are trained using batch size 8.³ Using the validation
set, we keep track of the best models in terms of unweighted
average recall (UAR), which is not biased by class imbalance.
³ Batch sizes were chosen considering memory constraints when training on an NVIDIA
Tesla V100 at fp32 precision, and to be consistent across models.
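
Two ingredients of the training setup, the learning rate schedule and the model selection metric, can be sketched in plain Python. The function names are hypothetical, and applying the decay once per epoch (rather than continuously per training step) is an assumption.

```python
def learning_rate(epoch, initial=3e-4, decay=0.9):
    """Exponentially decaying learning rate, evaluated once per epoch."""
    return initial * decay ** epoch


def unweighted_average_recall(y_true, y_pred):
    """Mean of the per-class recalls, so each class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# 8 "non-intake" frames (all correct) and 2 "intake" frames (one missed).
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]
uar = unweighted_average_recall(y_true, y_pred)
print(uar)
```

In this example plain accuracy would be 0.9, while the UAR is 0.75 because the missed "intake" frame halves the recall of the minority class.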
Table IV. Stage I and Stage II results.

                          Features used (a) Temporal     Stage I           Stage II (c)
Model                     Frames   Flows    context (b)  #Params  UAR      p_t    TP   FP1  FP2   FN   F1
0a) Small 2D CNN          X                              4.26M    82.63%   0.957  670  39   287   321  0.674
0a) Small 2D CNN                   X                     4.26M    71.76%   0.793  662  45   1023  329  0.487
0b) ResNet-50 2D CNN      X                              23.5M    86.39%   0.964  829  54   211   162  0.795
0b) ResNet-50 2D CNN               X                     23.5M    71.34%   0.865  661  53   1163  330  0.461
1a) Small 3D CNN          X                 X            4.39M    87.54%   0.997  795  37   169   196  0.798
2a) Small CNN-LSTM        X                 X            4.85M    83.36%   0.983  674  17   104   317  0.755
3a) Small Two-Stream      X        X        X            4.34M    81.96%   0.973  653  36   185   338  0.700
4a) Small SlowFast        X                 X            4.49M    88.71%   0.996  754  31   103   237  0.803
1b) ResNet-50 3D CNN      X                 X            32.2M    88.77%   0.992  775  25   54    216  0.840
2b) ResNet-50 CNN-LSTM    X                 X            24.6M    89.74%   0.996  791  29   38    200  0.856
3b) ResNet-50 Two-Stream  X        X        X            47.0M    85.25%   0.997  806  49   82    185  0.836
4b) ResNet-50 SlowFast    X                 X            36.7M    89.01%   0.987  824  23   83    167  0.858

(a) Frame (appearance) features are raw frames; Flow (motion) features are optical flow computed between adjacent frames.
(b) Temporal context consists of 16 frames, the last of which is the target frame.
(c) Downsampling to 8 fps causes temporally close events to merge, hence total number of intake gestures in the test set is 991.

Fig. 4. The evaluation scheme proposed by Kyritsis et al. [6]. (1) A true
positive is the first detection within each ground truth event; (2) False positives
of type 1 are further detections within the same ground truth event; (3)
False positives of type 2 are detections outside ground truth events; (4) False
negatives are non-detected ground truth events.
For regularization, we use l2 loss with a lambda of 1e-4.
Dropout is used in all small instantiations of our models on
convolutional and dense layers with rate 0.5, but we do not use
dropout for the ResNet-50 instantiations. We also use data
augmentation by dynamically applying random transformations:
small rotations, cropping to size 128x128, horizontal flipping,
and brightness and contrast changes. All models are learned
end-to-end; optical flow is precomputed using Dual TV-L1 [46].
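The augmentation step can be sketched with NumPy as below. This is an illustration under stated assumptions rather than the authors' pipeline: the contrast and brightness ranges and the flip probability are made up for the example, the small rotations are omitted for brevity, and the function name `augment` is hypothetical. For 16-frame inputs, one would presumably draw the random parameters once and apply them to every frame of the window.

```python
import numpy as np

def augment(frame, rng, crop=128):
    """Random crop, horizontal flip, and brightness/contrast jitter.

    `frame` is an (H, W, C) array with values in [0, 1].
    """
    h, w = frame.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    out = frame[top:top + crop, left:left + crop].astype(np.float32)
    if rng.random() < 0.5:             # assumed flip probability
        out = out[:, ::-1]
    contrast = rng.uniform(0.8, 1.2)    # assumed range
    brightness = rng.uniform(-0.1, 0.1) # assumed range
    mean = out.mean()
    out = (out - mean) * contrast + mean + brightness
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
frame = rng.random((140, 140, 3))   # stand-in for a resized 140x140 video frame
augmented = augment(frame, rng)
print(augmented.shape)
```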
2) Transfer learning and warmstarting for better initial
parameters: While the initial small 2D CNN is trained from
scratch, we use it to warmstart the convolutional layers of
both the small CNN-LSTM and the small Two-Stream model.
The ResNet-50 2D CNN is initialized using an off-the-shelf
ResNet-50 trained on the ImageNet database. To fit ImageNet
dimensions, we resize our inputs for this model to 224x224,
as listed in Table II. We use the ResNet-50 2D CNN to
warmstart the convolutional layers of both the ResNet-50
CNN-LSTM and the ResNet-50 Two-Stream model. All 3D-
CNN and SlowFast models are trained from scratch.
B. Stage II: Detecting intake gestures
For the detection of intake gestures, we build on the
exported frame-level probabilities using the models trained in
Stage I. We then apply the approach described in Section III-C
to determine sparse detections.
Fig. 5. Comparing model performance in terms of F1 scores, grouped by
(a) features used (flows, both, frames), (b) temporal context (without,
with), and (c) model depth (small, ResNet). It is apparent
that (a) models using frames as features perform better than models using
optical flow, (b) models with temporal context tend to perform better than
models without, and (c) larger (deeper) models tend to perform better. Models
are color-coded according to Table IV.

1) Evaluation scheme: We use the evaluation scheme proposed
by Kyritsis et al. [6] as seen in Fig. 4. According to the
scheme, one correct detection per ground truth event counts
as a true positive ($TP$), while further detections within the
same ground truth event are false positives of type 1 ($FP_1$).
Detections outside ground truth events are false positives of
type 2 ($FP_2$), and non-detected ground truth events count
as false negatives ($FN$). Based on the aggregate counts, we
calculate precision ($\frac{TP}{TP + FP_1 + FP_2}$), recall
($\frac{TP}{TP + FN}$), and the F1 score
($\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$).
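The counting rules of this evaluation scheme can be sketched as follows. The sketch assumes that detections are frame indices, that ground truth events are non-overlapping inclusive `(start, end)` intervals, and that the function names are hypothetical; it is not the reference implementation of Kyritsis et al. [6].

```python
def evaluate(detections, events):
    """Count TP, FP1, FP2, FN for detections against ground truth events."""
    tp = fp1 = fp2 = 0
    hit = [False] * len(events)
    for d in sorted(detections):
        for i, (start, end) in enumerate(events):
            if start <= d <= end:
                if hit[i]:
                    fp1 += 1        # further detection within the same event
                else:
                    hit[i] = True
                    tp += 1         # first detection within this event
                break
        else:
            fp2 += 1                # detection outside all events
    fn = hit.count(False)           # events without any detection
    return tp, fp1, fp2, fn

def f1_score(tp, fp1, fp2, fn):
    """F1 from the aggregate counts, as defined in the text."""
    precision = tp / (tp + fp1 + fp2) if (tp + fp1 + fp2) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

events = [(10, 20), (40, 50), (70, 80)]
detections = [12, 18, 30, 45]
counts = evaluate(detections, events)
print(counts, f1_score(*counts))

# Recomputing F1 from the Table IV counts of the ResNet-50 SlowFast model
print(round(f1_score(824, 23, 83, 167), 3))
```

In the toy example, the detections at 12 and 45 are true positives, 18 is a false positive of type 1, 30 is a false positive of type 2, and the third event is a false negative.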
2) Parameter setting: The approach described in Section
III-C requires setting two hyperparameters: the minimum
distance between detections d, and the threshold pt. We follow
Kyritsis et al. [6] and set d = 2 s, which approximates the
mean duration of intake gestures, see Table III. Since we only
run one final evaluation of each model on the test set, we use
the validation set to approximate a good threshold pt. Hence,
for each model, we run a grid search between 0.5 and 1 on
the validation set using a step size of 0.001 and choose the
threshold that maximizes F1. Table IV lists the final
thresholds p̂t.
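The threshold search can be sketched as below. Since Section III-C is not reproduced here, the detection rule is an assumption: frames at or above the threshold are visited in order of decreasing probability and kept only if they are at least `min_dist` frames (16 frames, i.e. 2 s at 8 fps) from every detection kept so far. The scoring function, the toy data, and all names are hypothetical; ties are resolved in favor of the lowest threshold.

```python
def detect(probs, threshold, min_dist):
    """Greedy peak picking (assumed rule): strongest frames first."""
    candidates = sorted(
        (i for i, p in enumerate(probs) if p >= threshold),
        key=lambda i: probs[i],
        reverse=True,
    )
    kept = []
    for i in candidates:
        if all(abs(i - j) >= min_dist for j in kept):
            kept.append(i)
    return sorted(kept)

def grid_search(probs, score_fn, min_dist=16, start=0.5, stop=1.0, step=0.001):
    """Return the first threshold on the grid that maximizes score_fn."""
    best_t, best_score = None, -1.0
    n_steps = int(round((stop - start) / step))
    for k in range(n_steps + 1):
        t = start + k * step
        score = score_fn(detect(probs, t, min_dist))
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy validation data: two real gestures (frames 20, 60), one spurious bump (85).
probs = [0.1] * 100
probs[19], probs[20], probs[21] = 0.7, 0.9, 0.7
probs[60] = 0.95
probs[85] = 0.6
true_peaks = {20, 60}

def toy_f1(dets):
    tp = len(true_peaks.intersection(dets))
    if tp == 0:
        return 0.0
    precision = tp / len(dets)
    recall = tp / len(true_peaks)
    return 2 * precision * recall / (precision + recall)

best_t, best_score = grid_search(probs, toy_f1)
print(detect(probs, 0.5, 16), best_t, best_score)
```

At the lowest threshold the spurious bump is detected as well; the search settles on the first threshold that excludes it while still retaining both real gestures.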
C. Results
The best result is achieved by the state-of-the-art ResNet-50
SlowFast network with an F1 score of 0.858. In general, we
find that model accuracy is impacted by three factors of model
choice, namely (i) frame or flow features, (ii) with or without
temporal context, and (iii) model depth. Fig. 5 illustrates this
by plotting the F1 values grouped by each of these factors.

Fig. 6. Aggregating p_intake by model for all ground truth events in the
validation set, with panels (a) 2D CNN (flow), (b) 2D CNN (frame),
(c) 3D CNN, (d) CNN-LSTM, (e) Two-Stream, and (f) SlowFast.
Predictions have been aligned in time and linearly interpolated. We
plot the median and [q25, q75] interval for small and ResNet-50 instantiations
respectively. Models are color-coded according to Table IV.
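The aggregation behind Fig. 6 (alignment in time, linear interpolation, median and quartiles) could be computed along the following lines. The alignment method is an assumption: each event's predictions are mapped onto a normalized time axis from 0 to 1 before interpolation. The function name and the toy data are hypothetical.

```python
import numpy as np

def aggregate_events(event_probs, n_points=50):
    """Resample each event's p_intake curve to a common grid and summarize."""
    grid = np.linspace(0.0, 1.0, n_points)
    resampled = []
    for probs in event_probs:
        probs = np.asarray(probs, dtype=float)
        x = np.linspace(0.0, 1.0, len(probs))   # normalized event time (assumed)
        resampled.append(np.interp(grid, x, probs))
    resampled = np.stack(resampled)
    q25, median, q75 = np.percentile(resampled, [25, 50, 75], axis=0)
    return grid, q25, median, q75

# Three toy events of different durations (in frames)
event_probs = [
    [0.1, 0.9, 0.1],
    [0.2, 0.8, 0.8, 0.2],
    [0.0, 0.5, 1.0, 0.5, 0.0],
]
grid, q25, median, q75 = aggregate_events(event_probs)
print(median.shape)
```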
1) Frame and flow features, Fig. 5 (a): Using the 2D
CNN, we are able to directly compare how frame (appearance)
and flow (motion) features affect model performance. For
the small and ResNet instantiations, frame features lead to a
relative F1 improvement of 38% and 72%, respectively, over flow features.
An improvement is also measurable for UAR. Further, the
Two-Stream models, which mainly rely on flow features,
perform worse than the other models with temporal context.
We can conclude that for detection of intake gestures, more
information is carried by visual appearance than by motion.
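The relative improvements quoted above follow directly from the F1 values of the 2D CNN rows in Table IV. A minimal check, with a hypothetical helper name:

```python
def relative_improvement(new, old):
    """Relative gain of `new` over `old`."""
    return (new - old) / old

# F1 scores of the 2D CNN models from Table IV
small = relative_improvement(0.674, 0.487)    # Small 2D CNN: frames vs. flow
resnet = relative_improvement(0.795, 0.461)   # ResNet-50 2D CNN: frames vs. flow
print(f"{small:.1%} {resnet:.1%}")
```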
2) Temporal context, Fig. 5 (b): To assess the usefulness
of temporal context, we compare the accuracies of our models
with and without temporal context. The straightforward
extension of Small 2D CNN to Small 3D CNN adds a 17% relative
F1 improvement. Comparing the best models with (ResNet-50
SlowFast) and without temporal context (ResNet-50 2D CNN),
we find a relative F1 improvement of 8%. We conclude that
temporal context considerably improves model accuracy.
Considering model choice, we observe that the Small 3D
CNN is superior to its CNN-LSTM counterpart; however, the
opposite is true for the ResNet-50 instantiations. This may be
due to the fact that for the ResNet instantiations, the CNN-
LSTM is pre-trained on ImageNet, while the 3D CNN is
not. We conclude that the 3D CNN could be useful for slim
models (e.g., for mobile devices), but for larger models, all
architectures with temporal context should be considered.
3) Model depth, Fig. 5 (c): We also see that the deeper
ResNet-50 instantiations achieve higher F1 scores than the
small ones for all combinations except the flow-based 2D
CNN. Note that the improvement due to model depth is
especially noticeable in the F1 score, and less so in UAR.
D. Why do frame features perform better?
To help explain why frames perform better as features for
this task, we took a closer look at some example model
predictions from the validation set. It appears that flows are
in general useful as features; however, the data shows that
in comparison to frame models, flow models are less certain
about their predictions. For example, Fig. 7 (a) shows how
during periods with no intake gestures, small movements such
as using cutlery can cause higher uncertainty in flow models.
On the other hand, Fig. 7 (b) shows how flow models are
also overall less confident when correctly identifying intake
gestures. This can also be observed by looking at aggregated
predictions for all events in Fig. 6: Models based solely
on flows (a) are less certain about predictions, while their
predictions also contain more variance than models based on
frames (b). Further, this is also reflected in the lower thresholds
required to trigger a detection for flow models, as is evident
from Table IV. These lower thresholds and uncertainty are
linked to the large number of false positives of these models.

Fig. 7. Example situations showing uncertainty of flow (motion) models
compared to frame (appearance) models: (a) cutting lasagne, (b) preparing
intake. Curves: ground truth, Small 2D CNN (frames), ResNet 2D CNN
(frames), Small 2D CNN (flow), ResNet 2D CNN (flow). Section S1 of the
supplementary material provides a multi-frame version of this figure.

Fig. 8. Example situations where models with temporal context are superior
to models without temporal context: (a) raised fork, (b) blowing nose.
Curves: ground truth, Small 2D CNN (without), ResNet 2D CNN (without),
Small 3D CNN (with), ResNet 3D CNN (with). Section S2 of the supplementary
material provides a multi-frame version of this figure.
E. Why do models with temporal context perform better?
Our results show that while models based on single frames
perform reasonably well, there is measurable improvement
when adding temporal context. Hence, we also looked at this
comparison for example model predictions from the validation
set to make the difference easier to interpret. Indeed,
in some cases, it appears intuitive to a human observer
how the temporal context is helpful to interpret the target
frame. For example, in Fig. 8 (a), the participant keeps the
fork raised after completing an intake gesture. A frame by
itself can seem to be part of an intake gesture, while the
participant is actually resting this way or is being interrupted.
Without temporal context, the 2D CNN models are unaware
of this context, resulting in poor performance. Availability of
temporal context also helps models to become more confident
in their predictions. Further, errors due to outliers are more
easily avoidable with temporal context, such as the participant
blowing their nose in Fig. 8 (b). On the aggregate level, Fig. 6
illustrates how predictions by models with temporal context
(c)-(f) fit the ground truth events more snugly and show less
variance.

Fig. 9. Example situations where the best models cause false negatives (a-b)
and false positives of type 1 (c) and type 2 (d): (a) eating bread crust,
(b) licking finger, (c) sipping water, (d) too hot. Curves: ground truth,
ResNet 3D CNN, ResNet CNN-LSTM, ResNet SlowFast. Note that a true positive
always precedes false positives of type 1. Section S3 of the supplementary
material provides a multi-frame version of this figure.
F. Where do the models struggle?
Examining the results in Table IV, it appears that mainly
false negatives and also false positives of type 2 are prob-
lematic for our best models. To help understand in what
circumstances the models struggle, we compiled examples
from the validation set where the best models make mistakes.
The examples show that these mistakes tend to happen in cases
of “outlier behavior” that differs substantially from the typical
behavior in the dataset. False negatives occur mostly for intake
gestures that are less noticeable or less common, such as eating
bread crust or licking a finger, see Fig. 9 (a) and (b). An
example of a false positive of type 2 is when the participant
interrupts an intake gesture as depicted in Fig. 9 (d). We see
false positives of type 1 as mostly representing a shortcoming
of the Stage II approach, i.e., when the duration of an intake
gesture exceeds 2 seconds, seen in Fig. 9 (c).
In this paper, we have demonstrated the feasibility of detect-
ing intake gestures from video sourced through a 360-degree
camera. Our two-stage approach involves learning frame-level
probabilities using deep architectures proposed in the context
of video action recognition (Stage I), and a search algorithm to
detect individual intake gestures (Stage II). Through evaluation
of a variety of models, our results show that appearance
features in the form of individual raw frames are well suited
for this task. Further, while single frames on their own can
lead to useful results with F1 of up to 0.795, the best model
considering a temporal context of multiple frames achieves a
superior F1 of 0.858. This result is achieved with a state-of-
the-art SlowFast network [16] using ResNet-50 as backbone.
Overall, we see several benefits and opportunities that the
use of video holds for dietary monitoring. First, the
proliferation of 360-degree video reduces the practical challenges
of recording images of human eating occasions. This could
be used to capture the intake of multiple individuals with
a single camera positioned in the center of a table (e.g.,
families eating from a shared dish [49]). Second, the models
could be leveraged to support dietitians in reviewing videos of
intake occasions. For instance, instead of watching a
twenty-minute video, imagery of the actual intake gestures could
be automatically extracted for assessment. Third, the models
could be used to semi-automate the ground truth annotation
process (e.g., for inertial sensors) by pre-annotating the videos.
Finally, the models could be used to further the development
of fully automated dietary monitoring [7] (e.g., care-taking
robots, life-logging, patient monitoring).
As a limitation of our approach, we noted that the distribu-
tion of participant behavior has a “fat tail” as it includes many
examples of outlier behavior that models misinterpret (e.g.,
sudden interruption due to a conversation, blowing on food).
To deal with such events, future research may employ larger
databases of samples to train models. Further, in comparison
to approaches based on inertial sensors, our approach has a
limitation in that it requires the participant to consume their
meal at a table equipped with a camera. Hence, our vision
models should be directly benchmarked against models based
on inertial sensor data to determine their relative strengths and
weaknesses. Going one step further, fusion of both modalities
could also be explored. Finally, Stages I and II could be unified
into a single end-to-end learning model using CTC loss [50],
which may alleviate some of the shortcomings of the current
approach. However, it needs to be considered that (i) this
is directly only feasible for the CNN-LSTM model without
increasing the requirement for GPU memory, and (ii) a larger
temporal context and dataset may be required.
We gratefully acknowledge the support by the Bill &
Melinda Gates Foundation [OPP1171389]. This work was
additionally supported by an Australian Government Research
Training Program (RTP) Scholarship.
[1] C. Weekes, A. Spiro, C. Baldwin, K. Whelan, J. Thomas, D. Parkin,
and P. Emery, “A review of the evidence for the impact of improving
nutritional care on nutritional and clinical outcomes and cost,” J. Human
Nutrition Dietetics, vol. 22, pp. 324–335, 2009.
[2] P. V. Rouast, M. T. P. Adam, T. Burrows, R. Chiong, and M. E. Rollo,
“Using deep learning and 360 video to detect eating behavior for user
assistance systems,” in Proc. Europ. Conf. Information Systems, 2018.
[3] WHO, “Noncommunicable diseases progress monitor, 2017,” Geneva:
World Health Organization, Tech. Rep., 2017.
[4] S. W. Lichtman, K. Pisarska, E. R. Berman, M. Pestone, H. Dowling,
E. Offenbacher, H. Weisel, S. Heshka, D. E. Matthews, and S. B. Heyms-
field, “Discrepancy between self-reported and actual caloric intake and
exercise in obese subjects,” New England J. Medicine, vol. 327, no. 27,
pp. 1893–1898, 1992.
[5] T. Vu, F. Lin, N. Alshurafa, and W. Xu, “Wearable food intake
monitoring technologies: A comprehensive review,” Computers, vol. 6,
no. 1, p. 4, 2017.
[6] K. Kyritsis, C. Diou, and A. Delopoulos, “Modeling wrist micromove-
ments to measure in-meal eating behavior from inertial sensor data,”
IEEE J. Biomedical and Health Informatics, 2019.
[7] S. Hantke, F. Weninger, R. Kurle, F. Ringeval, A. Batliner, A. E.-
D. Mousa, and B. Schuller, “I hear you eat and speak: Automatic
recognition of eating condition and food type, use-cases, and impact
on asr performance,” PloS one, vol. 11, no. 5, p. e0154486, 2016.
[8] E. Thomaz, I. Essa, and G. D. Abowd, “A practical approach for
recognizing eating moments with wrist-mounted inertial sensing,” in
Proc. UbiComp. ACM, 2015, pp. 1029–1040.
[9] M. Mirtchouk, C. Merck, and S. Kleinberg, “Automated estimation of
food type and amount consumed from body-worn audio and motion
sensors,” in Proc. UbiComp, 2016, pp. 451–462.
[10] A. Braeken, P. Porambage, A. Gurtov, and M. Ylianttila, “Secure and
efficient reactive video surveillance for patient monitoring,” Sensors,
vol. 16, no. 1, pp. 1–13, 2016.
[11] A. Hall, C. B. Wilson, E. Stanmore, and C. Todd, “Implementing moni-
toring technologies in care homes for people with dementia: a qualitative
exploration using normalization process theory,” Int. J. Nursing Studies,
vol. 72, pp. 60–70, 2017.
[12] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,
pp. 436–444, 2015.
[13] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks
for human action recognition,” IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 35, no. 1, pp. 221–231, 2013.
[14] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venu-
gopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional
networks for visual recognition and description,” in Proc. CVPR, 2015,
pp. 2625–2634.
[15] K. Simonyan and A. Zisserman, “Two-stream convolutional networks
for action recognition in videos,” in Proc. NIPS, 2014, pp. 568–576.
[16] C. Feichtenhofer, H. Fan, J. Malik, and K. He, “Slowfast networks for
video recognition,” arXiv preprint arXiv:1812.03982, 2018.
[17] G. Ciocca, P. Napoletano, and R. Schettini, “Learning cnn-based features
for retrieval of food images,” in Proc. Int. Conf. Image Analysis and
Processing, 2017, pp. 426–434.
[18] C. Feichtenhofer, A. Pinz, R. P. Wildes, and A. Zisserman, “What have
we learned from deep representations for action recognition?” in Proc.
CVPR, 2018, pp. 7844–7853.
[19] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and
L. Fei-Fei, “Large-scale video classification with convolutional neural
networks,” in Proc. CVPR, 2014, pp. 1725–1732.
[20] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new
model and the kinetics dataset,” in Proc. CVPR, 2017, pp. 4299–4308.
[21] G. Block, “A review of validations of dietary assessment methods,” Am.
J. Epidemiology, vol. 115, no. 4, pp. 492–505, 1982.
[22] J. Chen and C.-W. Ngo, “Deep-based ingredient recognition for cooking
recipe retrieval,” in Proc. Multimedia Conf., 2016, pp. 32–41.
[23] G. Ciocca, P. Napoletano, and R. Schettini, “Cnn-based features for
retrieval and classification of food images,” Comput. Vision Image
Understanding, vol. 176, pp. 70–77, 2018.
[24] M. Puri, Z. Zhu, Q. Yu, A. Divakaran, and H. Sawhney, “Recognition
and volume estimation of food intake using a mobile device,” in Proc.
Workshop on Applications of Computer Vision, 2009, pp. 1–8.
[25] W. Zhang, Q. Yu, B. Siddiquie, A. Divakaran, and H. Sawhney, ““snap-
n-eat” food recognition and nutrition estimation on a smartphone,” J.
Diabetes Science Technol., vol. 9, no. 3, pp. 525–533, 2015.
[26] Y. Dong, J. Scisco, M. Wilson, E. Muth, and A. Hoover, “Detecting
periods of eating during free-living by tracking wrist motion,” IEEE J.
Biomedical and Health Informatics, vol. 18, no. 4, pp. 1253–1260, 2014.
[27] X. Ye, G. Chen, Y. Gao, H. Wang, and Y. Cao, “Assisting food journaling
with automatic eating detection,” in Proc. CHI Conf. Extended Abstracts
on Human Factors in Computing Systems. ACM, 2016, pp. 3255–3262.
[28] E. Robinson, E. Almiron-Roig, F. Rutters, C. de Graaf, C. G. Forde,
C. Tudur Smith, S. J. Nolan, and S. A. Jebb, “A systematic review and
meta-analysis examining the effect of eating rate on energy intake and
hunger,” Am. J. Clinical Nutrition, vol. 100, no. 1, pp. 123–151, 2014.
[29] H. Heydarian, M. Adam, T. Burrows, C. Collins, and M. E. Rollo,
“Assessing eating behaviour using upper limb mounted motion sensors:
A systematic review,” Nutrients, vol. 11, no. 5, p. 1168, 2019.
[30] O. Amft, M. Stager, P. Lukowicz, and G. Troster, “Analysis of chewing
sounds for dietary monitoring,” in Proc. UbiComp, 2005, pp. 56–72.
[31] S. Päßler, M. Wolff, and W.-J. Fischer, “Food intake monitoring: an
acoustical approach to automated food intake activity detection and
classification of consumed food,” Physiological Measurement, vol. 33,
no. 6, pp. 1073–1093, 2012.
[32] E. S. Sazonov and J. M. Fontana, “A sensor system for automatic
detection of food intake through non-invasive monitoring of chewing,”
IEEE Sensors Journal, vol. 12, no. 5, pp. 1340–1348, 2012.
[33] O. Amft, H. Junker, and G. Troster, “Detection of eating and drinking
arm gestures using inertial body-worn sensors,” in Proc. Int. Symp.
Wearable Computers. IEEE, 2005, pp. 160–163.
[34] Y. Shen, J. Salley, E. Muth, and A. Hoover, “Assessing the accuracy of
a wrist motion tracking method for counting bites across demographic
and food variables,” IEEE J. Biomedical and Health Informatics, vol. 21,
no. 3, pp. 599–606, 2017.
[35] S. Zhang, W. Stogin, and N. Alshurafa, “I sense overeating: Motif-
based machine learning framework to detect overeating using wrist-worn
sensing,” Information Fusion, vol. 41, pp. 37–47, 2018.
[36] J. Gao, A. G. Hauptmann, A. Bharucha, and H. D. Wactlar, “Dining
activity analysis using a hidden markov model,” in Proc. Int. Conf.
Pattern Recognition. IEEE, 2004, pp. 915–918.
[37] K. Okamoto and K. Yanai, “Grillcam: A real-time eating action recogni-
tion system,” in Proc. Int. Conf. Multimedia Modeling. Springer, 2016,
pp. 331–335.
[38] H. M. Hondori, M. Khademi, and C. V. Lopes, “Monitoring intake
gestures using sensor fusion (microsoft kinect and inertial sensors) for
smart home tele-rehab setting,” in Proc. Healthcare Innovation Conf.,
2012, pp. 36–39.
[39] J. S. Tham, Y. C. Chang, and M. F. A. Fauzi, “Automatic identification
of drinking activities at home using depth data from rgb-d camera,” in
Proc. Int. Conf. Control, Automation and Information Sciences. IEEE,
2014, pp. 153–158.
[40] A. Klaser, M. Marszałek, and C. Schmid, “A spatio-temporal descriptor
based on 3d-gradients,” in Proc. BMVC, 2008, pp. 1–10.
[41] H. Wang, A. Kläser, C. Schmid, and L. Cheng-Lin, “Action recognition
by dense trajectories,” in Proc. CVPR, 2011, pp. 3169–3176.
[42] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning
spatiotemporal features with 3d convolutional networks,” in Proc. ICCV,
2015, pp. 4489–4497.
[43] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[44] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream
network fusion for video action recognition,” in Proc. CVPR, 2016, pp.
[45] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proc. CVPR, 2016, pp. 770–778.
[46] C. Zach, T. Pock, and H. Bischof, “A duality based approach for realtime
tv-l1 optical flow,” in DAGM: Pattern Recognition, 2007, pp. 214–223.
[47] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Proc. NIPS, 2012, pp.
[48] C. Merck, C. Maher, M. Mirtchouk, M. Zheng, Y. Huang, and S. Klein-
berg, “Multimodality sensing for eating recognition,” in Proc. Int. Conf.
Pervasive Computing Technologies for Healthcare, 2016, pp. 130–137.
[49] T. Burrows, C. Collins, M. T. P. Adam, K. Duncanson, and M. Rollo,
“Dietary assessment of shared plate eating: A missing link,” Nutrients,
vol. 11, no. 4, pp. 1–14, 2019.
[50] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, “Connection-
ist temporal classification: labelling unsegmented sequence data with
recurrent neural networks,” in Proc. ICML, 2006, pp. 369–376.
Philipp V. Rouast received the B.Sc. and M.Sc.
degrees in Industrial Engineering from Karlsruhe
Institute of Technology, Germany, in 2013 and
2016 respectively. He is currently working towards
the Ph.D. degree in Information Systems and is
a graduate research assistant at The University of
Newcastle, Australia. His research interests include
deep learning, affective computing, HCI, and re-
lated applications of computer vision. Find him at
Marc T. P. Adam is a Senior Lecturer in Computing
and Information Technology at the University of
Newcastle, Australia. In his research, he investigates
the interplay of human users’ cognition and affect
in human-computer interaction. He is a founding
member of the Society for NeuroIS. He received an
undergraduate degree in Computer Science from the
University of Applied Sciences Würzburg, Germany,
and a PhD in Economics of Information Systems
from Karlsruhe Institute of Technology, Germany.
... Two input descriptors are used for action representation. Another work includes is, in which Ahmad Z. et al. [80] proposes a technique for Human Action Recognition (HAR) that uses a Convolutional Neural Network (CNN). The depth data sequences from the motion sensing devices are converted into images and fed into a CNN rather than using any conventional or statistical method. ...
... a. Handling complex anomaly detection in crowded environment: The existing summarization techniques show least performance when the video comprises of crowded environment or complex scenario such as terrorist attack, suicide bomber in public rally, or theft in crowd. Hence, handling of these complex scenarios always leads to lesser accuracy while detecting the normal and abnormal activity or summarizing the video content [80,95]. Hence the major challenges here is to differentiate between the normal and abnormal scenes, like in weather condition a slight deviation from normal can be considered as normal but in case of medical a slight deviation from normal is considered as abnormal. ...
Full-text available
The exponential growth in the usage of computing technologies in various applications has led to the creation of huge amount of multimedia information such as, video, audio, and text. The enormous amount of video data generated over the past years necessitates the use of video summarization techniques that has become an emerging field of research. These techniques may facilitate quick browsing, indexing and faster sharing of content among various sources. Video summarization has been popular method to generate a short summary of a longer sized video and these approaches may be broadly classified into handcrafted (using features descriptors) or deep learning (DL) based algorithms. In this paper, we expound a comprehensive review of state-of-the-art (SOTA) techniques for video summarization from traditional to modern data-driven approaches. In addition, we proposed a taxonomy for the classification of video summarization methods based on a plenty of criteria. We also present an analysis of evaluation protocols for these approaches using benchmark datasets and performance metrices. We identify and list various research challenges specifically for each sub-category of video summarization. It may be clearly inferred that modern deep learning-based approaches outperformed traditional methods in terms of accuracy with an additional training overhead. Furthermore, most of the handcrafted-based approaches offer limited performance in dynamic video scenario and there exist several inconsistencies such as scaling or rotational variations under different illumination conditions. Besides, our analysis investigates that multi-criteria-based video summarization is an area that requisite further exploration by the research community. This survey may serve as a reference article to the new researchers for carrying out investigations in this active field of computer vision.
... The video analysis approaches are widely deployed in a variety of applications such as surveillance [21,66,70], video forensic, traffic monitoring [105], smart vehicles [23,34], smart education [41], healthcare [42,65], and etc. The smart city is an emerging concept in many countries, where services to the citizens are offered by government by intelligent computational mechanisms. ...
Full-text available
The multifarious video contents generated through various applications is growing exponentially resulting in huge volumes. Analyzing these contents manually is a cumbersome task particularly in terms of extracting key information by satisfying a criterion. The concept of smart city is popular across the world to develop technology-driven cities where facilities are offered intelligently. In smart cities, the video analysis may play a crucial role in several real-life applications such as, smart security surveillance, traffic monitoring, video forensics, sports, entertainment, medical, etc. Thus, video analysis plays a vital role where larger real-time videos can be intelligently analyzed to detect key interesting patterns to yield an application-centric shorter summary. Moreover, machine or deep paradigms can be applied on video data generated in various smart applications in the smart cities to create a real-time model. In this study, we explore the usage of video analysis in various real-time applications in smart cities. Hence, the work aims to expound a detailed investigation of computer vision-based video analysis approaches in various aspects of smart cities. Besides, a generic video analysis layered architecture is also presented which highlights the deployment of video analysis-centric approaches for real-life smart cities facilities. Our analysis of the existing approaches clearly demonstrates the pertinency of video analysis in several smart city’s mundane infrastructure. However, the study also reveals numerous scopes where video analysis yet to be explored and that offers a clear insight to the researchers. In addition to opportunities, our study identifies some open research challenges to the active research community. Moreover, the survey can serve as a reference to the investigators as well as to the planning and development authorities of smart cities.
... Computer hardware advancements have enabled deep learning to play a greater role in Computer Vision (CV). All the time, Convolutional Neural Networks (CNNs) [7] have been able to effectively execute tasks such as detection [8], segmentation [9] and classification [10]. In recent years, with the successful application of Transformers in Natural Language Processing (NLP), the Transformer-based model, namely Vision Transformer (ViT), have also shown the ability to compete with CNN in CV. ...
Full-text available
Knee OsteoArthritis (KOA) is a prevalent musculoskeletal disorder that causes decreased mobility in seniors. The lack of sufficient data in the medical field is always a challenge for training a learning model due to the high cost of labelling. At present, deep neural network training strongly depends on data augmentation to improve the model's generalization capability and avoid over-fitting. However, existing data augmentation operations, such as rotation, gamma correction, etc., are designed based on the data itself, which does not substantially increase the data diversity. In this paper, we proposed a novel approach based on the Vision Transformer (ViT) model with Selective Shuffled Position Embedding (SSPE) and a ROI-exchange strategy to obtain different input sequences as a method of data augmentation for early detection of KOA (KL-0 vs KL-2). More specifically, we fixed and shuffled the position embedding of ROI and non-ROI patches, respectively. Then, for the input image, we randomly selected other images from the training set to exchange their ROI patches and thus obtained different input sequences. Finally, a hybrid loss function was derived using different loss functions with optimized weights. Experimental results show that our proposed approach is a valid method of data augmentation as it can significantly improve the model's classification performance.
... The number of bites taken and the type of food consumed was then end-to-end deduced from the recorded egocentric dietary intake video. Recently, 360° cameras have been used to monitor and assess dietary intake in food sharing [44], [45] or communal eating [46] scenarios in a passive way. Although visual-based monitoring can lead to more comprehensive dietary assessment, the use of the cameras to record often entails privacy issues. ...
Camera-based passive dietary intake monitoring is able to continuously capture the eating episodes of a subject, recording rich visual information, such as the type and volume of food being consumed, as well as the eating behaviors of the subject. However, there currently is no method that is able to incorporate these visual clues and provide a comprehensive context of dietary intake from passive recording (e.g., is the subject sharing food with others, what food the subject is eating, and how much food is left in the bowl). On the other hand, privacy is a major concern while egocentric wearable cameras are used for capturing. In this article, we propose a privacy-preserved secure solution (i.e., egocentric image captioning) for dietary assessment with passive monitoring, which unifies food recognition, volume estimation, and scene understanding. By converting images into rich text descriptions, nutritionists can assess individual dietary intake based on the captions instead of the original images, reducing the risk of privacy leakage from images. To this end, an egocentric dietary image captioning dataset has been built, which consists of in-the-wild images captured by head-worn and chest-worn cameras in field studies in Ghana. A novel transformer-based architecture is designed to caption egocentric dietary images. Comprehensive experiments have been conducted to evaluate the effectiveness and to justify the design of the proposed architecture for egocentric dietary image captioning. To the best of our knowledge, this is the first work that applies image captioning for dietary intake assessment in real-life settings.
... Of the papers listed, all used classification of static frames (2D) to detect drinking. Rouast et al. as well as our previous work using RGB signals show that using multiple frames for drink recognition can yield better results compared to individual frames [3], [15]. Rouast et al. previously showed this with meal-time events, comparing multiple 3D deep learning architectures to 2D deep learning architectures, achieving an accuracy of . ...
It is important for humans to remain hydrated, particularly for older adults who are at a greater risk of dehydration and may forget to drink. Monitoring liquid intake and getting reminders to drink throughout the day is a useful solution to increase hydration levels. The objective of this paper is to automatically detect drink events from multiple containers in a simulated home environment using a vision-based approach. The proposed work compares the use of depth and RGB (red, green, blue) cameras for this task. In this paper, we compared 2D and 3D Convolutional Neural Networks (CNN) using RGB and depth cameras. We collected data from nine participants performing drinking, eating and other Activities of Daily Living (ADL) in a simulated home environment. We found that for the 3D models, the RGB and depth camera inputs provided very similar F1-scores for both 10-Fold (94.3% vs 93.9%, respectively) and Leave-One-Subject-Out (LOSO) cross validation (84.2% vs 86.2%, respectively). This is a promising result as depth cameras also mitigate the challenges to privacy of RGB-based models. The 3D CNN models outperformed the 2D models, thereby creating a more robust system. Depth cameras are a useful alternative to RGB cameras with equal performance in identifying drinking events.
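The abstract reports both 10-fold and Leave-One-Subject-Out (LOSO) results. The difference between the two protocols lies in how samples are split: LOSO holds out every sample of one participant per fold, so the model is always tested on an unseen person. A minimal sketch of such a split follows, assuming each sample carries a participant identifier; the function name and the toy identifiers are hypothetical and not taken from the paper.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) per subject.

    subject_ids[i] is the participant that sample i was recorded from.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

# Six toy samples from three hypothetical participants.
subjects = ["p1", "p1", "p2", "p3", "p3", "p3"]
folds = list(leave_one_subject_out(subjects))
```

A sample-level 10-fold split can place clips of the same participant in both the training and the test set, which is one plausible reason why the LOSO scores quoted above are lower than the 10-fold scores.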
Unhealthy dietary habits are considered as the primary cause of various chronic diseases, including obesity and diabetes. The automatic food intake monitoring system has the potential to improve the quality of life (QoL) of people with diet-related diseases through dietary assessment. In this work, we propose a novel contactless radar-based approach for food intake monitoring. Specifically, a Frequency Modulated Continuous Wave (FMCW) radar sensor is employed to recognize fine-grained eating and drinking gestures. The fine-grained eating/drinking gesture contains a series of movements from raising the hand to the mouth until putting away the hand from the mouth. A 3D temporal convolutional network with self-attention (3D-TCN-Att) is developed to detect and segment eating and drinking gestures in meal sessions by processing the Range-Doppler Cube (RD Cube). Unlike previous radar-based research, this work collects data in continuous meal sessions (more realistic scenarios). We create a public dataset comprising 70 meal sessions (4,132 eating gestures and 893 drinking gestures) from 70 participants with a total duration of 1,155 minutes. Four eating styles (fork & knife, chopsticks, spoon, hand) are included in this dataset. To validate the performance of the proposed approach, seven-fold cross-validation method is applied. The 3D-TCN-Att model achieves a segmental F1-score of 0.896 and 0.868 for eating and drinking gestures, respectively. The results of the proposed approach indicate the feasibility of using radar for fine-grained eating and drinking gesture detection and segmentation in meal sessions.
Videofluoroscopic swallowing study (VFSS) visualizes the swallowing movement by using X-ray fluoroscopy and is the most widely used method for dysphagia examination. Temporal parameters are among the most important indicators for swallowing assessment. However, most of this information is acquired manually, which is time-consuming and makes objectivity and accuracy difficult to ensure. In this paper, we propose to formulate this task as a temporal action localization task and solve it using deep neural networks. However, actions in VFSS are characterized by small motion targets, small action amplitudes, large sample variances, short durations, and variations in duration. Furthermore, existing methods often rely on daily behaviors, which makes locating and recognizing micro-actions more challenging. To address the above issues, and because benchmarks are lacking, we first collect and annotate the VFSS micro-action dataset, which includes 847 VFSS samples from 71 subjects. We then introduce a coarse-to-fine mechanism to handle the short and repeated nature of micro-actions, which can significantly enhance micro-action localization accuracy. Moreover, we propose a Variable-Size Window Generator method, which improves the model's characterization performance and addresses the issue of different action timings, leading to further improvements in localization accuracy. The results of our experiments demonstrate the superiority of our method, with significantly improved performance (46.10% vs. 37.70%).
Automated detection of intake gestures with wearable sensors has been a critical area of research for advancing our understanding and ability to intervene in people's eating behavior. Numerous algorithms have been developed and evaluated in terms of accuracy. However, ensuring the system is not only accurate in making predictions but also efficient in doing so is critical for real-world deployment. Despite the growing research on accurate detection of intake gestures using wearables, many of these algorithms are often energy inefficient, impeding on-device deployment for continuous and real-time monitoring of diet. This paper presents a template-based optimized multicenter classifier that enables accurate intake gesture detection while maintaining low-inference time and energy consumption using a wrist-worn accelerometer and gyroscope. We designed an Intake Gesture Counter smartphone application (CountING) and validated the practicality of our algorithm against seven state-of-the-art approaches on three public datasets (In-lab FIC, Clemson, and OREBA). Compared with other methods, we achieved optimal accuracy (81.60% F1 score) and very low inference time (15.97 msec per 2.20-sec data sample) on the Clemson dataset, and among the top performing algorithms, we achieve comparable accuracy (83.0% F1 score compared with 85.6% in the top performing algorithm) but superior inference time (13.8x faster, 33.14 msec per 2.20-sec data sample) on the In-lab FIC dataset and comparable accuracy (83.40% F1 score compared with 88.10% in the top-performing algorithm) but superior inference time (33.9x faster, 16.71 msec inference time per 2.20-sec data sample) on the OREBA dataset. On average, our approach achieved a 25-hour battery lifetime (44% to 52% improvement over state-of-the-art approaches) when tested on a commercial smartwatch for continuous real-time detection. 
Our approach demonstrates an effective and efficient method, enabling real-time intake gesture detection using wrist-worn devices in longitudinal studies.
Wearable motion tracking sensors are now widely used to monitor physical activity, and have recently gained more attention in dietary monitoring research. The aim of this review is to synthesise research to date that utilises upper limb motion tracking sensors, either individually or in combination with other technologies (e.g., cameras, microphones), to objectively assess eating behaviour. Eleven electronic databases were searched in January 2019, and 653 distinct records were obtained. Including 10 studies found in backward and forward searches, a total of 69 studies met the inclusion criteria, with 28 published since 2017. Fifty studies were conducted exclusively in laboratory settings, 13 exclusively in free-living settings, and three in both settings. The most commonly used motion sensor was an accelerometer (64) worn on the wrist (60) or lower arm (5), while in most studies (45), accelerometers were used in combination with gyroscopes. Twenty-six studies used commercial-grade smartwatches or fitness bands, 11 used professional grade devices, and 32 used standalone sensor chipsets. The most used machine learning approaches were Support Vector Machine (SVM, n = 21), Random Forest (n = 19), Decision Tree (n = 16), Hidden Markov Model (HMM, n = 10) algorithms, and from 2017 Deep Learning (n = 5). While comparisons of the detection models are not valid due to the use of different datasets, the models that consider the sequential context of data across time, such as HMM and Deep Learning, show promising results for eating activity detection. We discuss opportunities for future research and emerging applications in the context of dietary assessment and monitoring.
Overweight and obesity are both associated with in-meal eating parameters such as eating speed. Recently, the plethora of available wearable devices in the market ignited the interest of both the scientific community and the industry towards unobtrusive solutions for eating behavior monitoring. In this paper we present an algorithm for automatically detecting the in-meal food intake cycles using the inertial signals (acceleration and orientation velocity) from an off-the-shelf smartwatch. We use 5 specific wrist micromovements to model the series of actions leading to and following an intake event (i.e. bite). Food intake detection is performed in two steps. In the first step we process windows of raw sensor streams and estimate their micromovement probability distributions by means of a Convolutional Neural Network (CNN). In the second step we use a Long-Short Term Memory (LSTM) network to capture the temporal evolution and classify sequences of windows as food intake cycles. Evaluation is performed using a challenging dataset of 21 meals from 12 subjects. In our experiments we compare the performance of our algorithm against three state-of-the-art approaches, where our approach achieves the highest F1 detection score (0.913 in the Leave-One-Subject-Out experiment). The dataset used in the experiments is available at
The rising prevalence of non-communicable diseases calls for more sophisticated approaches to support individuals in engaging in healthy lifestyle behaviors, particularly in terms of their dietary intake. Building on recent advances in information technology, user assistance systems hold the potential of combining active and passive data collection methods to monitor dietary intake and, subsequently, to support individuals in making better decisions about their diet. In this paper, we review the state-of-the-art in active and passive dietary monitoring along with the issues being faced. Building on this groundwork, we propose a research framework for user assistance systems that combine active and passive methods with three distinct levels of assistance. Finally, we outline a proof-of-concept study using video obtained from a 360-degree camera to automatically detect eating behavior from video data as a source of passive dietary monitoring for decision support.
Shared plate eating is a defining feature of the way food is consumed in some countries and cultures. Food may be portioned to another serving vessel or directly consumed into the mouth from a centralised dish rather than served individually onto a discrete plate for each person. Shared plate eating is common in some low- and lower-middle income countries (LLMIC). The aim of this narrative review was to synthesise research that has reported on the assessment of dietary intake from shared plate eating, investigate specific aspects such as individual portion size or consumption from shared plates and use of technology in order to guide future development work in this area. Variations of shared plate eating that were identified in this review included foods consumed directly from a central dish or shared plate food, served onto additional plates shared by two or more people. In some settings, a hierarchical sharing structure was reported whereby different family members eat in turn from the shared plate. A range of dietary assessment methods have been used in studies assessing shared plate eating with the most common being 24-h recalls. The tools reported as being used to assist in the quantification of food intake from shared plate eating included food photographs, portion size images, line drawings, and the carrying capacity of bread, which is often used rather than utensils. Overall few studies were identified that have assessed and reported on methods to assess shared plate eating, highlighting the identified gap in an area of research that is important in improving understanding of, and redressing dietary inadequacies in LLMIC.
Features learned by deep Convolutional Neural Networks (CNNs) have been recognized to be more robust and expressive than hand-crafted ones. They have been successfully used in different computer vision tasks such as object detection, pattern recognition and image understanding. Given a CNN architecture and a training procedure, the efficacy of the learned features depends on the domain-representativeness of the training examples. In this paper we investigate the use of CNN-based features for the purpose of food recognition and retrieval. To this end, we first introduce the Food-475 database, that is the largest publicly available food database with 475 food classes and 247,636 images obtained by merging four publicly available food databases. We then define the food-domain representativeness of different food databases in terms of the total number of images, number of classes of the domain and number of examples for class. Different features are then extracted from a CNN based on the Residual Network with 50 layers architecture and trained on food databases with diverse food-domain representativeness. We evaluate these features for the tasks of food classification and retrieval. Results demonstrate that the features extracted from the Food-475 database outperform the other ones showing that we need larger food databases in order to tackle the challenges in food recognition, and that the created database is a step forward toward this end.
As the success of deep models has led to their deployment in all areas of computer vision, it is increasingly important to understand how these representations work and what they are capturing. In this paper, we shed light on deep spatiotemporal representations by visualizing what two-stream models have learned in order to recognize actions in video. We show that local detectors for appearance and motion objects arise to form distributed representations for recognizing human actions. Key observations include the following. First, cross-stream fusion enables the learning of true spatiotemporal features rather than simply separate appearance and motion features. Second, the networks can learn local representations that are highly class specific, but also generic representations that can serve a range of classes. Third, throughout the hierarchy of the network, features become more abstract and show increasing invariance to aspects of the data that are unimportant to desired distinctions (e.g. motion patterns across various speeds). Fourth, visualizations can be used not only to shed light on learned representations, but also to reveal idiosyncrasies of training data and to explain failure cases of the system.
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
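The top-1 and top-5 error rates quoted in this abstract count a prediction as correct when the true class is among the k highest-scoring classes. A minimal sketch of that metric follows; the function name `top_k_error` and the toy three-class scores are illustrative assumptions, not data from the paper.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the
    k highest-scoring classes."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)

# Toy scores for four examples over three classes.
scores = [
    [0.1, 0.6, 0.3],
    [0.5, 0.2, 0.3],
    [0.2, 0.3, 0.5],
    [0.7, 0.2, 0.1],
]
labels = [1, 2, 2, 1]
top1 = top_k_error(scores, labels, k=1)  # 0.5: two of four top predictions are wrong
top2 = top_k_error(scores, labels, k=2)  # 0.0: every true label is in the top two
```

The error can only decrease as k grows, which is why a top-5 error is always at most the top-1 error for the same model.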