Single-stage intake gesture detection using
CTC loss and extended prefix beam search
Philipp V. Rouast, Member, IEEE, Marc T. P. Adam
Abstract—Accurate detection of individual intake gestures is
a key step towards automatic dietary monitoring. Both inertial
sensor data of wrist movements and video data depicting the
upper body have been used for this purpose. The most advanced
approaches to date use a two-stage approach, in which (i) frame-
level intake probabilities are learned from the sensor data using
a deep neural network, and then (ii) sparse intake events are
detected by finding the maxima of the frame-level probabilities.
In this study, we propose a single-stage approach which directly
decodes the probabilities learned from sensor data into sparse
intake detections. This is achieved by weakly supervised training
using Connectionist Temporal Classification (CTC) loss, and de-
coding using a novel extended prefix beam search decoding algo-
rithm. Benefits of this approach include (i) end-to-end training for
detections, (ii) simplified timing requirements for intake gesture
labels, and (iii) improved detection performance compared to
existing approaches. Across two separate datasets, we achieve
relative F1 score improvements between 1.9% and 6.2% over
the two-stage approach for intake detection and eating/drinking
detection tasks, for both video and inertial sensors.
Index Terms—Deep learning, CTC, intake gesture detection,
dietary monitoring, inertial and video sensors
I. INTRODUCTION
ACCURATE information on dietary intake forms the ba-
sis of assessing a person’s diet and delivering dietary
interventions. To date, such information is typically sourced
through memory recall or manual input, for example via dieti-
tians [1] or smartphone apps used to log meals. Such methods
are known to require substantial time and manual effort, and
are subject to human error [2]. Hence, recent research has
investigated how dietary monitoring can be partially automated
using sensor data and machine learning [3].
Detection of individual intake gestures in particular is a key
step towards automatic dietary monitoring. Wrist-worn inertial
sensors provide an unobtrusive way to detect these gestures.
Early work on the Clemson dataset, established in 2012, used
threshold values for detection from inertial data [4]. More
recent developments include the use of machine learning to
learn features automatically [5] and learning from video, which
has become more practical with emerging spherical camera
technology [6] [7]. Research on the OREBA dataset showed
that frontal video data can exhibit even higher accuracies in
detecting eating gestures than inertial data [8].
The two-stage approach introduced by Kyritsis et al. [9]
is currently the most advanced approach benchmarked on
publicly available datasets for both inertial [9] and video data
The authors are with the School of Electrical Engineering and Computing,
The University of Newcastle, Callaghan, NSW 2308, Australia. E-mail:
philipp.rouast@uon.edu.au, marc.adam@newcastle.edu.au.
[Fig. 1 bar chart omitted: F1 scores for intake detection and eat/drink detection on OREBA (inertial and video) and Clemson (inertial), comparing the SOTA [6] [10] (our implementations), our two-stage models, and our single-stage models.]
Fig. 1. F1 scores for our two-stage and single-stage models in comparison with the state of the art (SOTA). Our single-stage models see relative improvements between 3.3% and 17.7% over our implementations of the SOTA for inertial [10] and video modalities [6], and relative improvements between 1.9% and 6.2% over our own two-stage models for intake detection and eating/drinking detection across the OREBA and Clemson datasets.
[6]. It first estimates frame-level intake probabilities using
deep learning, which are then searched for maxima to detect
intake events. Thereby, the two-stage approach builds on a
predefined gap between intake gestures in the second stage.
In this paper, we propose a single-stage approach which
directly decodes the probabilities learned from sensor data
into sparse intake event detections. We achieve this by weakly
supervised training [11] of the underlying deep neural network
with Connectionist Temporal Classification (CTC) loss, and
decoding the probabilities using a novel extended prefix beam
search algorithm. Compared to the existing approaches in the
literature, our study makes four key contributions:
1) Single-stage approach. This is the first study that applies
a single-stage approach allowing for end-to-end training
with a loss function that directly addresses the intake
gesture detection task. Thereby, we avoid the predefined
gap between subsequent intake gestures in the second
stage of two-stage models [9] [6].
2) Simplified labels. The proposed approach requires infor-
mation about occurrence and order of intake gestures, but
not their exact timing. Hence, it is particularly suitable
for intake gestures, whose start and end times are fuzzy
in nature and time-consuming to determine.
3) Improved performance. Our single-stage models out-
perform two-stage models on the OREBA and Clemson
datasets, including the current state of the art (SOTA) [6]
[10] and two-stage versions of our models, see Fig. 1.
4) Intake gesture detection. This is the first study to
[Fig. 2 diagram omitted: the thresholding approach applied to the angular velocity (wrist roll) with thresholds T1 and T2; the two-stage approach with Stage I frame-level probability estimation and Stage II sequence-level detection using a predefined gap (e.g., ≥ 2 seconds); and the proposed single-stage approach with probability distribution estimation, extended prefix beam search, and collapse of the decoded alignment.]
Fig. 2. Comparing existing approaches (left, center) to the proposed approach (right): The thresholding approach [4] (left) searches the angular velocity for values that breach the thresholds T1 and T2. The two-stage approach [9] (center) independently estimates frame-level probabilities, which are then searched for maxima on the video level (generalized to two gesture classes here). The proposed single-stage approach (right) directly decodes the estimated probability distribution p(c|x_t) using extended prefix beam search, after which token sequences in the most probable alignment Â are collapsed to yield the result.
perform simultaneous localization and classification1 of intake gestures. While we use the example of eating and drinking, the approach could also be applied to more fine-grained analysis of dietary intake given appropriate data.
The remainder of the paper is organized as follows: In
Section II, we discuss the related literature on CTC and
intake gesture detection. Our proposed method is introduced
in Section III, including a complete pseudo-code listing of
our proposed decoding algorithm. We present and analyse
the evaluation of our proposed model and the SOTA on two
datasets in Section IV. Finally, we discuss the relative merits
of the single-stage and two-stage approaches in Section V and
conclude in Section VI.
II. RE LATE D RES EA RC H
A. Intake gesture detection
Intake gesture detection involves the determination of the
timestamps at which a person moved their hands to ingest
food or drink during an eating occasion. It is one of the
three elements of automatic dietary monitoring, which also
encompasses classification of the consumed type of food,
and estimation of the consumed quantity of food. Sensors
that carry a signal appropriate for the detection of intake
gestures include inertial sensors mounted to the wrist [12]
and video recordings [6]. Note that information on eating
events can also be derived from chewing and swallowing
monitored using audio [13] [14], electromyography [15] [16],
and piezoelectric sensors [17]. There are also other recent video-based approaches that use skeletal and mouth features [18] as well as food, hand, and face features [7] extracted using
deep learning. For inertial data, there is recent work on in-
the-wild monitoring [19]. In the following, we focus on two
main approaches for inertial and video data that have been
benchmarked on publicly available datasets:
1For the purpose of this study, gesture detection refers to temporal local-
ization and simultaneous classification of a gesture (e.g., as a generic intake
gesture, or as an eating or drinking gesture).
1) Thresholding approach: In 2012, Dong et al. [4] devised
an easily interpretable thresholding approach which requires
the angular velocity around the wrist to first surpass a positive
threshold (e.g., rolling one way to pick up food), and then
a negative threshold (e.g., rolling the other way to pass food
to the mouth). Refer to Fig. 2 (left) for an illustration. The
approach selects these thresholds and two further parameters
for minimum time amounts during and after a detection based
on an exhaustive search of the parameter space. Note that this
approach is not generalizable to multiple gesture classes.
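As a rough illustration of this scheme, the sketch below implements a simple threshold detector in Python. It follows our reading of the description above and is not the reference implementation; the parameter roles (T1 and T2 as thresholds, T3 and T4 as minimum durations) match the text, but the exact state handling in Dong et al. [4] may differ.

def threshold_detect(roll_velocity, fps, t1, t2, t3, t4):
    """Loose sketch of a thresholding detector (not the reference code):
    a gesture is detected when the wrist-roll velocity first exceeds t1 and
    later drops below t2 after at least t3 seconds; each detection is
    followed by a refractory period of t4 seconds."""
    detections, armed_at, blocked_until = [], None, -1.0
    for i, v in enumerate(roll_velocity):
        t = i / fps
        if t < blocked_until:
            continue
        if armed_at is None and v > t1:
            armed_at = t                          # positive threshold crossed
        elif armed_at is not None and v < t2 and t - armed_at >= t3:
            detections.append(t)                  # intake gesture detected
            armed_at, blocked_until = None, t + t4
    return detections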
2) Two-stage approach: Kyritsis et al. [9] proposed a two-
stage approach for detecting intake gestures from accelerom-
eter and gyroscope data. Rouast and Adam [6] later adopted
this approach for video data. In this approach, the first stage
produces frame-level estimates for the probability of intake
versus non-intake. These estimates are provided iteratively by
a neural network trained on a sliding two-second context. The
second stage identifies the sparse video-level intake gesture
timings by operating a thresholded maximum search on the
frame-level estimates, constrained by a minimum distance of
two seconds between detections. Fig. 2 (center) illustrates this
approach generalized to two intake gesture classes.
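For illustration, a Stage-2 style maximum search can be sketched with scipy's peak finding; this is not the authors' implementation, and the probability threshold shown here is an assumed placeholder rather than a published value.

import numpy as np
from scipy.signal import find_peaks

def stage2_detect(frame_probs, fps, threshold=0.9, min_gap_s=2.0):
    """Sketch of a thresholded maximum search over frame-level intake
    probabilities: keep local maxima above `threshold` that are at least
    `min_gap_s` seconds apart, and return their timestamps in seconds."""
    peaks, _ = find_peaks(np.asarray(frame_probs),
                          height=threshold,
                          distance=max(1, int(min_gap_s * fps)))
    return peaks / fps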
While this approach is also relatively easy to interpret and
works well in practice [19], there are a few aspects that need
to be considered. Firstly, the second stage requires a prede-
fined gap of two seconds between subsequent intake gestures.
This predefined gap implies that consecutive events occurring
within two seconds of each other lead to false negatives.
Secondly, the loss function during neural network training is
geared towards optimizing the frame-level predictions, not the
video-level detections. In the present work, we propose an
alternative approach by introducing a new single-stage training
and decoding approach using CTC – see Fig. 2 (right).
B. Connectionist temporal classification
In 2006, Graves et al. [20] proposed connectionist temporal
classification (CTC) to allow direct use of unsegmented input
data in sequence learning tasks with recurrent neural networks
(RNNs). By interpreting network output as a probability
distribution over all possible token sequences, they derived
CTC loss, which can be used to train the network via back-
propagation [21]. Hence, what sets CTC apart from previous
approaches is the ability to label entire sequences, as opposed
to producing labels independently in a frame-by-frame fashion.
While the original application of CTC was phoneme recog-
nition [20], researchers have applied it in various sequence
learning tasks such as end-to-end speech recognition [22],
handwriting recognition [23], and lipreading [24]. Further,
CTC has also been applied to sign language recognition
from wrist-worn inertial sensor data [25] [26]. In the most
closely related prior research to the present work, Huang
et al. [11] extended the CTC framework to enable weakly
supervised learning of actions from video, simplifying the
required labelling process. To date, CTC has been applied neither to temporal localization of actions from sensor data nor to intake gesture detection.
III. PROPOSED METHOD
Our proposed approach interprets the problem of intake gesture detection as a sequence labelling problem using CTC. This allows us to operate within a single-stage approach, meaning that inference is operationalized for a single time window of data, as exemplified in Fig. 3:
• A probability distribution over possible events for each time step is estimated using a neural network previously trained with CTC loss [20].
• These probabilities are decoded using extended prefix beam search and collapsed to derive the gesture timings.
We start by introducing the concept of alignments as well as
the CTC loss function. Then, we describe greedy decoding and
prefix beam search as alternative decoding algorithms which
provide the motivation for our extension. Finally, we introduce
the proposed extended prefix beam search.
A. Alignment between sensor data and labels
In many pattern recognition tasks involving the mapping of input sequences X to corresponding output sequences Y, we encounter challenges relating to the alignment between the elements of X and Y. This is because real-world sensor data cannot always be aligned with fixed-size tokens: In handwriting recognition, for example, some written letters in X are spatially wider than others, unlike the fixed-size tokens in Y [23]. A similar challenge arises in intake gesture detection, where gesture events can have various durations.
To account for the dynamic size of events in the input, we create an alignment A by using the token in question multiple times [27], such as in the example in Fig. 3. In addition, we introduce the blank token ε to allow separation of multiple instances of the same event class, A = [E, E, ε, E, E, D, D, D] in the example. We derive the token sequence Y from an alignment A by first collapsing repeated tokens and then removing the blank token. Hence, the token sequence for the example is Y = [E, E, D], which correctly reflects the ground truth label. Any one collapsed output token sequence Y can have many possible corresponding alignments A.
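To make the collapse rule concrete, here is a minimal Python sketch (not the authors' code) that collapses an alignment into its token sequence, using the example above.

def collapse(alignment, blank='eps'):
    """CTC collapse: merge repeated tokens, then remove blank tokens."""
    merged = [tok for i, tok in enumerate(alignment)
              if i == 0 or tok != alignment[i - 1]]
    return [tok for tok in merged if tok != blank]

# Example from Fig. 3: A = [E, E, eps, E, E, D, D, D] collapses to [E, E, D].
assert collapse(['E', 'E', 'eps', 'E', 'E', 'D', 'D', 'D']) == ['E', 'E', 'D']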
[Fig. 3 diagram omitted: an 8-frame example (t1, ..., t8) with ground truth "Eat, Eat, Drink", label alignment A_L = [E, E, ε, E, E, D, D, D] and collapsed sequence Y_L = [E, E, D]; the estimated probabilities p(c|x_t) are
  ε: 0.30 0.25 0.60 0.40 0.50 0.30 0.10 0.20
  E: 0.50 0.60 0.20 0.35 0.40 0.30 0.20 0.30
  D: 0.20 0.15 0.20 0.25 0.10 0.40 0.70 0.50
Greedy decoding yields Y_G = [E, D], prefix beam search yields Y_B = [E, E, D] without an alignment, and extended prefix beam search yields an alignment A_E that collapses to Y_E = [E, E, D].]
Fig. 3. An example with (1) a dataset represented by data and label with corresponding alignment A_L and collapsed token sequence Y_L, (2) the single-stage approach for intake gesture detection with estimated probabilities p(c|x_t), and alignments as well as collapsed token sequences produced by greedy decoding, prefix beam search, and extended prefix beam search. Note that finding the alignment A_E produced by extended prefix beam search is the key element missing for simple prefix beam search.
B. CTC loss for probability distribution estimation
Suppose we have an input sequence X of length T, the corresponding output token sequence Y, and possible tokens Σ. Our network is designed to express a probability estimate p(c|x_t) for each token c in Σ given the sensor input x_t at time t. Fig. 3 continues the previous example to show what the network output p(c|x_t) might look like. The objective of CTC loss is to minimize the negative log-likelihood of p(Y|X), which is the probability that the network predicts Y when presented with X. This probability can be expressed in the form given in Equation 1 [27], building on the individual tokens a_t in all valid alignments A_{X,Y} between X and Y.

p(Y|X) = \sum_{A \in A_{X,Y}} \prod_{t=1}^{T} p(c = a_t \mid x_t)    (1)

To train our single-stage networks for intake gesture detection, we use an implementation of CTC loss included in TensorFlow [28]. This training process can be characterized as weakly supervised, since it only requires the less restrictive collapsed labels Y, which do not include timing information besides occurrence and order of the tokens. An implication of using CTC loss is that our networks learn to make predictions differently than when trained with cross-entropy loss, as we explore further in Section IV-E. It also implies that examples are required to regularly contain multiple intake gestures for the network to learn properly (e.g., two eating and one drinking gesture in Fig. 3).
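As a minimal sketch of this training setup (not the authors' code), the snippet below applies TensorFlow's tf.nn.ctc_loss to per-frame logits and weak labels that encode only gesture occurrence and order; the tensor shapes and the choice of index 0 for the blank token are assumptions for illustration.

import tensorflow as tf

# Tokens: 0 = blank, 1 = eat, 2 = drink (assumed indexing for this sketch).
batch, time, num_tokens = 2, 16, 3
logits = tf.random.normal([batch, time, num_tokens])      # per-frame network output

# Weak labels: occurrence and order only, no timings.
# Example 1: [eat, eat, drink]; example 2: [eat] (zero padding is dropped).
labels = tf.sparse.from_dense(tf.constant([[1, 1, 2], [1, 0, 0]], dtype=tf.int32))

loss = tf.nn.ctc_loss(labels=labels,
                      logits=logits,
                      label_length=None,                   # not needed for sparse labels
                      logit_length=tf.fill([batch], time),
                      logits_time_major=False,
                      blank_index=0)
mean_loss = tf.reduce_mean(loss)                           # negative log-likelihood of p(Y|X)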
C. Greedy decoding
During inference, we decode the probabilities p(c|x_t) into a sequence of tokens Y. This can be interpreted as choosing an alignment A, which is then collapsed to Y. A fast and simple solution is greedy decoding, which chooses the alignment by selecting the maximum probability token at each time step t [27]. However, this method is not guaranteed to produce the most probable Y, since it does not take into account that each Y can have many possible alignments. In the example of Fig. 3, greedy decoding gives the alignment [E, E, ε, ε, ε, D, D, D], which collapses to [E, D]. Using Equation 1, we can compute that this is indeed an inferior solution to [E, E, D].2
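The sketch below (not the authors' code) reproduces this comparison on the Fig. 3 example: it decodes greedily and evaluates Equation 1 by brute-force enumeration over all alignments. The probability values are transcribed from Fig. 3; the paper's footnote reports p([E, D]|X) ≈ 0.0719 versus p([E, E, D]|X) ≈ 0.1305.

import itertools
import numpy as np

TOKENS = ['eps', 'E', 'D']                  # eps denotes the blank token
P = np.array([                              # p(c|x_t), rows eps/E/D, columns t1..t8
    [0.30, 0.25, 0.60, 0.40, 0.50, 0.30, 0.10, 0.20],
    [0.50, 0.60, 0.20, 0.35, 0.40, 0.30, 0.20, 0.30],
    [0.20, 0.15, 0.20, 0.25, 0.10, 0.40, 0.70, 0.50],
])

def collapse(alignment):
    merged = [a for i, a in enumerate(alignment) if i == 0 or a != alignment[i - 1]]
    return [a for a in merged if a != 'eps']

def greedy_decode(p):
    alignment = [TOKENS[i] for i in p.argmax(axis=0)]
    return alignment, collapse(alignment)

def sequence_probability(p, target):
    """Equation 1 by brute force: sum the path probabilities of all
    alignments that collapse to the target token sequence."""
    total = 0.0
    for path in itertools.product(range(len(TOKENS)), repeat=p.shape[1]):
        if collapse([TOKENS[c] for c in path]) == target:
            total += float(np.prod([p[c, t] for t, c in enumerate(path)]))
    return total

_, y_greedy = greedy_decode(P)                     # -> ['E', 'D']
p_ed = sequence_probability(P, ['E', 'D'])
p_eed = sequence_probability(P, ['E', 'E', 'D'])   # larger, despite greedy's choice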
D. Prefix beam search
Traversing all possible alignments turns out to be infeasible due to their large number [27]. The prefix beam search algorithm [20] uses dynamic programming to search for a token sequence Ŷ that maximises p(Ŷ|X). It presents a trade-off between computation and solution quality, which can be adjusted through the beam width k, determining how many possible solutions the algorithm remembers. Prefix beam search with a beam width of 1 is equivalent to greedy decoding. However, it is important to note that prefix beam search does not remember specific alignments. Hence, it is not possible to temporally localize intake events (see missing A_B in Fig. 3).
The algorithm determines beams in terms of prefixes ℓ (candidates for the output token sequence Ŷ up to time t), which are stored in a list Y. Each prefix is associated with two probabilities, the first of ending in a blank, p_b(ℓ|x_{1:t}), and the second of not ending in a blank, p_nb(ℓ|x_{1:t}). For each time step t, the algorithm updates the probabilities for every prefix in Y for the different cases of (i) adding a repeated token and (ii) adding a blank, and adds possible new prefixes. Due to the algorithm design, branches with equal prefixes are dynamically merged. The algorithm then keeps the k best updated prefixes.
E. Extended prefix beam search
Standard prefix beam search finds a token sequence Ŷ without retaining information about the alignments A_{X,Ŷ}. In order to infer the timing of the decoded events in a way consistent with CTC loss, the goal of our extended prefix beam search is to find Â, the most probable alignment that could have produced Ŷ, as expressed by Equation 2.

\hat{A} = \arg\max_{A \in A_{X,\hat{Y}}} \prod_{t=1}^{T} p(c = a_t \mid x_t)    (2)

Instead of running a separate algorithm based on Ŷ, we search for Â simultaneously as part of prefix beam search, which already includes most of the necessary computation. We add two additional lists for each beam ℓ, A_b(ℓ) and A_nb(ℓ), which store alignment candidates that resolve to ℓ as well as their corresponding probabilities. Every time a probability is updated in prefix beam search, we add new alignment candidates and associated probabilities to the appropriate lists.
2 Specifically, applying CTC loss to the numerical example in Fig. 3, we find that p([E, D]|X) ≈ 0.0719 < 0.1305 ≈ p([E, E, D]|X).
TABLE I
ARCHITECTURES FOR OUR SINGLE-STAGE AND TWO-STAGE MODELS

Video: ResNet-50 CNN-LSTM (OREBA)
  Layer    params                              output size
  data     -                                   16 × 128² × 3
  conv1    5², 64, stride 1²                   16 × 128² × 64
  pool1    2², stride 2²                       16 × 64² × 64
  conv2    [1², 64; 3², 64; 1², 256] × 3       16 × 64² × 256
  conv3    [1², 128; 3², 128; 1², 512] × 4     16 × 32² × 512
  conv4    [1², 256; 3², 256; 1², 1024] × 6    16 × 16² × 1024
  conv5    [1², 512; 3², 512; 1², 2048] × 3    16 × 8² × 2048
  pool     -                                   16 × 2048
  lstm     -                                   16 × 128
  dense^a  -                                   16 × |Σ|

Inertial: ResNet-10 CNN-LSTM (OREBA / Clemson)
  Layer    params              output size (OREBA)   output size (Clemson)
  data     -                   512 × 12               120 × 6
  conv1    1, 64, stride 1     512 × 64               120 × 64
  conv2    [3, 64; 3, 64]      512 × 64               120 × 64
  conv3    [3, 128; 3, 128]    256 × 128              120 × 128
  conv4    [5, 256; 5, 256]    128 × 256              60 × 256
  conv5    [5, 512; 5, 512]    64 × 512               60 × 512
  lstm     -                   64 × 64                60 × 64
  dense^a  -                   64 × |Σ|               60 × |Σ|

^a Σ includes the blank token, hence |Σ| = 2 for generic intake gesture detection and |Σ| = 3 for detection of eating and drinking gestures.
This includes (i) adding a repeated token, (ii) adding a blank token, and (iii) adding a token that extends the prefix. The algorithm design implies that if two beams with identical prefixes are merged, their alignment candidates are also merged dynamically. At the end of each time step t, we resolve the alignment candidates for each ℓ in Y by choosing the highest probability for each A_b(ℓ) and A_nb(ℓ). Finally, for each of the k best token sequences in Y, the best alignment candidate Â is chosen as the more probable one out of A_b(ℓ) and A_nb(ℓ).
We created a Python implementation3 of the pseudo-code shown in Algorithm 1. Note that this version is not written with efficiency in mind. For our experiments, we implemented a more efficient version4 as a C++ TensorFlow kernel.
F. Network architectures
Although they are trained with different loss functions, both
the single-stage and two-stage approaches each rely on an
underlying deep neural network which estimates probabili-
ties. We choose adapted versions of the ResNet architecture
[29]. Our video network is a CNN-LSTM with a ResNet-50
backbone adjusted for our video resolution. For inertial data,
we use a CNN-LSTM with a ResNet-10 backbone using 1D
convolutions. Table I reports the parameters and output sizes
for all layers.
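For illustration, the sketch below builds a 1D ResNet-style CNN-LSTM for inertial input in Keras. It follows the overall layout of Table I (strided residual blocks followed by an LSTM and a per-frame dense layer over Σ), but kernel sizes, normalization, and block internals are our assumptions rather than the exact published configuration.

import tensorflow as tf
from tensorflow.keras import layers

def residual_block_1d(x, filters, kernel_size, stride=1):
    """Basic 1D residual block with a projection shortcut where needed."""
    shortcut = x
    y = layers.Conv1D(filters, kernel_size, strides=stride, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv1D(filters, kernel_size, padding='same')(y)
    y = layers.BatchNormalization()(y)
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = layers.Conv1D(filters, 1, strides=stride, padding='same')(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))

def build_inertial_model(seq_len=512, channels=12, num_tokens=3):
    """Sketch of a ResNet-10-style CNN-LSTM: 512 input frames are reduced to
    64 output frames, each with logits over Sigma (blank, eat, drink)."""
    inputs = tf.keras.Input(shape=(seq_len, channels))
    x = layers.Conv1D(64, 7, padding='same')(inputs)        # kernel size assumed
    x = residual_block_1d(x, 64, 3)
    x = residual_block_1d(x, 128, 3, stride=2)
    x = residual_block_1d(x, 256, 5, stride=2)
    x = residual_block_1d(x, 512, 5, stride=2)
    x = layers.LSTM(64, return_sequences=True)(x)
    logits = layers.Dense(num_tokens)(x)                     # per-frame logits
    return tf.keras.Model(inputs, logits)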
3See https://gist.github.com/prouast/a73354a7586cc6bc444d2013001616b7
4Available at https://github.com/prouast/ctc-beam-search-op
Algorithm 1: Extended prefix beam search algorithm (loosely based on [30]): The algorithm stores current prefixes in Y. Probabilities are stored and updated in terms of prefixes ending in blank, p_b(ℓ|x_{1:t}), and non-blank, p_nb(ℓ|x_{1:t}), facilitating dynamic merging of beams with identical prefixes. The empty set ∅ is used to initialize Y and associated with probability 1 for blank and 0 for non-blank. A_b(ℓ) and A_nb(ℓ) store the current candidates for alignments (ending in blank and non-blank) pertaining to prefix ℓ, along with their probabilities. They are likewise initialized for the empty prefix. The algorithm then loops over the time steps, updating the prefixes and associated alignments. Each current candidate ℓ is re-entered into the new prefixes Y', adjusting the probabilities for repeated tokens and added blanks. The corresponding alignment candidates and their probabilities are added to the new alignment candidates A'_nb(ℓ) and A'_b(ℓ). Furthermore, for each non-blank token in Σ, a new prefix is created by concatenation, the probability is updated, and corresponding alignment candidates are added. At the end of each time step, we set Y to the k most probable prefixes in Y' and resolve the alignment candidates for each of those prefixes as the most probable ones. Finally, for each of the k best token sequences in Y, the best alignment candidate is chosen as the more probable one out of A_b(ℓ) and A_nb(ℓ).
Data: Probability distributions p(c|x_t) for tokens c ∈ Σ in sensor data x_t from t = 1, ..., T.
Result: k best decoded sequences of tokens Y and best corresponding alignments A.

p_b(∅|x_{1:0}) ← 1,  p_nb(∅|x_{1:0}) ← 0
Y ← {∅}
A_b(∅) ← {(∅, 1)},  A_nb(∅) ← {(∅, 1)}
for t = 1, ..., T do
    Y' ← {}
    A'_b(·) ← {},  A'_nb(·) ← {}
    for ℓ in Y do
        if ℓ ∉ Y' then
            add ℓ to Y'
        end
        if ℓ ≠ ∅ then
            p_nb(ℓ|x_{1:t}) ← p_nb(ℓ|x_{1:t}) + p_nb(ℓ|x_{1:t-1}) · p(ℓ_{|ℓ|}|x_{1:t})
            add (concatenate A_nb(ℓ) and ℓ_{|ℓ|}, p(A_nb(ℓ)) · p(ℓ_{|ℓ|}|x_{1:t})) to A'_nb(ℓ)
        end
        p_b(ℓ|x_{1:t}) ← p_b(ℓ|x_{1:t}) + p(ε|x_{1:t}) · (p_b(ℓ|x_{1:t-1}) + p_nb(ℓ|x_{1:t-1}))
        add (concatenate A_b(ℓ) and ε, p(A_b(ℓ)) · p(ε|x_{1:t})) to A'_b(ℓ)
        add (concatenate A_nb(ℓ) and ε, p(A_nb(ℓ)) · p(ε|x_{1:t})) to A'_b(ℓ)
        for c in Σ \ ε do
            ℓ+ ← concatenate ℓ and c
            add ℓ+ to Y'
            if ℓ ≠ ∅ and c = ℓ_{|ℓ|} then
                p_nb(ℓ+|x_{1:t}) ← p_nb(ℓ+|x_{1:t}) + p_b(ℓ|x_{1:t-1}) · p(c|x_{1:t})
                add (concatenate A_b(ℓ) and c, p(A_b(ℓ)) · p(c|x_{1:t})) to A'_nb(ℓ+)
            else
                p_nb(ℓ+|x_{1:t}) ← p_nb(ℓ+|x_{1:t}) + p(c|x_{1:t}) · (p_b(ℓ|x_{1:t-1}) + p_nb(ℓ|x_{1:t-1}))
                add (concatenate A_b(ℓ) and c, p(A_b(ℓ)) · p(c|x_{1:t})) to A'_nb(ℓ+)
                add (concatenate A_nb(ℓ) and c, p(A_nb(ℓ)) · p(c|x_{1:t})) to A'_nb(ℓ+)
            end
        end
    end
    Y ← k most probable prefixes in Y'
    for ℓ in Y do
        A_b(ℓ) ← the most probable sequence in A'_b(ℓ)
        A_nb(ℓ) ← the most probable sequence in A'_nb(ℓ)
    end
end
for ℓ in Y do
    A(ℓ) ← the most probable sequence in {A_b(ℓ), A_nb(ℓ)}
end
return Y, A
IV. EXPERIMENTS AND ANALYSIS
In the experiments, we compare the proposed single-stage
approach to the thresholding [4] and the two-stage approach
[9] [10] using two datasets of annotated intake gestures
(OREBA [6] and Clemson [31]). To this day, these are the
largest publicly available datasets for intake gesture detection.
For both datasets, we attempt detection of generic intake
gestures, as well as detection of eating and drinking gestures.
Across our experiments, we use time windows of 8 seconds,
which ensures that examples regularly contain multiple intake
events. All code used for the experiments is available at
https://github.com/prouast/ctc-intake-detection.
A. Approaches
1) Thresholding approach: We implemented the threshold-
ing approach with four parameters as described by Dong et
al. [4] and Shen et al. [31], which only relies on angular
velocity (wrist roll). For each dataset, we used the training
set to estimate the parameters T1, T2, T3, and T4.
2) Two-stage approach: SOTA results on OREBA [6] [10]
are based on 2 second time windows, which is not sufficient
for the single-stage approach. Hence, to facilitate a fair com-
parison, we also train several two-stage models based on 8
second time windows. In particular, we use cross-entropy loss
to train two-stage versions of our own architectures outlined in
Table I, as well as the architectures proposed in Heydarian et
al. [10], Rouast et al. [6], and the adapted version of Kyritsis et
al. [9] used in [10]. Note that the latter was originally designed
to be trained with additional sub-gesture labels which are not
available for the Clemson and OREBA datasets. These models
are trained with cross-entropy loss. Detections on the video
level are reported according to the Stage 2 maximum search
algorithm by [9]. To facilitate multi-class comparison, we also
extend the Stage 2 search by applying the same threshold to
both intake gesture classes.
3) Single-stage approach: Our single-stage models are
trained using CTC loss [20]. One caveat of the single-stage
approach is that it requires a longer time window than Stage
1 of the two-stage approach. This is to ensure that multiple
gestures regularly appear in the training examples, providing
a signal for learning temporal relations. At the same time,
due to memory restrictions for the video model, longer time
windows come with the drawback of having to reduce the
sampling rate of the input data. In light of this tradeoff, we considered different configurations and ultimately decided on a window size of 8 seconds.5 For inference, the probabilities estimated for each temporal segment are decoded into an alignment using extended prefix beam search, and then collapsed to yield event detections. Based on an analysis on the validation set (see Section IV-F), we used a beam width of 3. On the video level, we first aggregate the individual alignments of the sliding windows using frame-wise majority voting before collapsing the result into detections.
5A window size of 8 seconds allows a video sampling rate of 2 fps and
translates into a 74.7% chance of seeing at least one example with multiple
gestures per batch during video model training on OREBA. Details on the
considered window sizes can be found in the Supplemental Material S1.
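A minimal sketch of this aggregation step (our reading, not the authors' code): per-frame tokens decoded from overlapping windows are combined by majority vote and then collapsed into sparse detections.

def aggregate_alignments(window_alignments, window_starts, total_frames, blank=0):
    """Frame-wise majority voting over overlapping window alignments,
    followed by CTC-style collapse into sparse detections."""
    votes = [[] for _ in range(total_frames)]
    for alignment, start in zip(window_alignments, window_starts):
        for offset, token in enumerate(alignment):
            frame = start + offset
            if 0 <= frame < total_frames:
                votes[frame].append(token)
    # Majority token per frame; frames without any vote default to blank.
    merged = [max(set(v), key=v.count) if v else blank for v in votes]
    # Collapse: keep a non-blank token only when it starts a new run.
    detections = [(i, tok) for i, tok in enumerate(merged)
                  if tok != blank and (i == 0 or tok != merged[i - 1])]
    return merged, detections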
[Fig. 4 timeline illustration omitted; see caption below.]
Fig. 4. The evaluation scheme (proposed by [9]; figure adapted from [6]). (1)
A true positive is the first detection within each ground truth event; (2) False
positives of type 1 are further detections within the same ground truth event;
(3) False positives of type 2 are detections outside ground truth events; (4)
False positives of type 3 are detections made for the wrong class if applicable;
(5) False negatives are non-detected ground truth events.
B. Training and evaluation metrics
1) Training: All networks are trained using the Adam
optimizer on the respective training set with batch size 128 for
inertial and 16 for video. We use an exponentially decreasing
learning rate starting at 1e-3, except for the SOTA implemen-
tations where we use the learning rate settings reported by
the authors [10] [9] [6]. We also use minibatch loss scaling,
analogously to [6]. Hyperparameter and model selection is
based on the validation set.
2) Evaluation: For comparison we use the F1 measure, applying an extension of the evaluation scheme by Kyritsis et al. [9] (see Fig. 4). The scheme uses the ground truth to translate sparse detections into measurable metrics for a given label class. As Rouast and Adam [6] report, one correct detection per ground truth event counts as a true positive (TP), while further detections within the same ground truth event are false positives of type 1 (FP1). Detections outside ground truth events are false positives of type 2 (FP2), and non-detected ground truth events count as false negatives (FN). We extended the original scheme to support the multi-class case, where detections of a wrong class are false positives of type 3 (FP3). Using the aggregate counts, we calculate precision, recall, and F1.
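The sketch below shows one way this counting scheme can be implemented (our reading of Fig. 4, not the authors' evaluation code); precision uses all false positive types, recall uses false negatives, and F1 is their harmonic mean.

def evaluate_detections(gt_events, detections):
    """gt_events: list of (start, end, label); detections: list of (time, label).
    Returns TP, FP1, FP2, FP3, FN counts and precision/recall/F1."""
    tp = fp1 = fp2 = fp3 = 0
    matched = [False] * len(gt_events)
    for time, label in sorted(detections):
        idx = next((i for i, (s, e, _) in enumerate(gt_events) if s <= time <= e), None)
        if idx is None:
            fp2 += 1                              # outside any ground truth event
        elif gt_events[idx][2] != label:
            fp3 += 1                              # detection with the wrong class
        elif matched[idx]:
            fp1 += 1                              # further detection in same event
        else:
            matched[idx] = True
            tp += 1                               # first correct detection in event
    fn = matched.count(False)
    precision = tp / max(tp + fp1 + fp2 + fp3, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return dict(TP=tp, FP1=fp1, FP2=fp2, FP3=fp3, FN=fn,
                precision=precision, recall=recall, F1=f1)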
C. Datasets
1) OREBA: The OREBA dataset [8] includes inertial and
video data. This dataset was approved by the IRB at The
University of Newcastle on 10 September 2017 (H-2017-
0208). Specifically, we use the OREBA-DIS scenario with data
for 100 participants (69 male, 31 female) and 4790 annotated
intake gestures. The split suggested by the dataset authors [8]
includes training, validation, and test sets of 61, 20, and 19
participants. For the inertial models, we use the processed6
accelerometer and gyroscope data from both wrists at 64 Hz (8
seconds correspond to 512 frames). For the video models, we
downsample the 140x140 pixel recordings from 24 fps to 2 fps
(8 seconds correspond to 16 frames). For data augmentation,
we use random mirroring of the wrist for inertial data and
the same steps as [6] for video data, which includes spatial
cropping to 128x128 pixels.
2) Clemson: The publicly available Clemson dataset [31]
consists of 488 annotated eating sessions across 264 partici-
pants (127 male, 137 female), a total of 20644 intake gestures.
6Processing includes mirroring for data uniformity, removal of the gravity
effect using Madgwick’s filter [32], and standardization.
TABLE II
RESULTS FOR THE OREBA AND CLEMSON DATASETS (TEST SET)

                                                              Dataset   Modality   Intake    (E)ating and (D)rinking gestures
Method                                                                             F1        F1^E    F1^D    F1^ED
Thresholding [4] (T1=25, T2=-25, T3=2, T4=2, 64 Hz)           OREBA     Inertial   0.275     -       -       -
Two-stage CNN-LSTM [10] (2 sec @ 64 Hz)^a                     OREBA     Inertial   0.778     -       -       -
Two-stage CNN-LSTM [9] (our implementation, 8 sec @ 64 Hz)    OREBA     Inertial   0.740     0.732   0.657   0.726
Two-stage CNN-LSTM [10] (our implementation, 8 sec @ 64 Hz)   OREBA     Inertial   0.799     0.772   0.696   0.765
Two-stage ResNet-10 CNN-LSTM (ours, 8 sec @ 64 Hz)            OREBA     Inertial   0.831     0.798   0.638   0.783
Single-stage ResNet-10 CNN-LSTM (ours, 8 sec @ 64 Hz)         OREBA     Inertial   0.855     0.837   0.770   0.832
Two-stage ResNet-50 SlowFast [6] (2 sec @ 8 fps)^a            OREBA     Video      0.853     -       -       -
Two-stage ResNet-50 SlowFast [6] (our impl., 8 sec @ 2 fps)   OREBA     Video      0.793     0.751   0.566   0.730
Two-stage ResNet-50 CNN-LSTM (ours, 8 sec @ 2 fps)            OREBA     Video      0.858     0.841   0.859   0.843
Single-stage ResNet-50 CNN-LSTM (ours, 8 sec @ 2 fps)         OREBA     Video      0.875     0.869   0.761   0.859
Thresholding [4] (T1=15, T2=-15, T3=1, T4=4, 15 Hz)           Clemson   Inertial   0.362     -       -       -
Two-stage CNN-LSTM [9] (our implementation, 8 sec @ 15 Hz)    Clemson   Inertial   0.728     0.673   0.641   0.668
Two-stage CNN-LSTM [10] (our implementation, 8 sec @ 15 Hz)   Clemson   Inertial   0.783     0.680   0.697   0.683
Two-stage ResNet-10 CNN-LSTM (ours, 8 sec @ 15 Hz)            Clemson   Inertial   0.781     0.743   0.733   0.741
Single-stage ResNet-10 CNN-LSTM (ours, 8 sec @ 15 Hz)         Clemson   Inertial   0.808     0.773   0.863   0.783

^a Test set results as reported in [8]. These models use time windows of 2 seconds, while single-stage models require 8 seconds due to their nature.
Sensor data for accelerometer and gyroscope is available for
the dominant hand at 15 Hz (8 seconds correspond to 120
frames). We apply the same preprocessing and data augmen-
tation as for OREBA. We split the sessions into training,
validation, and test sets (302, 93 and 93 sessions respectively)
such that each participant appears in only one of the three (see
Supplementary Material S3). Note that because the Clemson dataset does not specify a dataset split, an alternative approach to testing our models would have been k-fold cross-testing. We decided on a specific split because (1) there is no data scarcity in the Clemson dataset that would require k-fold cross-testing, and (2) applying k-fold cross-testing on the Clemson dataset would be prohibitively expensive. A shortcoming of this approach is that the results reported in Table II only reflect a test set selected by ourselves, not by the original dataset authors.
D. Results
Results are listed in Table II, and extended results with
detailed metrics are available in Supplementary Material S2.
1) Detecting intake gestures: The results for detecting only one generic intake event class are displayed in the center column of Table II. We can see that the single-stage approach generally yields higher performance than the thresholding and two-stage approaches: Relative improvements range between 2.0% (0.858 → 0.875) and 3.5% (0.781 → 0.808) over two-stage versions of our own architectures, and between 3.3% (0.783 → 0.808) and 10.4% (0.793 → 0.875) over our implementations of the SOTA.
For OREBA, we can additionally refer to previously published SOTA results based on 2 second windows. Relative improvements over these results for the inertial [10] and video [6] modalities equal 10.0% (0.778 → 0.855) and 2.6% (0.853 → 0.875), respectively.
For Clemson, we are not aware of any SOTA models other
than the thresholding approach [4] [31]. It is not surprising that
both the two-stage and single-stage approach outperform the
thresholding approach by a large margin. Thresholding exclu-
sively relies on one gyroscope channel, while the deep learning
models build on a larger number of parameters. Consistent with the OREBA results, we find that the single-stage approach yields a relative improvement of 3.5% (0.781 → 0.808) over the two-stage models on the Clemson dataset. It is worth noting that the F1 scores are generally lower for Clemson than for OREBA, indicating that it is more challenging for
intake gesture detection. However, this may be related to the
lower sampling rate in Clemson and the fact that data for both
wrists is available for OREBA, while only the dominant wrist
is included in Clemson.
2) Detecting eating and drinking gestures: This task con-
sists of localization and simultaneous classification of intake
gestures as either eating or drinking. As there are no previously
published results for this more fine-grained classification on
either dataset, we rely on comparison between the separately
trained single-stage and two-stage versions of our own models,
as well as our implementations of the SOTA. In the right-hand columns of Table II, we report separate F1 scores for eating and drinking individually, as well as for both together.
Three main observations emerge. Firstly, the single-stage approach outperforms the two-stage approach to an even larger extent for this task: Relative improvements range from 1.9% (0.843 → 0.859) to 6.2% (0.783 → 0.832) over two-stage versions of our own architectures, and from 8.7% (0.765 → 0.832) to 17.7% (0.730 → 0.859) over our implementations of SOTA architectures. Secondly, the increased difficulty of this task compared to the generic detection task is noticeable in the difference between the F1 and F1^ED scores, an average decrease of 3.7% for OREBA and 7.3% for Clemson. Thirdly, there are generally few misclassifications between eating and drinking. As indicated by Table III, the frequency of false positives of type 2 is higher than the frequency of false positives of type 3 by almost two orders of magnitude.
Overall, the single-stage video models achieve the best results on OREBA. However, when focusing specifically on drinking detection, the two-stage video model achieves a better result. This may be due to the low number of drinking gestures in the test set, which causes F1^D to vary randomly from F1^ED for several of the models in Table II.
[Fig. 5 plots omitted: video frames, accelerometer, gyroscope, and ground truth label for one window, followed by predicted probabilities p for four models: Two-stage ResNet-50 CNN-LSTM (Video) with cross-entropy loss, Single-stage ResNet-50 CNN-LSTM (Video) with CTC loss, Two-stage ResNet-10 CNN-LSTM (Inertial) with cross-entropy loss, and Single-stage ResNet-10 CNN-LSTM (Inertial) with CTC loss; eat and drink classes shown over 1 to 7 seconds.]
Fig. 5. Illustrating the effect of training with CTC loss or cross-entropy loss using input data, label, and model predictions for one 8 second example from the OREBA validation set.
TABLE III
AVERAGED RESULTS ACROSS ALL EXPERIMENTS (TEST SET). NUMBERS OF TP, FP1, FP2, FP3, AND FN ARE EXPRESSED AS PERCENTAGES OF THE RESPECTIVE GROUND TRUTH NUMBER OF GESTURES TO FACILITATE COMPARISONS.

Method                                       TP [%]   FP1 [%]   FP2 [%]   FP3 [%]   FN [%]   F1
Two-stage                                    76.39    2.15      10.80     0.17      23.61    0.8063
Single-stage, greedy decoding                79.48    0.48      10.53     0.15      20.52    0.8341
Single-stage, extended prefix beam search    80.58    0.49      11.76     0.15      19.42    0.8355
E. Effect of training with CTC loss or cross-entropy loss
During our introduction of CTC loss in Section III-B, we mentioned that weakly supervised training with CTC causes our networks to learn a different approach to detecting events than cross-entropy loss. We can think of cross-entropy loss as causing the network to predict, for each frame, whether it falls anywhere within the gesture being detected. The analogous way of thinking about CTC loss is that the network predicts which frames are most distinctive of the gesture being detected. This causes the predictions of our single-stage models to look more like probability spikes, while the two-stage models produce sequences of high probability values.
We illustrate this characteristic difference between the
single-stage and two-stage approaches in Fig. 5 using an
example from the validation set of OREBA for eating and
drinking detection. Here, time-synchronized 2 fps video and
64 Hz inertial data (dominant hand) for one 8 second time
window are plotted alongside the ground truth and predictions
of the corresponding two-stage and single-stage models. Note
that the output frequencies of the models differ, with 2 Hz
for the video models and 8 Hz for the inertial models. We
observe that the predictions by the two-stage models indeed
mimic the ground truth, while the single-stage models produce
probability spikes. Furthermore, these probability spikes line up temporally with the patterns of the gestures that appear most distinctive to the human eye.
For a broader view of these characteristic differences be-
tween the single-stage and two-stage models, we use linear
interpolation to aggregate the probabilities within all true
positives in the validation set on a unitless timescale. The
distributions displayed in Fig. 6 confirm that the two-stage
models mimic the ground truth, while the probability spikes for
[Fig. 6 plots omitted: aggregated probability curves for (a) Inertial (OREBA), (b) Video (OREBA), and (c) Inertial (Clemson), with lines for Eat (Single-stage), Drink (Single-stage), Eat (Two-stage), and Drink (Two-stage).]
Fig. 6. Aggregating the predicted probabilities within all eating and drinking events in the validation sets of OREBA and Clemson. Probabilities are aligned in time and linearly interpolated, based on which we plot the mean and [q25, q75] interval. The characteristic peaks for single-stage models trained on inertial data appear to be clustered in the second half of ground truth events, while they mainly fall in the first half for models trained on video data.
single-stage models seem to be clustered in regions specific to
the sensor modality. While the probability spikes for the video
models tend to fall in the first half of the ground truth events,
those for the inertial models appear mainly in the second
half. This lends itself to the interpretation that video models
target the frame in which ingestion takes place or the mouth is
open (relatively early in the ground truth event), while inertial
models leverage the characteristic downwards motion when
finishing the intake gesture (relatively late).
When averaging the results across all datasets and tasks as
reported in Table III, it becomes clear that training with CTC
loss accounts for the majority of the improvement of single-
stage models over two-stage models. The effect of training
with CTC loss manifests itself in a higher true positive rate
and an associated lower false negative rate. Furthermore, there
is a significant drop in false positives of type 1, which were
previously conjectured to be a restriction of the two-stage
approach [6]. In particular, the single-stage approach avoids
the predefined 2 second gap in Stage 2 of the two-stage
approach and is thus less likely to lead to false positives of
type 1 for gestures with a long duration.
F. Difference between Greedy decoding and Extended beam
search decoding
Recall that greedy decoding only considers the maximum
probability token at each time step, which is equal to extended
prefix beam search decoding with a beam width of 1. As
we increase the beam width, the algorithm considers more
possible alignments and combines their probabilities if they
lead to the same output sequence. In theory, this means that the
results produced by the extended prefix beam search decoding
with a higher beam width better reflect the network’s intended
output than greedy decoding, since they are computed in the
same way as CTC loss works internally.
[Fig. 7 plot omitted: relative F1 change (0.1% to 0.5%) versus beam width (1, 2, 3, 5, 10) for generic intake gestures (|Σ| = 2) and eat and drink gestures (|Σ| = 3).]
Fig. 7. Average relative F1 change with standard deviation when choosing different beam widths for decoding our models on the validation set. The base scenario is a beam width of 1, which corresponds to greedy decoding. We observe that extended prefix beam search decoding mainly benefits the models for eating and drinking detection, and that there are no improvements for beam widths greater than 3.
To analyze the effect of different beam widths on the F1
score and determine a beam width to use in our experiments,
we decode our trained networks with different beam widths
on the validation set. As illustrated in Fig. 7, the effect of extended prefix beam search decoding is small: a relative improvement of only 0.25% on average. In fact, there is no improvement for beam widths over 3, and hence we chose a beam width of 3 for decoding on the test set.
An explanation for these numbers may lie in the few classes
(i.e., only one or two types of gestures to be detected) and the
associated relatively low uncertainty exhibited by our scenario
(i.e., limited variety of foods and environments). This is also
indicated by the low rate of false positives of type 3 in Table
III and the high prediction confidences in Fig. 5. It is well
known that greedy decoding can work well as a heuristic
in cases where most of the probability mass is allotted to a
single alignment [27]. It is evident from Fig. 7 that higher
beam widths mainly benefited our task on eating and drinking
gestures, which has one extra class and hence inherently
carries more uncertainty. Following this line of thought, it
seems likely that the extended prefix beam search algorithm
could lead to higher benefits over greedy decoding for datasets
with more diverse labels and scenarios.
V. DIS CU SS IO N
It is important to note that even though our implementations
of the single-stage approach exhibit performance improve-
ments compared to the two-stage approach, there are also
several other differences between the two approaches that need
to be considered in their application in research and practice.
First, the single-stage approach does not require detailed
labels for the start and end timestamp of an intake gesture,
but only a label for its apex. These simplified labels can
assist in reducing the effort in labelling new datasets or
applying the approach in contexts where there are constraints
on the sampling rate of the ground truth label (e.g. time-lapse
recordings in field settings).
Second, while the probabilities provided by the two-stage
approach align closely with the entire duration of the intake
gesture as provided by the ground truth label, the single-
stage approach only yields individual spikes within the intake
gesture (see Fig. 5). As such, the information provided by two-stage models is richer in the sense that it allows estimating the duration of a gesture as well as the timing between gestures, which is not possible with the single-stage approach.
For instance, if the spike in one gesture is towards the start
of the ground truth event, and the spike in the subsequent
gesture is towards the end, one would overestimate the gap
between these gestures. In other words, the simplified labels
of the single-stage approach come with the caveat of simplified
information in its predictions.
Third, both approaches rely on specific yet different as-
sumptions related to the duration of eating gestures. While the
two-stage approach relies on a predefined gap between intake
events (e.g., 2 seconds in [9] [6]), the single-stage approach
requires a window that is sufficiently large to likely capture a
sequence of at least two intake gestures (e.g., 8 seconds). The
predefined gap of the two-stage approach creates the potential
of inadvertently rejecting local probability maxima that are
too close to each other. By contrast, the large window of the
single-stage approach comes with the drawback of increased
memory requirements which are also reflected in the choice
of 2 fps for the video models.
VI. CO NC LU SI ON
In this paper, we introduced a single-stage approach to
detect intake gestures. This is achieved by weakly supervised
training of a deep neural network with CTC loss and decoding
using a novel extended prefix beam search decoding algorithm.
Using CTC loss instead of cross-entropy loss allows us to
interpret intake gesture detection as a sequence labelling
problem, where the network labels an entire sequence as
opposed to doing this independently in a frame-by-frame
fashion. Additionally, we are the first to attempt simultaneous
detection of intake gestures and distinction between eating and
drinking using deep learning. We demonstrate improvements
over the established two-stage approach [9] [6] using two
datasets. These improvements apply to both generic intake
gesture detection and eating/drinking detection tasks, and also
to both video and inertial sensor data.
The proposed extended prefix beam search decoding al-
gorithm is the second novel element in this context besides
CTC loss. This algorithm allows us to decode the probability
estimate provided by the deep neural network in a way that is
consistent with the computation of CTC loss. However, despite
the theoretical benefits of this algorithm, our results show that
training with CTC loss accounts for the lion’s share of the
improvements we see over the two-stage approach. This could
be explained by the low number of classes for the datasets and
tasks considered here. Greedy decoding can hence be seen as
a fast baseline alternative. It remains to be seen in future work
whether extended prefix beam search decoding is more useful
when working with a larger number of classes and higher
associated uncertainty.
While we used the CNN-LSTM framework for our models,
one could also consider alternative architectures. Importantly,
the network must be able to cover the temporal context –
this makes it difficult to directly combine CTC loss with
convolution-only models such as SlowFast [33]. While CTC
loss is traditionally combined with RNNs for this reason,
Transformers have more recently emerged as another feasible
choice [33]. Another topic to be explored in future research is
the effect of choosing different window sizes on model training
and performance.
This work also has several other implications for future
research. We have shown a feasible way of localizing intake
gestures while simultaneously classifying them as eating or
drinking. Given larger video datasets with a greater variety of food types and associated labels, future research could explore more fine-grained classification of different foods and gestures. The
necessity of large datasets has been pointed out [34] and
detailed food classes are in fact available for the Clemson
dataset, but tentative experiments indicated that inertial sensor
data may not be sufficiently expressive to yield satisfactory
results for food detection. Another implication directly has
to do with the practical task of labelling future datasets.
When working with CTC loss, events do not need to be
painstakingly labelled with a start and end timestamp. Instead,
it is sufficient to mark the apex of the gesture – similar to
how the single-stage approach makes detections – which has
the potential to significantly reduce the labelling workload and
reduce ambiguity around determining the exact start and end
times of intake gestures.
ACKNOWLEDGMENT
We gratefully acknowledge the support by the Bill & Melinda Gates Foundation [OPP1171389]. This work was additionally supported by an Australian Government Research Training Program (RTP) Scholarship.
REFERENCES
[1] G. Block, “A review of validations of dietary assessment methods,” Am.
J. Epidemiology, vol. 115, no. 4, pp. 492–505, 1982.
[2] S. W. Lichtman, K. Pisarska, E. R. Berman, M. Pestone, H. Dowling,
E. Offenbacher, H. Weisel, S. Heshka, D. E. Matthews, and S. B. Heyms-
field, “Discrepancy between self-reported and actual caloric intake and
exercise in obese subjects,New England J. Medicine, vol. 327, no. 27,
pp. 1893–1898, 1992.
[3] T. Vu, F. Lin, N. Alshurafa, and W. Xu, “Wearable food intake
monitoring technologies: A comprehensive review,” Computers, vol. 6,
no. 1, pp. 1–28, 2017.
[4] Y. Dong, A. Hoover, J. Scisco, and E. Muth, “A new method for
measuring meal intake in humans via automated wrist motion tracking,”
Applied Psychophysiology and Biofeedback, vol. 37, no. 3, pp. 205–215,
2012.
[5] K. Kyritsis, C. Diou, and A. Delopoulos, “Food intake detection from inertial sensors using LSTM networks,” in Proc. Int. Conf. Image Analysis and Processing, 2017, pp. 411–418.
[6] P. V. Rouast and M. T. P. Adam, “Learning deep representations for
video-based intake gesture detection,” IEEE J. Biomedical and Health
Informatics, vol. 24, no. 6, pp. 1727–1737, 2020.
[7] J. Qiu, F. P.-W. Lo, and B. Lo, “Assessing individual dietary intake in
food sharing scenarios with a 360 camera and deep learning,” in Proc.
Int. Conf. Wearable and Implantable Body Sensor Networks, 2019, pp.
1–4.
[8] P. V. Rouast, H. Heydarian, M. T. P. Adam, and M. Rollo, “OREBA: A dataset for objectively recognizing eating behaviour and associated intake,” IEEE Access, vol. 8, pp. 181955–181963, 2020.
[9] K. Kyritsis, C. Diou, and A. Delopoulos, “Modeling wrist micromove-
ments to measure in-meal eating behavior from inertial sensor data,”
IEEE J. Biomedical and Health Informatics, vol. 23, no. 6, pp. 2325–
2334, 2020.
[10] H. Heydarian, P. V. Rouast, M. T. P. Adam, T. Burrows, and M. E. Rollo,
“Deep learning for intake gesture detection from wrist-worn inertial
sensors: The effects of preprocessing, sensor modalities, and sensor
positions,” IEEE Access, vol. 8, pp. 164936–164949, 2020.
[11] D.-A. Huang, L. Fei-Fei, and J. C. Niebles, “Connectionist temporal
modeling for weakly supervised action labeling,” in Proc. Europ. Conf.
Comput. Vision, 2016, pp. 137–153.
[12] H. Heydarian, M. Adam, T. Burrows, C. Collins, and M. E. Rollo,
“Assessing eating behaviour using upper limb mounted motion sensors:
A systematic review,” Nutrients, vol. 11, no. 1168, pp. 1–25, 2019.
[13] O. Amft, M. Stager, P. Lukowicz, and G. Troster, “Analysis of chewing
sounds for dietary monitoring,” in Proc. UbiComp, 2005, pp. 56–72.
[14] O. Amft, M. Kusserow, and G. Troster, “Bite weight prediction from
acoustic recognition of chewing,” IEEE Trans. Biomedical Eng., vol. 56,
no. 6, pp. 1663–1672, 2009.
[15] R. Zhang and O. Amft, “Monitoring chewing and eating in free-living
using smart eyeglasses,” IEEE J. Biomedical and Health Informatics,
vol. 22, no. 1, pp. 23–32, 2018.
[16] ——, “Retrieval and timing performance of chewing-based eating event
detection in wearable sensors,” Sensors, vol. 20, no. 2, p. 557, 2020.
[17] E. S. Sazonov and J. M. Fontana, “A sensor system for automatic
detection of food intake through non-invasive monitoring of chewing,”
IEEE Sensors Journal, vol. 12, no. 5, pp. 1340–1348, 2012.
[18] D. Konstantinidis, K. Dimitropoulos, B. Langlet, P. Daras, and
I. Ioakimidis, “Validation of a deep learning system for the full automa-
tion of bite and meal duration analysis of experimental meal videos,”
Nutrients, vol. 12, no. 209, pp. 1–16, 2020.
[19] K. Kyritsis, C. Diou, and A. Delopoulos, “A data driven end-to-end
approach for in-the-wild monitoring of eating behavior using smart-
watches,” IEEE J. Biomedical and Health Informatics, pp. 1–13, 2020.
[20] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, “Connection-
ist temporal classification: labelling unsegmented sequence data with
recurrent neural networks,” in Proc. ICML, 2006, pp. 369–376.
[21] A. Graves, “Supervised sequence labelling with recurrent neural net-
works,” Ph.D. dissertation, Technische Universität München, 2008.
[22] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with
recurrent neural networks,” in Proc. Int. Conf. Mach. Learn., 2014, pp.
1764–1772.
[23] M. Liwicki, A. Graves, S. Fernández, H. Bunke, and J. Schmidhuber, “A
novel approach to on-line handwriting recognition based on bidirectional
long short-term memory networks,” in Proc. Int. Conf. Document
Analysis and Recognition, 2007, pp. 1–5.
[24] Y. M. Assael, B. Shillingford, S. Whiteson, and N. De Freitas, “Lipnet:
End-to-end sentence-level lipreading,” arXiv preprint arXiv:1611.01599,
2016.
[25] Q. Dai, J. Hou, P. Yang, X. Li, F. Wang, and X. Zhang, “Demo:
The sound of silence: End-to-end sign language recognition using
smartwatch,” in Proc. MobiCom, 2017, pp. 462–464.
[26] Q. Zhang, D. Wang, R. Zhao, and Y. Yu, “Myosign: Enabling end-to-end
sign language recognition with wearables,” in Proc. Int. Conf. Intelligent
User Interfaces, 2019, pp. 650–660.
[27] A. Hannun, “Sequence modeling with CTC,” Distill, 2017.
[28] The TensorFlow Authors, “TensorFlow API Docs: tf.nn.ctc_loss,” 2020. [Online]. Available: https://www.tensorflow.org/api_docs/python/tf/nn/ctc_loss
[29] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proc. CVPR, 2016, pp. 770–778.
[30] A. Hannun, A. L. Maas, D. Jurafsky, and A. Y. Ng, “First-pass large
vocabulary continuous speech recognition using bi-directional recurrent
DNNs,” arXiv preprint arXiv:1408.2873, 2014.
[31] Y. Shen, J. Salley, E. Muth, and A. Hoover, “Assessing the accuracy of
a wrist motion tracking method for counting bites across demographic
and food variables,” IEEE J. Biomedical and Health Informatics, vol. 21,
no. 3, pp. 599–606, 2017.
[32] S. Madgwick, “An efficient orientation filter for inertial and iner-
tial/magnetic sensor arrays,” University of Bristol (UK), Tech. Rep.,
2010.
[33] J. Salazar, K. Kirchhoff, and Z. Huang, “Self-attention networks for
connectionist temporal classification in speech recognition,” in Proc.
ICASSP, 2019.
[34] Y. Shen, E. Muth, and A. Hoover, “The impact of quantity of
training data on recognition of eating gestures,” arXiv preprint
arXiv:1812.04513, 2018.
Philipp V. Rouast received the B.Sc. and M.Sc.
degrees in Industrial Engineering from Karlsruhe
Institute of Technology, Germany, in 2013 and
2016 respectively. He is currently working towards
the Ph.D. degree in Information Systems and is
a graduate research assistant at The University of
Newcastle, Australia. His research interests include
deep learning, affective computing, HCI, and re-
lated applications of computer vision. Find him at
https://www.rouast.com.
Marc T. P. Adam received the undergraduate degree
in computer science from the University of Applied
Sciences Würzburg, Germany, and the Ph.D. degree
in information systems from the Karlsruhe Insti-
tute of Technology, Germany. He is currently an
Associate Professor of computing and information
technology with The University of Newcastle, Aus-
tralia. His research interests include human-centered
computing with applications in business, education,
and health. He is a Founding Member of the Society
for NeuroIS.