A Multi-modal Machine Learning Approach and
Toolkit to Automate Recognition of Early Stages
of Dementia among British Sign Language Users
Anonymous ECCV submission
Paper ID 10
Abstract. The ageing population trend is correlated with an increased prevalence of acquired cognitive impairments such as dementia. Although there is no cure for dementia, a timely diagnosis helps in obtaining necessary support and appropriate medication. Researchers are working urgently to develop effective technological tools that can help doctors undertake early identification of cognitive disorder. In particular, screening for dementia in ageing Deaf signers of British Sign Language (BSL) poses additional challenges, as the diagnostic process is bound up with conditions such as the quality and availability of interpreters, as well as appropriate questionnaires and cognitive tests. On the other hand, deep learning based approaches for image and video analysis and understanding are promising, particularly the adoption of Convolutional Neural Networks (CNNs), which require large amounts of training data. In this paper, however, we demonstrate novelty in the following ways: a) a multi-modal machine learning based automatic recognition toolkit for early stages of dementia among BSL users, in which features from several parts of the body contributing to the sign envelope, e.g. hand-arm movements and facial expressions, are combined; b) universality, in that it is possible to apply our technique to users of any sign language, since it is language independent; c) given the trade-off between the complexity and accuracy of machine learning (ML) prediction models, as well as the limited amount of training and testing data available, we show that our approach is not over-fitted and has the potential to scale up.
Keywords: Hand Tracking, Facial Analysis, Convolutional Neural Network, Machine Learning, Sign Language, Dementia
1 Introduction
British Sign Language (BSL) is a natural human language, which, like other sign languages, uses movements of the hands, body and face for linguistic expression. Recognising dementia in signers of BSL, however, is still an open research field, since there is very little information available about dementia in this population. This is exacerbated by the fact that there are few clinicians with appropriate communication skills and experience working with BSL users. Diagnosis of dementia is subject to the quality of cognitive tests and BSL
interpreters alike. Hence, the Deaf community currently receives unequal access to diagnosis and care for acquired neurological impairments, with consequent poorer outcomes and increased care costs [4].
Facing this challenge, we outline a deep learning based methodological approach and have developed a toolkit capable of automatically recognising early stages of dementia without the need for sign language translation or interpretation. Our approach and tool were inspired by the following two key cross-disciplinary knowledge contributors:
a) Recent clinical observations suggesting that there may be differences between signers with dementia and healthy signers with regard to the envelope of sign space (sign trajectories/depth/speed) and expressions of the face. These clinical observations indicate that signers who have dementia use restricted sign space and limited facial expression compared to healthy deaf controls. In this context, we did not focus only on hand movements, but also on other features of the BSL user's body, e.g. facial expressions.
b) Recent advances in machine learning based approaches spearheaded by CNNs, also known as the deep learning approach. These, however, cannot be applied without taking into consideration contextual restrictions such as the availability of large training datasets and the lack of real-world test data. We introduce a deep learning based sub-network for feature extraction together with the CNN approach for diagnostic classification, which yields better performance and is a good alternative for handling limited data.
In this context, we propose a multi-featured machine learning methodological approach paving the way to the development of a toolkit. The promising results of its application to screening for dementia among BSL users lie in using features other than those bound to overt cognitive testing via language translation and interpretation. Our methodological approach comprises several stages. The first stage of research focuses on analysing the motion patterns of the sign space envelope in terms of sign trajectory and sign speed, by deploying a real-time hand movement trajectory tracking model [1] based on the OpenPose¹,² library. The second stage involves the extraction of the facial expressions of deaf signers, by deploying a real-time facial analysis model based on the dlib library³ to identify active and non-active facial expressions. The third stage traces the elbow joint distribution, also based on the OpenPose library, as an additional feature related to the sign space envelope. Based on the differences in patterns obtained from the facial and trajectory motion data, the further stage of research implements both the VGG16 [26] and ResNet-50 [13] networks, using transfer learning from image recognition tasks to incrementally identify and improve recognition rates for Mild Cognitive Impairment (MCI) (i.e. pre-dementia). Performance evaluation of the research work is based on datasets available from the Deafness Cognition and Language Research Centre (DCAL) at UCL, which has a range of video recordings of over 500 signers who have volunteered to participate in research.
¹ https://github.com/CMU-Perceptual-Computing-Lab/openpose
² https://github.com/ildoonet/tf-pose-estimation
³ http://dlib.net/
It should be noted that, as the deaf BSL-using population is estimated to be around 50,000, the size of this database is equivalent to 1% of the deaf population. Figure 1 shows the pipeline and a high-level overview of the network design.
Fig. 1. The Proposed Pipeline for Dementia Screening
The main contributions of this paper are as follows:
1. We outline a methodology for the preliminary identification of early stage dementia among BSL users based on sign language independent features such as:
   - an accurate and robust real-time hand trajectory tracking model, in which both sign trajectory (to extract the sign space envelope) and sign speed (to identify acquired neurological impairment associated with motor symptoms) are tracked;
   - a real-time facial analysis model that can identify and differentiate active and non-active facial expressions of a signer;
   - an elbow distribution model that can identify the motion characteristics of the elbow joint during signing.
2. We present an automated screening toolkit for early stage dementia assessment with good test set performance: 87.88% accuracy, 0.93 ROC, and 0.87 F1 score for positive MCI/dementia screening results. As the proposed system uses normal 2D videos without requiring any ICT/medical facilities setup, it is economical, simple, flexible, and adaptable.
The paper is structured as follows: Section 2 gives an overview of related work. Section 3 outlines the methodological approach, followed by Section 4 with a discussion of the experimental design and results. A conclusion provides a summary of the key contributions and results of this paper.
2 Related Work
Recent advances in computer vision and the greater availability of medical imaging with improved quality have increased the opportunities to develop deep learning approaches for the automated detection and quantification of diseases such as Alzheimer's disease and dementia [24]. Many of these techniques have been applied to the classification of MR imaging, CT scan imaging, FDG-PET scan imaging, or combinations of the above, comparing MCI patients to healthy controls in order to distinguish different types or stages of MCI and accelerated features of ageing [29, 27, 19, 14]. Jo et al. [16] reviewed deep learning papers on Alzheimer's disease (published between January 2013 and July 2018), concluding that four of the studies used a combination of deep learning and traditional machine learning approaches, and twelve used deep learning approaches alone. Given the currently limited dataset, we likewise found that combining deep learning approaches for diagnostic classification with traditional machine learning methods for feature extraction yielded better performance.
In terms of dementia diagnosis [3], there have been increasing applications of various machine learning approaches, most commonly with imaging data for diagnosis and disease progression [21, 10, 15], and less frequently in non-imaging studies focused on demographic data, cognitive measures [6], and unobtrusive monitoring of gait patterns over time [11]. In [11], it is suggested that walking speed and its daily variability may be an early marker of the development of MCI. These and other real-time measures of function may offer novel ways of detecting transition phases leading to dementia, and could be a potential research extension of our toolkit, since the real-time hand trajectory tracking sub-model also has the potential to track a patient's daily walking pattern and pose. AVEID, an interesting method introduced in [23], uses an automatic video system for measuring engagement in dementia, focusing on behaviour on observational scales and emotion detection. AVEID focuses on passive engagement via gaze and emotion detection, while our method focuses on sign and facial motion analysis during active signing conversation.
3 Methodology
In this paper, we present a multi-modal feature extraction sub-network inspired by practical clinical needs, together with the experimental findings associated with the sub-network. Each feature extraction model is discussed in greater detail in the following sub-sections; for each method we assume that the subjects are in front of the camera with only the face, upper body, and arms visible. The input to the system is short-term clipped videos. The different extracted motion features are fed into the CNN network to classify a BSL signer as healthy or atypical. We present the first phase of work on automatic assessment of early stage dementia, based on real-time hand movement trajectory motion patterns and focusing on performance comparisons between the VGG16 and ResNet-50 networks. Performance evaluation of the research work is based on
datasets available from the BSL Corpus⁴ at DCAL UCL, a collection of 2D video clips of 250 Deaf signers of BSL from 8 regions of the UK, and two additional datasets: a set of data collected for a previously funded project⁵, and a set of signer data collected for the present study.
⁴ British Sign Language Corpus Project: https://bslcorpusproject.org/.
⁵ Overcoming obstacles to the early identification of dementia in the signing Deaf community.
3.1 Dataset
From the video recordings, we selected 40 case studies of signers (20 male, 20 female) aged between 60 and 90 years: 21 signers considered to be healthy cases on the basis of the British Sign Language Cognitive Screen (BSL-CS); 9 signers identified as having Mild Cognitive Impairment (MCI) on the basis of the BSL-CS; and 10 signers diagnosed with mild MCI through clinical assessment. We consider those 19 cases as MCI (i.e. early dementia) cases, whether identified through the BSL-CS or clinically. Balanced datasets (21 Healthy, 19 MCI) were created in order to reduce the risk of a falsely perceived positive effect on accuracy due to bias towards one class. While this number may appear small, it represents around 2% of the deaf BSL-using population in this age range. As the video clip for each case is about 20 minutes in length, we segmented each into 4-5 short video clips of 4 minutes each and fed the segmented short video clips to the multi-modal feature extraction sub-network. The feasibility study and experimental findings discussed in Section 4.2 show that the segmented video clips represent the characteristics of individual signers. In this way, we were able to increase the size of the dataset from 40 cases to 162 clips. Of the 162, 79 have MCI, and 83 are cognitively healthy.
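As an illustration of this data preparation step, the sketch below splits a long recording into fixed-length clips with OpenCV. It is a minimal sketch only: the file names, output format, and frame-rate fallback are our illustrative choices, not part of the published toolkit.

```python
import cv2

def segment_video(path, out_prefix, segment_minutes=4):
    """Split one long recording into fixed-length clips (illustrative)."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back to 25 fps if unknown
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    frames_per_segment = int(fps * 60 * segment_minutes)
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer, idx, count = None, 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if count % frames_per_segment == 0:   # start a new 4-minute clip
            if writer is not None:
                writer.release()
            idx += 1
            writer = cv2.VideoWriter("%s_%d.mp4" % (out_prefix, idx),
                                     fourcc, fps, (w, h))
        writer.write(frame)
        count += 1
    if writer is not None:
        writer.release()
    cap.release()
```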
3.2 Real-time Hand Trajectory Tracking Model
OpenPose, developed by Carnegie Mellon University, is one of the state-of-the-art methods for human pose estimation, processing images through a two-branch multi-stage CNN [7]. The real-time hand movement trajectory tracking model is developed on top of the OpenPose Mobilenet Thin model [22]; a detailed evaluation of tracking performance is discussed in [1]. The inputs to the system are brief clipped videos, and the tracking model outputs only 14 upper-body keypoints per image: eyes, nose, ears, neck, shoulders, elbows, wrists, and hips. The hand movement trajectory is obtained via wrist joint motion trajectories. The curve of the hand movement trajectory is connected by the locations of the wrist joint keypoints, tracking left- and right-hand limb movements across sequential video frames in a rapid and unique way. Figure 2 (top) demonstrates the tracking process for the sign FARM. Figure 2 (bottom) shows the left- and right-hand trajectories obtained from the tracking model, plotted by wrist location X and Y coordinates over time in a 2D plot. It shows how hand motion changes over time, which gives a clear indication of hand movement speed (X-axis and Y-axis speeds based on 2D coordinate changes).
Fig. 2. Real-time Hand Trajectory Tracking (top) and 2D Left- and Right-Hand Trajectory (bottom)
A spiky trajectory indicates more changes within a shorter period, and thus faster hand movement. Hand movement speed patterns can therefore be easily identified, helping to analyse acquired neurological impairments associated with motor symptoms (i.e. slower movement), as in Parkinson's disease.
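A minimal sketch of how such trajectory plots can be produced is given below. It assumes the per-frame wrist keypoints have already been extracted with OpenPose into (T, 2) arrays of (x, y) coordinates; the array layout and function name are illustrative, not the authors' code.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_hand_trajectories(left_wrist, right_wrist):
    """left_wrist, right_wrist: arrays of shape (T, 2), one (x, y) per frame."""
    t = np.arange(len(left_wrist))
    fig, axes = plt.subplots(2, 1, sharex=True)
    for ax, traj, name in zip(axes, (left_wrist, right_wrist), ("left", "right")):
        # Spiky curves = frequent coordinate changes = faster hand movement.
        ax.plot(t, traj[:, 0], label="%s wrist X" % name)
        ax.plot(t, traj[:, 1], label="%s wrist Y" % name)
        ax.set_ylabel("pixel coordinate")
        ax.legend()
    axes[-1].set_xlabel("frame")
    fig.savefig("hand_trajectory.png")
```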
3.3 Real-time Facial Analysis Model
The facial analysis model was implemented on top of the facial landmark detector inside the dlib library, in order to analyse a signer's facial expressions [17]. The face detector uses the classic Histogram of Oriented Gradients (HOG) feature combined with a linear classifier, an image pyramid, and a sliding-window detection scheme. The pre-trained facial landmark detector is used to estimate the locations of 68 (x, y) coordinates that map to facial features (Figure 3).
Fig. 3. Facial Motion Tracking of a Signer
As shown in Figure 4⁶, earlier psychological research [8] identified seven universal
common facial expressions: Happiness, Sadness, Fear, Disgust, Anger, Contempt, and Surprise. Facial muscle movements for these expressions involve the lips and brows (Figure 4). The facial analysis model was therefore implemented to extract subtle facial muscle movements, by calculating the average Euclidean distance differences between the nose and right brow (d1), the nose and left brow (d2), and the upper and lower lips (d3) for a given signer over a sequence of video frames (Figure 3). The vector [d1, d2, d3] is an indicator of a signer's facial expression and is used to classify a signer as having an active or non-active facial expression.
⁶ https://www.eiagroup.com/knowledge/facial-expressions/
Fig. 4. Common Facial Expressions
$d_i = \frac{\sum_{t=1}^{T} \left| d_i^{t+1} - d_i^{t} \right|}{T}, \quad i \in \{1, 2, 3\}$   (1)

where $T$ is the total number of frames in which facial landmarks are detected.
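A minimal sketch of Eq. (1) using the standard dlib 68-landmark model follows. The specific landmark indices (nose tip 30, brow mid-points 19 and 24, outer lip points 51 and 57) are our assumptions for illustration; the paper does not state which landmarks were used.

```python
import numpy as np
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def frame_distances(gray_frame):
    """Per-frame [d1, d2, d3], or None if no face is detected."""
    faces = detector(gray_frame)
    if not faces:
        return None
    pts = predictor(gray_frame, faces[0])
    p = lambda i: np.array([pts.part(i).x, pts.part(i).y], dtype=float)
    d1 = np.linalg.norm(p(30) - p(19))   # nose tip to right brow (assumed indices)
    d2 = np.linalg.norm(p(30) - p(24))   # nose tip to left brow
    d3 = np.linalg.norm(p(51) - p(57))   # upper to lower lip
    return d1, d2, d3

def expression_vector(dists):
    """dists: (T, 3) array of per-frame [d1, d2, d3]; implements Eq. (1)."""
    d = np.asarray(dists)
    return np.abs(np.diff(d, axis=0)).sum(axis=0) / len(d)
```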
3.4 Elbow Distribution Model
The elbow distribution model extracts and represents the motion characteristics of elbow joint movement during signing, based on OpenPose upper-body keypoints. The Euclidean distance $d$ is calculated between the elbow joint coordinate and a relative midpoint of the body in a given frame. This is illustrated in Figure 5(a), where the midpoint location in the frame is made up of the x-coordinate of the neck and the y-coordinate of the elbow joint. If $J^t_{e,n}$ represents the coordinates of the elbow and neck joints $(e, n)$ at time $t$, i.e. $J^t_{e,n} = [X^t_{e,n}, Y^t_{e,n}]$, then $d$ is the distance descriptor

$d = \sqrt{(X_n^t - X_e^t)^2 + (Y_n^t - Y_e^t)^2}$   (2)

computed for each frame, resulting in $N$ distances $d$, where $N$ is the number of frames. In order to obtain a distribution representation of elbow motion, a virtual coordinate origin is created at the mean distance $d_\mu = \frac{\sum_{i=1}^{N} d_i}{N}$, which can be seen as the resting position of the elbow.
Fig. 5. (a) Elbow tracking distance from the midpoint. (b) Shifted coordinates with the mean distance calculated
A relative distance is then calculated from this origin $d_\mu$ to the elbow joint for each frame, resulting in the many distances shown as orange dots in Figure 5(b). If the relative distance is < 0, the elbow is closer to the body than the resting distance; if it is > 0, it is further away. This is a much better representation of elbow joint movement, as it distinguishes between near and far elbow motion. These points can be represented by a histogram, which can then be fed into the CNN model as an additional feature.
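A minimal sketch of this feature follows, assuming per-frame neck and elbow keypoints from OpenPose are already available as (N, 2) arrays and following Eq. (2) literally (the Euclidean distance between the neck and elbow keypoints); names and the bin count are illustrative.

```python
import numpy as np

def elbow_histogram(neck_xy, elbow_xy, bins=30):
    """neck_xy, elbow_xy: arrays of shape (N, 2); returns a histogram of
    elbow distances relative to the mean (resting) distance."""
    d = np.linalg.norm(neck_xy - elbow_xy, axis=1)   # Eq. (2), one d per frame
    d_rel = d - d.mean()                             # shift origin to d_mu:
                                                     # < 0 near body, > 0 far
    hist, edges = np.histogram(d_rel, bins=bins)
    return hist, edges                               # histogram fed to the CNN
```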
3.5 CNN Models
In this section, we summarise the architecture of the VGG16 and ResNet-50 networks implemented for early dementia classification, focusing on data pre-processing, an architecture overview, and transfer learning in model training.
Data Preprocessing Prior to classification, we first vertically stack a signer's left-hand trajectory image over the associated right-hand trajectory image obtained from the real-time hand trajectory tracking model, and label the 162 stacked input trajectory images as pairs

$(X, Y) = \{(X_1, Y_1), \ldots, (X_i, Y_i), \ldots, (X_N, Y_N)\}, \quad N = 162$   (3)

where $X_i$ is the $i$-th observation (image dimension: 1400 × 1558 × 3) from the MCI and Healthy datasets, with corresponding class label $Y_i \in \{0, 1\}$: early MCI (dementia) as class 0 and Healthy as class 1. The input images are further normalized by subtracting the ImageNet data mean, and the input shape dimensions are changed to 224 × 224 × 3 for the Keras deep learning CNN networks.
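A minimal sketch of this preprocessing is shown below, using the Keras VGG16 preprocess_input utility for the ImageNet mean subtraction; the file-path arguments and function name are illustrative.

```python
import numpy as np
import cv2
from keras.applications.vgg16 import preprocess_input

def make_input(left_traj_path, right_traj_path):
    # Load trajectory plots and convert OpenCV's BGR channel order to RGB,
    # which the Keras preprocessing utility expects.
    left = cv2.cvtColor(cv2.imread(left_traj_path), cv2.COLOR_BGR2RGB)
    right = cv2.cvtColor(cv2.imread(right_traj_path), cv2.COLOR_BGR2RGB)
    stacked = np.vstack([left, right])            # left hand stacked over right
    resized = cv2.resize(stacked, (224, 224))     # 224 x 224 x 3 network input
    return preprocess_input(resized.astype("float32"))  # ImageNet mean subtraction
```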
VGG16 and ResNet-50 Architecture In our approach, we use VGG16 and ResNet-50 as base models, with transfer learning to transfer the parameters pre-trained on the 1000-class ImageNet object recognition task to the recognition of
hand movement trajectory images for early MCI screening. Figure 6 shows the network architecture that we implemented, fine-tuning VGG16 and training ResNet-50 as a classifier alone.
1) VGG16 Architecture: The VGG16 network [26], with 13 convolutional and 3 fully connected (FC) layers, i.e. 16 trainable weight layers, was the basis of the Visual Geometry Group (VGG) submission to the ImageNet Challenge 2014, achieving 92.7% top-5 test accuracy and securing first place in the localization track and second place in the classification track. Due to the very small dataset, we fine-tune the VGG16 network by freezing the convolutional (Conv) layers and two fully connected (FC) layers, retraining only the last two layers, with 524,674 trainable parameters in total (see Figure 6). Subsequently, a softmax layer for binary classification is applied to discriminate the two labels, Healthy and MCI, producing two numerical values that sum to 1.0.
Several strategies are used to combat overfitting. A dropout layer is implemented after the last FC layer [28], randomly dropping 40% of the units and their connections during training. An intuitive explanation of its efficacy is that each unit learns to extract useful features on its own with different sets of randomly chosen inputs. As a result, each hidden unit is more robust to random fluctuations and learns a generally useful transformation. Moreover, EarlyStopping is used to halt training at the right time to avoid overfitting: the EarlyStopping callback is configured to monitor the loss on the validation dataset with the patience argument set to 15, so training is stopped after 15 epochs with no improvement on the validation dataset.
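The head layout below is a sketch consistent with the figures reported above: freezing all VGG16 layers and adding a Dense(128) layer plus a binary softmax reproduces exactly the 524,674 trainable parameters stated (4096·128 + 128 + 128·2 + 2), although the exact layer sizes are not spelled out in the text, so this layout is our assumption.

```python
from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.layers import Dense, Dropout

base = VGG16(weights="imagenet", include_top=True)   # Conv layers plus fc1/fc2
for layer in base.layers:
    layer.trainable = False                          # freeze Conv and FC layers
x = base.get_layer("fc2").output                     # 4096-d feature vector
x = Dense(128, activation="relu")(x)                 # 4096*128 + 128 = 524,416 params
x = Dropout(0.4)(x)                                  # drop 40% of units in training
out = Dense(2, activation="softmax")(x)              # Healthy vs. MCI: 258 params
model = Model(inputs=base.input, outputs=out)        # 524,674 trainable in total
```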
2) ResNet-50 Architecture: Residual Networks (ResNets) [13] introduce skip connections that bypass blocks of convolutional layers, forming residual blocks. These stacked residual blocks greatly improve training efficiency and largely resolve the vanishing gradient problem present in deep networks. This model won the ImageNet challenge in 2015; the top-5 accuracy for ResNet-50 is 93.29%. As complex models with many parameters are more prone to overfitting on a small dataset, we train ResNet-50 as a classifier alone rather than fine-tuning it (see Figure 6): only a softmax layer for binary classification is applied, introducing 4,098 trainable parameters. The EarlyStopping callback is likewise configured to halt training in order to avoid overfitting.
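Correspondingly, training ResNet-50 "as a classifier alone" can be sketched as below: the frozen, globally pooled 2048-d ResNet-50 features feed a single binary softmax, which matches the 4,098 (2048·2 + 2) trainable parameters reported; the pooling choice is our assumption.

```python
from keras.applications.resnet50 import ResNet50
from keras.models import Model
from keras.layers import Dense

base = ResNet50(weights="imagenet", include_top=False,
                pooling="avg", input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False                          # no fine-tuning of the backbone
out = Dense(2, activation="softmax")(base.output)    # 2048*2 + 2 = 4,098 params
model = Model(inputs=base.input, outputs=out)
```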
4 Experiments and Analysis
4.1 Implementation
The networks described above were constructed using Python 3.6.8, OpenCV 3.4.2 [2], and TensorFlow 1.12. VGG16 and ResNet-50 were built with the Keras deep learning library [9], using TensorFlow as the backend. We employed a Windows desktop with two Nvidia GeForce GTX 1080Ti adapter cards and a 3.3 GHz Intel Core i9-7900X CPU with 16 GB RAM. During training, dropout was deployed in the fully connected layers and EarlyStopping was used to avoid overfitting. To accelerate the training process and avoid local minima, we used the Adam algorithm with its default parameter settings (learning rate = 0.001, beta_1 = 0.9, beta_2 = 0.999) as the training optimizer [18].
Fig. 6. VGG16 and ResNet-50 Architecture
Batch size was set to 3 when training the VGG16 network and 1 when training the ResNet-50 network, as small mini-batch sizes provide more up-to-date gradient calculations and yield more stable and reliable training [5, 20]. Training took several ms per epoch, with ResNet-50 quicker than VGG16 because of its smaller number of trainable parameters. While an ordinary training schedule contains 100 epochs, in most cases the training loss converged within 40 epochs for VGG16 and 5 epochs for ResNet-50. During training, the parameters of the networks were saved via Keras callbacks, with EarlyStopping monitored so that the best weights were retained; these parameters were later used to run the test and validation sets. During test and validation, the accuracies and Receiver Operating Characteristic (ROC) curves of the classification were calculated, and the network with the highest accuracy and area under the ROC curve was chosen as the final classifier.
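Putting the stated settings together, a sketch of the training configuration might look as follows; `model`, `x_train`, and `y_train` come from the preceding sketches, and the checkpoint file name and validation split are our illustrative choices.

```python
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint

model.compile(optimizer=Adam(lr=0.001, beta_1=0.9, beta_2=0.999),
              loss="categorical_crossentropy", metrics=["accuracy"])

callbacks = [
    EarlyStopping(monitor="val_loss", patience=15),     # stop after 15 idle epochs
    ModelCheckpoint("best_weights.h5", monitor="val_loss",
                    save_best_only=True, save_weights_only=True),
]
model.fit(x_train, y_train, batch_size=3, epochs=100,   # batch size 1 for ResNet-50
          validation_split=0.2, callbacks=callbacks)
```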
4.2 Results and Discussion
Experiment Findings In Figure 7, the feature extraction results show that in a greater number of cases a signer with MCI produces a sign trajectory that resembles a straight line rather than the spiky trajectory characteristic of a healthy signer. In other words, signers with MCI produced more static poses/pauses during signing, with a reduced sign space envelope indicated by smaller amplitude differences between the top and bottom peaks of the X and Y trajectory lines. At the same time, the Euclidean distance d3 of healthy signers is larger than that of MCI signers, indicating more active facial movements by healthy signers. This supports the clinical observation of differences between signers with MCI and healthy signers in the envelope of sign space and facial movements, with the former using a smaller sign space and limited facial expression. In addition to space and facial expression, the elbow distribution model demonstrates restricted movement around the elbow axis, with a lower standard deviation and a skewed distribution for the MCI signer compared to the healthy signer, whose distribution is normal (Figure 8).
Fig. 7. Experiment Findings
Fig. 8. The top row shows signing space for a healthy (left) and an MCI (right) signer.
The bottom row shows the acquired histograms and normal probability plots for both
hands. For data protection purposes both faces have been covered.
Performance Evaluation In this section, we perform a comparative study of the VGG16 and ResNet-50 networks. Videos of 40 participants were segmented into short clips, giving 162 segmented cases for the training process. Those segmented samples were randomly partitioned into two subsets: 80% for the training set and 20% for the test set. To validate model performance, we also kept 6 cases separate (1 MCI and 5 healthy signers) that were not used in the training process, segmented into 24 cases for performance validation.
The validation set is skewed, being limited in MCI samples but richer in healthy samples, as more MCI samples were kept for the training/test process than for validation. Table 1 shows the effectiveness results over the 46 participants for the different networks; test set results are summarised in Figure 9 and Figure 10.

Table 1. Performance evaluation of VGG16 and ResNet-50 for early MCI screening. Train and test results are over 40 participants (21 Healthy, 19 early MCI); validation results are over 6 held-out participants (5 Healthy, 1 early MCI).

| Method    | Train ACC (129 segmented cases) | Test ACC (33 segmented cases) | Test ROC | Validation ACC (24 segmented cases) | Validation ROC |
|-----------|---------------------------------|-------------------------------|----------|-------------------------------------|----------------|
| VGG16     | 87.5969%                        | 87.8788%                      | 0.93     | 87.5%                               | 0.96           |
| ResNet-50 | 69.7674%                        | 69.6970%                      | 0.72     | 66.6667%                            | 0.73           |

The best performance metrics are achieved by VGG16, with an accuracy of 87.8788%, a micro-averaged ROC of 0.93 (ROC for MCI: 0.93, Healthy: 0.93), and F1 scores of 0.87 for MCI and 0.89 for Healthy. Therefore, VGG16 was selected as the baseline classifier, and validation was further performed on the 24 sub-cases from 6 participants. Table 2 summarises validation performance over the baseline classifier VGG16, with its ROC shown in Figure 11. In Table 2 there are two false positives and one false negative based on sub-case prediction, but the model makes correct high-confidence predictions on most of the sub-cases. If prediction confidence is averaged over all of the sub-cases from a participant, and the result is predicted at participant level, the model achieves 100% accuracy in validation performance.
Fig. 9. Test Set Confusion Matrix of VGG16 (left two) and ResNet-50 (right two)
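The participant-level decision rule described above amounts to averaging the softmax confidences of a participant's sub-case clips; a minimal sketch (array layout illustrative):

```python
import numpy as np

def participant_prediction(subcase_probs):
    """subcase_probs: (k, 2) array of per-clip [p_MCI, p_Healthy] confidences."""
    mean_probs = np.asarray(subcase_probs).mean(axis=0)
    return "MCI" if mean_probs[0] > mean_probs[1] else "Healthy"

# Participant 1 from Table 2: one sub-case votes MCI and four vote Healthy,
# and the averaged confidences give the correct participant-level call.
print(participant_prediction([[0.63, 0.37], [0.43, 0.57],
                              [0.39, 0.61], [0.27, 0.73], [0.40, 0.60]]))
# -> Healthy (mean p_MCI = 0.424 < mean p_Healthy = 0.576)
```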
Furthermore, since a deep learning network can easily become over-fitted on relatively small datasets, a comparison against simpler approaches such as logistic regression and SVM was also performed. As stated in [12], logistic regression and artificial neural networks are the models of choice in many medical data classification tasks, with one layer of hidden neurons generally sufficient for classifying most datasets. We therefore evaluated our datasets on a 2-layer shallow neural network with 80 neurons in the hidden layer and a logistic sigmoid activation as its output layer.
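A minimal sketch of such a shallow baseline, written in Keras for consistency with the rest of the pipeline; the hidden-layer activation and input flattening are our assumptions, as the text specifies only the 80 hidden neurons and the logistic sigmoid output.

```python
from keras.models import Sequential
from keras.layers import Flatten, Dense

shallow = Sequential([
    Flatten(input_shape=(224, 224, 3)),   # flatten the stacked trajectory image
    Dense(80, activation="relu"),         # single hidden layer of 80 neurons
    Dense(1, activation="sigmoid"),       # logistic sigmoid output
])
shallow.compile(optimizer="adam", loss="binary_crossentropy",
                metrics=["accuracy"])
```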
Fig. 10. Test Set ROC of VGG16 (left) and ResNet-50 (right)

Table 2. Validation performance over the baseline classifier (VGG16)

| Participant | Sub-case | Confidence (MCI) | Confidence (Healthy) | Sub-case Prediction | Participant Prediction | Ground Truth |
|-------------|----------|------------------|----------------------|---------------------|------------------------|--------------|
| 1 | 1-1 | 0.63 | 0.37 | MCI     | Healthy | Healthy |
| 1 | 1-2 | 0.43 | 0.57 | Healthy |         |         |
| 1 | 1-3 | 0.39 | 0.61 | Healthy |         |         |
| 1 | 1-4 | 0.27 | 0.73 | Healthy |         |         |
| 1 | 1-5 | 0.40 | 0.60 | Healthy |         |         |
| 2 | 2-1 | 0.13 | 0.87 | Healthy | Healthy | Healthy |
| 2 | 2-2 | 0.02 | 0.98 | Healthy |         |         |
| 2 | 2-3 | 0.56 | 0.44 | MCI     |         |         |
| 2 | 2-4 | 0.23 | 0.77 | Healthy |         |         |
| 3 | 3-1 | 0.08 | 0.92 | Healthy | Healthy | Healthy |
| 3 | 3-2 | 0.02 | 0.98 | Healthy |         |         |
| 3 | 3-3 | 0.02 | 0.98 | Healthy |         |         |
| 3 | 3-4 | 0.01 | 0.99 | Healthy |         |         |
| 4 | 4-1 | 0.09 | 0.91 | Healthy | Healthy | Healthy |
| 4 | 4-2 | 0.24 | 0.76 | Healthy |         |         |
| 4 | 4-3 | 0.16 | 0.84 | Healthy |         |         |
| 4 | 4-4 | 0.07 | 0.93 | Healthy |         |         |
| 5 | 5-1 | 0.01 | 0.99 | Healthy | Healthy | Healthy |
| 5 | 5-2 | 0.01 | 0.99 | Healthy |         |         |
| 5 | 5-3 | 0.00 | 1.00 | Healthy |         |         |
| 5 | 5-4 | 0.07 | 0.93 | Healthy |         |         |
| 6 | 6-1 | 0.93 | 0.07 | MCI     | MCI     | MCI     |
| 6 | 6-2 | 0.29 | 0.71 | Healthy |         |         |
| 6 | 6-3 | 0.91 | 0.09 | MCI     |         |         |

Fig. 11. Validation Set ROC on VGG16
Table 3. Comparing the deep neural network architecture against shallow networks

| Method           | Train Accuracy (%) | Test Accuracy (%) |
|------------------|--------------------|-------------------|
| VGG16            | 87.5969            | 87.8788           |
| Shallow Logistic | 86.4865            | 86.1538           |
| SVM              | 86.8725            | 73.8461           |
The accuracy comparison between shallow models (shallow logistic regression, SVM) and the deep CNN prediction model, presented in Table 3, shows that for smaller datasets shallow models are a reasonable alternative to deep learning models, since no significant improvement could be demonstrated. Deep learning models, however, have the potential to perform better in the presence of larger datasets [25]. Since we aspire to train and apply our model with increasingly larger amounts of data as they become available, our approach is well justified. The comparison also highlights that our ML prediction model is not over-fitted, despite the small amounts of training and testing data available.
5 Conclusions
We have outlined a multi-modal machine learning methodological approach and developed a toolkit for an automatic dementia screening system. The toolkit uses VGG16, focusing on analysing features from various body parts, e.g. facial expressions, comprising the sign space envelope of BSL users recorded in normal 2D videos. As part of our methodology, we report the experimental findings for the multi-modal feature extraction sub-network in terms of hand sign trajectory, facial motion, and elbow distribution, together with performance comparisons between two CNN models, ResNet-50 and VGG16. The experiments show the effectiveness of our machine learning based approach for early stage dementia screening. The results are validated against cognitive assessment scores, with a test set performance of 87.88%, and a validation set performance of 87.5% over sub-cases and 100% over participants. Owing to its key features of being economical, simple, flexible, and adaptable, the proposed methodological approach and the implemented toolkit have the potential for use with other sign languages, as well as in screening for other acquired neurological impairments associated with motor changes, such as stroke and Parkinson's disease, in both hearing and deaf people.
6 Funding
This work has been supported by the (name withheld to preserve the anonymity of the paper) Grant RPGF...\.. UK.
References
1. Authors paper.
2. OpenCV: https://opencv.org/
3. Astell, A., Bouranis, N., Hoey, J., Lindauer, A., Mihailidis, A., Nugent, C., Robillard, J.: Technology and dementia: The future is now. In: Dementia and Geriatric Cognitive Disorders 47(3), 131–139 (2019). doi:10.1159/000497800
4. Atkinson, J., Marshall, J., Thacker, A., Woll, B.: When sign language breaks down: Deaf people's access to language therapy in the UK. In: Deaf Worlds 18, 9–21 (2002)
5. Bengio, Y.: Practical recommendations for gradient-based training of deep architectures. In: Neural Networks: Tricks of the Trade (2012)
6. Bhagyashree, S.I., Nagaraj, K., Prince, M., Fall, C., Krishna, M.: Diagnosis of dementia by machine learning methods in epidemiological studies: a pilot exploratory study from South India. In: Social Psychiatry and Psychiatric Epidemiology 53(1), 77–86 (2018)
7. Cao, Z., Simon, T., Wei, S., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7291–7299 (2017)
8. Darwin, C., Ekman, P., Prodger, P.: The Expression of the Emotions in Man and Animals. 3rd edn. Harper Collins, London (1998)
9. Chollet, F., et al.: Keras: https://keras.io (2015)
10. Dallora, A., Eivazzadeh, S., Mendes, E., Berglund, J., Anderberg, P.: Machine learning and microsimulation techniques on the prognosis of dementia: A systematic literature review. In: PLoS One 12(6) (2017). doi:10.1371/journal.pone.0179804
11. Dodge, H., Mattek, N., Austin, D., Hayes, T., Kaye, J.: In-home walking speeds and variability trajectories associated with mild cognitive impairment. In: Neurology 78(24), 1946–1952 (2012)
12. Dreiseitl, S., Ohno-Machado, L.: Logistic regression and artificial neural network classification models: a methodology review. In: Journal of Biomedical Informatics 35, 352–359 (2002)
13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of Computer Vision and Pattern Recognition (CVPR) (2016)
14. Huang, Y., Xu, J., Zhou, Y., Tong, T., Zhuang, X., ADNI: Diagnosis of Alzheimer's disease via multi-modality 3D convolutional neural network. In: Frontiers in Neuroscience 13(509) (2019). doi:10.3389/fnins.2019.00509
15. Iizuka, T., Fukasawa, M., Kameyama, M.: Deep-learning-based imaging classification identified cingulate island sign in dementia with Lewy bodies. In: Scientific Reports 9(8944) (2019). doi:10.1038/s41598-019-45415-5
16. Jo, T., Nho, K., Saykin, A.: Deep learning in Alzheimer's disease: Diagnostic classification and prognostic prediction using neuroimaging data. In: Frontiers in Aging Neuroscience 11(220) (2019). doi:10.3389/fnagi.2019.00220
17. Kazemi, V., Sullivan, J.: One millisecond face alignment with an ensemble of regression trees. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014). doi:10.1109/CVPR.2014.241
18. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of the International Conference on Learning Representations (2015)
19. Lu, D., Popuri, K., Ding, G.W., Balachandar, R., Beg, M., ADNI: Multimodal and multiscale deep neural networks for the early diagnosis of Alzheimer's disease using structural MR and FDG-PET images. In: Scientific Reports 8(1), 5697 (2018)
20. Masters, D., Luschi, C.: Revisiting small batch training for deep neural networks. In: Proceedings of the International Conference on Learning Representations
21. Negin, F., Rodriguez, P., Koperski, M., Kerboua, A., Gonzàlez, J., Bourgeois, J., Chapoulie, E., Robert, P., Bremond, F.: PRAXIS: Towards automatic cognitive assessment using gesture. In: Expert Systems with Applications 106, 21–35 (2018)
22. OpenPose TensorFlow: https://github.com/ildoonet/tf-pose-estimation
23. Parekh, V., Foong, P.S., Zhao, S., Subramanian, R.: AVEID: Automatic video system for measuring engagement in dementia. In: Proceedings of the International Conference on Intelligent User Interfaces (IUI '18), pp. 409–413 (2018)
24. Pellegrini, E., Ballerini, L., Hernandez, M., Chappell, F., González-Castro, V., Anblagan, D., Danso, S., Maniega, S., Job, D., Pernet, C., Mair, G., MacGillivray, T., Trucco, E., Wardlaw, J.: Machine learning of neuroimaging to diagnose cognitive impairment and dementia: a systematic review and comparative analysis. In: Alzheimer's & Dementia: Diagnosis, Assessment & Disease Monitoring 10, 519–535 (2018)
25. Schindler, A., Lidy, T., Rauber, A.: Comparing shallow versus deep neural network architectures for automatic music genre classification. In: 9th Forum Media Technology (FMT2016) 1734, 17–21 (2016)
26. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Proceedings of the International Conference on Learning Representations (2015)
27. Spasov, S., Passamonti, L., Duggento, A., Liò, P., Toschi, N., ADNI: A parameter-efficient deep learning approach to predict conversion from mild cognitive impairment to Alzheimer's disease. In: NeuroImage 189, 276–287 (2019)
28. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. In: Journal of Machine Learning Research 15, 1929–1958 (2014)
29. Young, A., Marinescu, R., Oxtoby, N., Bocchetta, M., Yong, K., Firth, N., Cash, D., Thomas, D., Dick, K., Cardoso, J., van Swieten, J., Borroni, B., Galimberti, D., Masellis, M., Tartaglia, M., Rowe, J., Graff, C., Tagliavini, F., Frisoni, G., Laforce, R., Finger, E., de Mendonça, A., Sorbi, S., Warren, J., Crutch, S., Fox, N., Ourselin, S., Schott, J., Rohrer, J., Alexander, D., the Genetic FTD Initiative (GENFI), the Alzheimer's Disease Neuroimaging Initiative (ADNI): Uncovering the heterogeneity and temporal complexity of neurodegenerative diseases with subtype and stage inference. In: Nature Communications 9(4273) (2018). doi:10.1038/s41467-018-05892-0
... In the work by L. Xing et al. [250], authors propose using video features with their dataset, which contains recordings of individuals using sign language, both those with MCI and healthy individuals. The videos were preprocessed to extract various video features, including facial landmarks and poses. ...
Preprint
Full-text available
Cognitive decline is a natural part of aging, often resulting in reduced cognitive abilities. In some cases, however, this decline is more pronounced, typically due to disorders such as Alzheimer's disease. Early detection of anomalous cognitive decline is crucial, as it can facilitate timely professional intervention. While medical data can help in this detection, it often involves invasive procedures. An alternative approach is to employ non-intrusive techniques such as speech or handwriting analysis, which do not necessarily affect daily activities. This survey reviews the most relevant methodologies that use deep learning techniques to automate the cognitive decline estimation task, including audio, text, and visual processing. We discuss the key features and advantages of each modality and methodology, including state-of-the-art approaches like Transformer architecture and foundation models. In addition, we present works that integrate different modalities to develop multimodal models. We also highlight the most significant datasets and the quantitative results from studies using these resources. From this review, several conclusions emerge. In most cases, the textual modality achieves the best results and is the most relevant for detecting cognitive decline. Moreover, combining various approaches from individual modalities into a multimodal model consistently enhances performance across nearly all scenarios.
... As a challenging computer vision task, 3D human pose estimation aims to localize 3D human body keypoints in images and videos. 3D Human pose estimation has an important role in several vision tasks and applications such as action recognition [24,29,57,60], human mesh recovery [16,23], motion generation [51,56], sign language [17,22,26,34], augmented/virtual reality [1,4], and robotics [9][10][11][12]48]. In recent years, the biggest leap forward in 3D human pose estimation was including the temporal aspect and predicting entire sequences of skeletons at once [61,63]. ...
Preprint
We introduce a novel approach for 3D whole-body pose estimation, addressing the challenge of scale- and deformability- variance across body parts brought by the challenge of extending the 17 major joints on the human body to fine-grained keypoints on the face and hands. In addition to addressing the challenge of exploiting motion in unevenly sampled data, we combine stable diffusion to a hierarchical part representation which predicts the relative locations of fine-grained keypoints within each part (e.g., face) with respect to the part's local reference frame. On the H3WB dataset, our method greatly outperforms the current state of the art, which fails to exploit the temporal information. We also show considerable improvements compared to other spatiotemporal 3D human-pose estimation approaches that fail to account for the body part specificities. Code is available at https://github.com/valeoai/PAFUSE.
... Gündüz, C., et al. [129] cropped face regions and hand regions from original video frames and processed them in different streams. Liang, X., et al. [130] developed an automatic toolkit for British sign language recognition, which took hand-arms movements and facial expressions into consideration. Zheng, J., et al. [131] proposed a face highlight module to extract facial expression information and fused it with non-facial features. ...
Article
Full-text available
The Deaf are a large social group in society. Their unique way of communicating through sign language is often confined within their community due to limited understanding by individuals outside of this demographic. This is where sign language recognition (SLR) comes in to help normal people understand the meaning of sign language. In recent years, new methods of sign language recognition have been developed and achieved good results, so it is necessary to make a summary. This review mainly focuses on the introduction of sign language recognition techniques based on algorithms especially in recent years, including the recognition models based on traditional methods and deep learning approaches, sign language datasets, challenges and future directions in SLR. To make the method structure clearer, this article explains and compares the basic principles of different methods from the perspectives of feature extraction and temporal modelling. We hope that this review will provide some reference and help for future research in sign language recognition.
... 3D hand shape and texture reconstruction from a single RGB image is a challenging problem that has numerous applications such as human-machine interaction [1,2], virtual and augmented reality [3][4][5][6], and sign language translation [7]. In recent years, there has been significant progress in reconstructing 3D hand pose and shape from a monocular images [8][9][10][11][12][13][14][15][16]. ...
Preprint
Full-text available
Recently, deep learning based approaches have shown promising results in 3D hand reconstruction from a single RGB image. These approaches can be roughly divided into model-based approaches, which are heavily dependent on the model's parameter space, and model-free approaches, which require large numbers of 3D ground truths to reduce depth ambiguity and struggle in weakly-supervised scenarios. To overcome these issues, we propose a novel probabilistic model to achieve the robustness of model-based approaches and reduced dependence on the model's parameter space of model-free approaches. The proposed probabilistic model incorporates a model-based network as a prior-net to estimate the prior probability distribution of joints and vertices. An Attention-based Mesh Vertices Uncertainty Regression (AMVUR) model is proposed to capture dependencies among vertices and the correlation between joints and mesh vertices to improve their feature representation. We further propose a learning based occlusion-aware Hand Texture Regression model to achieve high-fidelity texture reconstruction. We demonstrate the flexibility of the proposed probabilistic model to be trained in both supervised and weakly-supervised scenarios. The experimental results demonstrate our probabilistic model's state-of-the-art accuracy in 3D hand and texture reconstruction from a single image in both training schemes, including in the presence of severe occlusions.
... 3D human pose estimation, which aims to predict the 3D coordinates of human joints from images or videos, is an important task with a wide range of applications, including augmented reality [4], sign language translation [15] and human-robot interaction [33], attracting a lot of attention in recent years [17,39,43,45]. Generally, the mainstream approach is to conduct 3D pose estimation in two stages: the 2D pose is first obtained with a 2D pose detector, and then 2D-to-3D lifting is performed (where the lifting process is the primary aspect that most recent works [1,10,11,13,25,46] focus on). ...
Preprint
Full-text available
Monocular 3D human pose estimation is quite challenging due to the inherent ambiguity and occlusion, which often lead to high uncertainty and indeterminacy. On the other hand, diffusion models have recently emerged as an effective tool for generating high-quality images from noise. Inspired by their capability, we explore a novel pose estimation framework (DiffPose) that formulates 3D pose estimation as a reverse diffusion process. We incorporate novel designs into our DiffPose that facilitate the diffusion process for 3D pose estimation: a pose-specific initialization of pose uncertainty distributions, a Gaussian Mixture Model-based forward diffusion process, and a context-conditioned reverse diffusion process. Our proposed DiffPose significantly outperforms existing methods on the widely used pose estimation benchmarks Human3.6M and MPI-INF-3DHP.
Chapter
En la presente obra, se intenta abonar a una discusión que trascienda la superficial dualidad entre filia y fobia. Por esa misma razón, se ha invitado a estudiosos de los más diversos campos de estudio con la finalidad de proponer una hoja de ruta para entender el impacto social de la inteligencia artificial. Solamente a través de un camino con estas características, nos será posible romper el caparazón de las simplificaciones, para darnos cuenta que el verdadero estado del arte es más profundo que la dicotomía. En el corazón de la inteligencia artificial existen verdadera-mente posibilidades técnicas de resolver problemas que acechan desde hace mucho tiempo a la humanidad, pero también plantea algunos riesgos que no podemos hacer a un lado. A lo largo de los apartados del presente libro, los autores plantean estos diversos escenarios en el campo de la ecología, la educación, la psicología, la agricultura, la medicina y hasta la filosofía o la historia.
Chapter
Full-text available
Es indudable que a partir del desarrollo de la inteligencia artificial (IA) y los sistemas de procesamiento de lenguaje natural (PLN), las computadoras comenzaron a comprender, interpretar y generar lenguaje humano en sus diversas formas, sin embargo, estos sistemas cuentan con un mayor desarrollo en relación con las lenguas orales que a las lenguas de modalidad visogestual. Esto responde a la colección de datos lingüísticos que contamos en la actualidad, la cual, es significativamente mayor en lenguas orales que señadas. Es decir, que el desarrollo de la IA y del PLN tiene en la actualidad avances importantes en cuanto a la capacidad de las máquinas para interpretar y generar texto en diversos contextos, desde traductores automáticos hasta asistentes virtuales, pasando por chatbots. En este capítulo discutimos la importancia del reconocimiento auto-mático de las lenguas de modalidad visogestual, vital para la inclusión y comunicación efectiva de las personas sordas. En primer lugar, estos sistemas representan un avance tecnológico significativo en el campo de la accesibilidad, permitiendo una mayor autonomía y participación de este sector de la población en diversos contextos sociales y profesionales. Algunos de los beneficios de estas aplicaciones y su implementación pueden repercutir en tecnologías de reconocimiento automático de lenguas de señas las cuales pueden facilitar la comunicación entre personas sordas y oyentes, reduciendo las barreras lingüísticas y promoviendo la inclusión social, entre otros beneficios.
Book
Full-text available
Los coordinadores estamos seguros de que los lectores encontrarán en esta obra material que, aunque de carácter divulgativo, tendrá la profundidad suficiente para permitir vislumbrar las múltiples implicaciones sociales que traen consigo las modernas tecnologías de la inteligencia artificial. Como hemos señalado, en su conjunto, el contenido de los diversos capítulos va más allá de la dicotomía entre tecnofilia y tecnofobia que, aunque simplifica el discurso, lo vuelve simplista y binario, sin ofrecer los matices que se encuentran en el mundo y la vida real. La visión que aquí se ofrece es matizada también por su origen multidisciplinario. En esta época donde la IA ha cobrado nuevos bríos, no son pocas las publicaciones donde tecnólogos optimistas tratan de justificar a ultranza los usos de la tecnología, ofreciendo una miope visión de los alcances sociales. Tampoco faltan las publicaciones de estudiosos de las Ciencias Sociales que plantean de forma adecuada las implicaciones sociales pero que tienen una óptica distorsionada sobre los alcances reales y el funcionamiento actual de las tecnologías en cuestión. El carácter multidisciplinario, contrastado, del material de este libro lo convierte en rara avis y lo distingue de otras muchas publicaciones que se podrán encontrar en torno a la IA. Invitamos entonces a los lectores a explorar este mundo de contrastes, de posibilidades, de fuentes de inspiración, de importantes advertencias y de potenciales futuros, que ofrece la inteligencia artificial en nuestras sociedades.
Chapter
Full-text available
El trabajo explora cómo las iniciativas de descolonización de las tecnologías digitales pueden transformar la forma en que concebimos la inteligencia artificial. Por ejemplo, al imaginar una interfaz inspirada en la milpa mesoamericana en lugar de la tradicional oficina occidental, planteamos nuevas posibilidades para la innovación y la comunicación.
Article
Full-text available
Deep learning, a state-of-the-art machine learning approach, has shown outstanding performance over traditional machine learning in identifying intricate structures in complex high-dimensional data, especially in the domain of computer vision. The application of deep learning to early detection and automated classification of Alzheimer's disease (AD) has recently gained considerable attention, as rapid progress in neuroimaging techniques has generated large-scale multimodal neuroimaging data. A systematic review of publications using deep learning approaches and neuroimaging data for diagnostic classification of AD was performed. A PubMed and Google Scholar search was used to identify deep learning papers on AD published between January 2013 and July 2018. These papers were reviewed, evaluated, and classified by algorithm and neuroimaging type, and the findings were summarized. Of 16 studies meeting full inclusion criteria, 4 used a combination of deep learning and traditional machine learning approaches, and 12 used only deep learning approaches. The combination of traditional machine learning for classification and stacked auto-encoder (SAE) for feature selection produced accuracies of up to 98.8% for AD classification and 83.7% for prediction of conversion from mild cognitive impairment (MCI), a prodromal stage of AD, to AD. Deep learning approaches, such as convolutional neural network (CNN) or recurrent neural network (RNN), that use neuroimaging data without pre-processing for feature selection have yielded accuracies of up to 96.0% for AD classification and 84.2% for MCI conversion prediction. The best classification performance was obtained when multimodal neuroimaging and fluid biomarkers were combined. Deep learning approaches continue to improve in performance and appear to hold promise for diagnostic classification of AD using multimodal neuroimaging data. AD research that uses deep learning is still evolving, improving performance by incorporating additional hybrid data types, such as—omics data, increasing transparency with explainable approaches that add knowledge of specific disease-related features and mechanisms.
Article
Full-text available
Background: Technology has multiple potential applications to dementia, from diagnosis and assessment to care delivery and supporting ageing in place. Objectives: To summarise key areas of technology development in dementia and identify future directions and implications. Method: Members of the US Alzheimer's Association Technology Professional Interest Area involved in delivering the annual pre-conference summarised existing knowledge on current and future technology developments in dementia. Results: The main domains of technology development are as follows: (i) diagnosis, assessment and monitoring, (ii) maintenance of functioning, (iii) leisure and activity, (iv) caregiving and management. Conclusions: The pace of technology development requires urgent policy, funding and practice change, away from a narrow medical approach, towards a holistic model that facilitates future risk reduction and prevention strategies, enables earlier detection and supports implementation at scale for a meaningful and fulfilling life with dementia.
Article
Full-text available
The differentiation of dementia with Lewy bodies (DLB) from Alzheimer’s disease (AD) using brain perfusion single photon emission tomography is important but challenging, because these conditions exhibit overlapping features. The cingulate island sign (CIS) is the most recently identified feature specific to DLB for differential diagnosis. The current study aimed to examine the usefulness of deep-learning-based image classification for the diagnosis of DLB and AD. Furthermore, we investigated whether the CIS was emphasized by a deep convolutional neural network (CNN) during differentiation. Brain perfusion single photon emission tomography images from 80 patients each with DLB and AD, and 80 individuals with normal cognition (NL), were used for training, with 20 each for final testing. The CNN was trained on brain surface perfusion images. Gradient-weighted class activation mapping (Grad-CAM) was applied to the CNN to visualize the features that were emphasized by the trained network. The binary classifications between DLB and NL, DLB and AD, and AD and NL were 93.1%, 89.3%, and 92.4% accurate, respectively. The CIS ratios closely correlated with the output scores before softmax for DLB–AD discrimination (DLB/AD scores). Grad-CAM highlighted the CIS in the DLB discrimination. Visualization of the learning process by guided Grad-CAM revealed that the CNN focused increasingly on the CIS as training progressed. The DLB/AD score was significantly associated with the three core features of DLB. Deep-learning-based image classification was useful for an objective and accurate differentiation of DLB from AD and for predicting clinical features of DLB. The CIS was identified as a specific feature during DLB classification. The visualization of specific features and learning processes could be critical in deep learning for discovering new imaging features.
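As a rough illustration of the Grad-CAM technique the study applies, the sketch below computes a class-activation heat map from a toy CNN. The network, the three-way DLB/AD/NL head, and the random input standing in for a brain surface perfusion image are all placeholders, not the authors' model.

```python
# Minimal Grad-CAM sketch in PyTorch: gradients of a class score w.r.t. a
# convolutional layer's activations weight those activations into a heat map.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),                # target conv layer
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 3),  # toy DLB/AD/NL head
)
target_layer = model[2]

acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 1, 64, 64, requires_grad=True)   # stand-in perfusion image
score = model(x)[0, 0]                              # score of the class of interest
score.backward()

weights = grads["v"].mean(dim=(2, 3), keepdim=True)          # global-average gradients
cam = F.relu((weights * acts["v"]).sum(dim=1, keepdim=True)) # weighted activation map
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)     # normalise to [0, 1]
print(cam.shape)   # (1, 1, 64, 64): heat map highlighting class-relevant regions
```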
Article
Full-text available
Alzheimer’s disease (AD) is one of the most common neurodegenerative diseases. In the last decade, studies on AD diagnosis have attached great significance to artificial-intelligence-based diagnostic algorithms. Among the diverse modalities of imaging data, T1-weighted MR and FDG-PET are widely used for this task. In this paper, we propose a convolutional neural network (CNN) that integrates all the multi-modality information contained in both T1-MR and FDG-PET images of the hippocampal area for the diagnosis of AD. Unlike traditional machine learning algorithms, this method does not require manually extracted features; instead, it uses 3D image-processing CNNs to learn features for the diagnosis or prognosis of AD. To test the performance of the proposed network, we trained the classifier with paired T1-MR and FDG-PET images from the ADNI datasets, including 731 cognitively unimpaired (CN) subjects, 647 subjects with AD, 441 subjects with stable mild cognitive impairment (sMCI) and 326 subjects with progressive mild cognitive impairment (pMCI). We obtained accuracies of 90.10% for the CN vs. AD task, 87.46% for the CN vs. pMCI task, and 76.90% for the sMCI vs. pMCI task. The proposed framework yields state-of-the-art performance. Finally, the results demonstrate that (1) segmentation is not a prerequisite when using a CNN for classification, and (2) combining the two imaging modalities generates better results.
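The two-branch fusion idea is easy to see in code. Below is a hedged sketch, assuming one small 3D CNN branch per modality whose features are concatenated before a linear classifier; the layer sizes and the 32-voxel hippocampal patch are illustrative guesses, not the paper's architecture.

```python
# Sketch of a two-stream multi-modal 3D CNN: one branch per modality,
# concatenated features, shared classifier. Sizes are illustrative only.
import torch
import torch.nn as nn

class TwoStreamCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        def branch():                       # same architecture for each modality
            return nn.Sequential(
                nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
                nn.Conv3d(8, 16, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            )
        self.mri_branch = branch()          # T1-MR hippocampal patch
        self.pet_branch = branch()          # FDG-PET hippocampal patch
        self.classifier = nn.Linear(16 + 16, n_classes)

    def forward(self, mri, pet):
        fused = torch.cat([self.mri_branch(mri), self.pet_branch(pet)], dim=1)
        return self.classifier(fused)

model = TwoStreamCNN()
mri = torch.randn(4, 1, 32, 32, 32)   # batch of 4 single-channel 3D patches
pet = torch.randn(4, 1, 32, 32, 32)
print(model(mri, pet).shape)          # torch.Size([4, 2]): CN vs. AD logits
```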
Article
Full-text available
The heterogeneity of neurodegenerative diseases is a key confound to disease understanding and treatment development, as study cohorts typically include multiple phenotypes on distinct disease trajectories. Here we introduce a machine-learning technique, Subtype and Stage Inference (SuStaIn), able to uncover data-driven disease phenotypes with distinct temporal progression patterns from widely available cross-sectional patient studies. Results from imaging studies in two neurodegenerative diseases reveal subgroups and their distinct trajectories of regional neurodegeneration. In genetic frontotemporal dementia, SuStaIn identifies genotypes from imaging alone, validating its ability to identify subtypes; further, the technique reveals within-genotype heterogeneity. In Alzheimer’s disease, SuStaIn uncovers three subtypes, uniquely characterising their temporal complexity. SuStaIn provides fine-grained patient stratification, which substantially enhances the ability to predict conversion between diagnostic categories over standard models that ignore subtype (p = 7.18 × 10⁻⁴) or temporal stage (p = 3.96 × 10⁻⁵). SuStaIn offers new promise for enabling disease subtype discovery and precision medicine.
Article
Full-text available
INTRODUCTION: Advanced machine learning methods might help to identify dementia risk from neuroimaging, but their accuracy to date is unclear. METHODS: We systematically reviewed the literature from 2006 to late 2016 for machine learning studies differentiating healthy ageing through to dementia of various types, assessing study quality and comparing accuracy at different disease boundaries. RESULTS: Of 111 relevant studies, most assessed Alzheimer's disease (AD) vs healthy controls, used ADNI data, used support vector machines, and used only T1-weighted sequences. Accuracy was highest for differentiating AD from healthy controls and poor for differentiating healthy controls vs MCI vs AD, or MCI converters vs non-converters. Accuracy increased when data types were combined, but did not vary by data source, sample size or machine learning method. DISCUSSION: Machine learning does not yet differentiate clinically relevant disease categories. More diverse datasets, combinations of different types of data, and close clinical integration of machine learning would help to advance the field.
Article
Full-text available
The Praxis test is a gesture-based diagnostic test that has been accepted as diagnostically indicative of cortical pathologies such as Alzheimer's disease. Despite being simple, this test is often skipped by clinicians. In this paper, we propose a novel framework to investigate the potential of static and dynamic upper-body gestures based on the Praxis test, and their potential in a medical framework, to automate the test procedures for computer-assisted cognitive assessment of older adults. To carry out gesture recognition as well as correctness assessment of the performances, we collected a novel, challenging RGB-D gesture video dataset recorded with Kinect v2, which contains 29 specific gestures suggested by clinicians, recorded from both experts and patients performing the gesture set. Moreover, we propose a framework to learn the dynamics of upper-body gestures, treating the videos as sequences of short-term clips of gestures. Our approach first uses body-part detection to extract image patches surrounding the hands and then, by means of a fine-tuned convolutional neural network (CNN) model, learns deep hand features which are then fed to a long short-term memory (LSTM) network to capture the temporal dependencies between video frames. We report the results of four developed methods using different modalities. The experiments show the effectiveness of our deep-learning-based approach in the gesture recognition and performance assessment tasks. Clinicians' satisfaction with the assessment reports indicates the framework's potential impact on diagnosis.
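A minimal sketch of the CNN-to-LSTM pattern described above: a per-frame CNN extracts hand-patch features, and an LSTM models the temporal dependencies across a clip. The toy CNN stands in for the fine-tuned model; only the 29-class output mirrors a detail reported in the abstract, and all other sizes are assumptions.

```python
# Sketch of a CNN feature extractor linked to an LSTM for gesture clips.
import torch
import torch.nn as nn

class GestureNet(nn.Module):
    def __init__(self, n_gestures=29, feat_dim=64, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(              # per-frame hand-patch features
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_gestures)

    def forward(self, clips):                  # clips: (batch, time, C, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)  # frame features
        _, (h, _) = self.lstm(feats)           # last hidden state summarises clip
        return self.head(h[-1])

model = GestureNet()
clips = torch.randn(2, 16, 3, 64, 64)          # 2 clips of 16 RGB hand patches
print(model(clips).shape)                      # torch.Size([2, 29]) gesture logits
```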
Article
Some forms of mild cognitive impairment (MCI) are the clinical precursors of Alzheimer's disease (AD), while other MCI types tend to remain stable over time and do not progress to AD. To identify and choose effective and personalized strategies to prevent or slow the progression of AD, we need to develop objective measures that are able to discriminate the MCI patients who are at risk of AD from those MCI patients who have less risk of developing AD. Here, we present a novel deep learning architecture, based on dual learning and an ad hoc layer for 3D separable convolutions, which aims at identifying MCI patients who have a high likelihood of developing AD within 3 years. Our deep learning procedures combine structural magnetic resonance imaging (MRI), demographic, neuropsychological, and APOe4 genetic data as input measures. The most novel characteristics of our machine learning model compared to previous ones are the following: 1) our deep learning model is multi-tasking, in the sense that it jointly learns to simultaneously predict both MCI-to-AD conversion and AD vs. healthy-control classification, which facilitates relevant feature extraction for AD prognostication; 2) the neural network classifier employs fewer parameters than other deep learning architectures, which significantly limits data overfitting (we use ∼550,000 network parameters, orders of magnitude fewer than other network designs); 3) both structural MRI images and their warp-field characteristics, which quantify local volumetric changes in relation to the MRI template, were used as separate input streams to extract as much information as possible from the MRI data. All analyses were performed on a subset of the database made publicly available via the Alzheimer's Disease Neuroimaging Initiative (ADNI) (n = 785 participants: n = 192 AD patients, n = 409 MCI patients (including both MCI patients who convert to AD and MCI patients who do not), and n = 184 healthy controls). The most predictive combination of inputs was the structural MRI images together with the demographic, neuropsychological, and APOe4 data. In contrast, the warp-field metrics were of little added predictive value. The algorithm was able to distinguish the MCI patients developing AD within 3 years from those patients with stable MCI over the same time period with an area under the curve (AUC) of 0.925 and a 10-fold cross-validated accuracy of 86%, a sensitivity of 87.5%, and a specificity of 85%. To our knowledge, this is the highest performance achieved so far using similar datasets. The same network provided an AUC of 1 and 100% accuracy, sensitivity, and specificity when classifying patients with AD versus healthy controls. Our classification framework was also robust to the use of different co-registration templates and potentially irrelevant features/image portions. Our approach is flexible and can in principle integrate other imaging modalities, such as PET, and diverse other sets of clinical data. The convolutional framework is potentially applicable to any 3D image dataset and gives the flexibility to design a computer-aided diagnosis system targeting the prediction of several medical conditions and neuropsychiatric disorders via multi-modal imaging and tabular clinical data.
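Two of the abstract's design choices, depthwise-separable 3D convolutions (to keep the parameter count low) and a multi-task head fed by fused imaging and tabular streams, can be sketched as follows. All layer sizes, the 10 tabular features, and the patch dimensions are assumptions for illustration, not the authors' ∼550,000-parameter design.

```python
# Sketch of a depthwise-separable 3D convolution and a multi-task classifier
# over fused MRI and tabular (demographic/neuropsychological/APOe4) inputs.
import torch
import torch.nn as nn

class SeparableConv3d(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Spatial filtering per channel, then cheap 1x1x1 channel mixing.
        self.depthwise = nn.Conv3d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv3d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class MultiTaskNet(nn.Module):
    def __init__(self, n_tabular=10):
        super().__init__()
        self.image_stream = nn.Sequential(
            SeparableConv3d(1, 8), nn.ReLU(), nn.MaxPool3d(2),
            SeparableConv3d(8, 16), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.tabular_stream = nn.Sequential(nn.Linear(n_tabular, 16), nn.ReLU())
        self.conversion_head = nn.Linear(32, 1)   # MCI converter vs. stable
        self.diagnosis_head = nn.Linear(32, 1)    # AD vs. healthy control

    def forward(self, mri, tabular):
        z = torch.cat([self.image_stream(mri), self.tabular_stream(tabular)], dim=1)
        return self.conversion_head(z), self.diagnosis_head(z)

model = MultiTaskNet()
mri, tab = torch.randn(4, 1, 32, 32, 32), torch.randn(4, 10)
conv_logit, dx_logit = model(mri, tab)
print(conv_logit.shape, dx_logit.shape)   # two heads trained with a joint loss
```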
Conference Paper
Engagement in dementia is typically measured using behavior observational scales (BOS) that are tedious and involve intensive manual labor to annotate, and are therefore not easily scalable. We propose AVEID, a low-cost and easy-to-use video-based engagement measurement tool to determine the engagement level of a person with dementia (PwD) during digital interaction. We show that the objective behavioral measures computed via AVEID correlate well with subjective expert impressions for the popular MPES and OME BOS, confirming its viability and effectiveness. Moreover, AVEID measures can be obtained for a variety of engagement designs, thereby facilitating large-scale studies with PwD populations.