
Recognition Of Atypical Behavior In Autism Diagnosis From Video Using Pose Estimation Over Time

Authors:
Kathan Vyas1, Rui Ma1, Behnaz Rezaei1, Shuangjun Liu1, Michael Neubauer2, Thomas Ploetz2,
Ronald Oberleitner2, Sarah Ostadabbas1
1Augmented Cognition Lab (ACLab), Northeastern University, Boston, MA, USA
2Behavior Imaging Solutions, Boise, Idaho, USA
ABSTRACT
Autism spectrum disorder (ASD), like many other developmental and behavioral conditions, is difficult to diagnose precisely. The difficulty increases when the subjects are young children, because ASD symptoms overlap heavily with typical behaviors of young children. It is therefore important to develop reliable methods that can help distinguish ASD-related behaviors from normal behaviors of children. In this paper, we implemented a computer vision based automatic ASD prediction approach to detect autistic characteristics in a video dataset recorded from a mix of children with and without ASD. Our target dataset contains 555 videos, from which 8349 episodes (each approximately 10 seconds long) are derived. Each episode is labeled as atypical or typical behavior by medical experts. We first estimate the children's pose in each video frame by re-training a state-of-the-art human pose estimator on our manually annotated children pose dataset. Particle filter interpolation is then applied to the output of the pose estimator to predict the locations of missing body keypoints. For each episode, we compute the children's motion pattern, defined as the trajectory of their keypoints over time, by temporally encoding the estimated pose maps. Finally, a binary classification network is trained on the pose motion representations to discriminate between typical and atypical behaviors. We achieve a classification accuracy of 72.4% (precision = 0.72 and recall = 0.92) on our test dataset.
1. INTRODUCTION
Autism spectrum disorder (ASD) is a developmental disorder that affects communication and typical behaviors of a person from a very early age. According to the diagnostic and statistical manual of mental disorders (DSM-5) [1], a guide created by the American Psychiatric Association and used to diagnose mental disorders, people with ASD have: (1) difficulty in communication and interaction; (2) restricted interests and repetitive behaviors; and (3) symptoms that hurt their ability to function properly in school, work, and other areas of life. Since ASD is developmental, a patient's condition tends to deteriorate with increasing age, so early detection is important for improving the person's life. Detecting ASD in its preliminary stages is difficult, primarily because ASD symptoms overlap with the natural behavior of a child on many fronts. This is evident when we look at the prevalence of ASD among children across the United States: the rate of ASD diagnosis varies significantly between states, from 5.7 to 21.9 per 1000 eight-year-old children [2]. Given such alarming numbers, there is a pressing need for predictive algorithms that can detect ASD behaviors in young children based on the abnormal behavioral patterns described in the DSM-5. Moreover, for any of these algorithms to hold practical value in predicting ASD in a larger population, they have to be both cost-efficient and unobtrusive, so that they can be deployed in a child's natural living environment for long-term behavioral monitoring.
1.1. Related Work
Over the last couple of decades, research has been growing in the field of ASD behavior recognition using machine learning and computer vision. These works focus on understanding the behavior of patients, especially children of various ages, in a number of ways, including analyzing their home videos. In [3], the authors studied infants by extracting markers based on the Autism Observation Scale for Infants (AOSI). In [4], the authors quantified facial expressions of ASD patients and compared them with IQ-matched healthy controls, showing how ASD patients fail to use certain facial muscles when expressing their emotions.
In another study, the authors analyzed behavioral interactions between children with ASD and their caregivers/peers using automatically captured video data [5]. Children with ASD have shown impairment in their ability to produce and perceive dynamic facial expressions, which may contribute to their social deficits [6]. Researchers have kept this fact in mind to focus on various parts of the face, including the eyes. In [7], the authors proposed an algorithm for fine-grained classification of developmental disorders via measurements of individuals' eye movements using multimodal visual data (including RGB and eye-tracking images).
Fig. 1. An overview of our ASD behavior detection approach. The first step is to apply the re-trained pose estimation algorithm
on each frame of our video dataset. The resultant pose data may contain some missing keypoints. In order to fill these missing
values, a non-linear state estimation technique is applied on the pose data. The output of the interpolation is then fed to the
PoTion generator which uses 300 frames to provide one PoTion image. Finally, the PoTion images are used as the inputs to
our typical vs. atypical (i.e. ASD) behavior classification network.
All of the above-mentioned works used different computer vision techniques; however, none of them focused on studying full-body pose movement over time as an indicator of atypical behaviors. Most of this research analyzes RGB frames, extracting high-level information from the video or focusing on specific body parts of children (e.g., the face). In contrast, our focus in this paper is on extracting the body pose, which is an interpretable, appearance-agnostic, yet low-level latent representation of the video frames, and on tracking the temporal changes in the pose to understand the behavior of the child.
1.2. Our Contributions
In this paper, we present an automatic ASD behavior prediction approach that uses computer vision and signal processing algorithms to classify behaviors of children based on the display of ASD symptoms. This work is based on a collaboration with the Behavior Imaging company [8]. We applied our proposed algorithm to 555 of their videos collected from children with and without ASD, in which each video was annotated by several expert physicians.
The major contributions of this paper are as follows: (1) re-training a pre-trained pose estimation network to make it more robust in identifying children's poses; (2) using a non-linear state estimation technique to predict the locations of missing keypoints; (3) extracting behavioral features from keypoint trajectories over a short time frame; and (4) developing a binary classifier that uses the behavioral features to discriminate between atypical and typical characteristics. The code for our proposed method will be released on our website.
2. MATERIALS AND METHODS
Our primary goal in this work is to distinguish between typical and atypical behaviors of children based on their motion feature representation. The dataset used in our study contains videos of the children in their home environment; details on the dataset are presented in Section 2.1. As shown in Fig. 1, we implemented a 4-step approach that takes video frames as input and labels them as either typical or atypical behavior.
Fig. 2. Behavior Imaging company's remote autism assessment platform, called NODA [8].
We started by splitting videos into frames and then used a re-trained pose estimation network to identify human keypoints in each frame. 15 keypoints were identified across the body, as explained in Section 2.2. After applying a non-linear interpolation algorithm (i.e., a particle filter) to fill in the missing keypoints, we transformed the keypoint information over a fixed period of time (300 frames) into a “PoTion Representation” [9], which is described in Section 2.3. In Section 2.4, we train a classification network on the PoTion representations to classify them as either typical (i.e., normal) or atypical (i.e., ASD) behavior.
2.1. Our Video Dataset
The video dataset used in our experiments comes from the NODA program of the Behavior Imaging company [8]. NODA is an evidence-based autism diagnostic service platform that has been brought to market to address some of the challenges of early ASD diagnosis. Fig. 2 presents the preliminary process of developing the video dataset. Videos of a child's daily activities at home were recorded by parents and then sent to the Behavior Imaging company to be studied by expert physicians. Common categories of daily activities include “play alone”, “play with others”, “meal time”, and “parents' concerns”. The annotated dataset provided to us consists of four sets of videos, denoted by set#1 to set#4, respectively. Video lengths vary from 2 minutes to 10 minutes. For each video, physicians tagged several time stamps where they observed typical or atypical behavior of the child. Details of our dataset are shown in Table 1.
Fig. 3. Architecture of our pose estimator adopted from 2D
Mask R-CNN network [10]. It can perform multi-person pose
estimation by identifying 15 keypoints on each human body
in the input image.
2.2. Body Pose Estimation in Children
For each typical or atypical time stamp in a video (provided by ASD experts), we take 8 seconds before and 2 seconds after that time stamp to construct a 10-second episode. The reason is that physicians usually labeled a typical or atypical time stamp after observing the behavior for a while, so a 10-second window approximately captures the behavior in full. All of our videos are recorded at 30 frames per second, so the length of an episode is at most 301 frames (including one frame at the time stamp), or fewer if a time stamp is labeled at the very start or end of the video.
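As a concrete illustration of this windowing, the following sketch (our own, not part of the paper's released code) computes the frame range of one episode around an expert time stamp, clipping at the video boundaries; the frame rate and window lengths follow the text.

```python
def episode_frame_range(stamp_frame, total_frames, fps=30,
                        seconds_before=8, seconds_after=2):
    """Return (start, end) frame indices (inclusive) of the episode
    around an expert-labeled time stamp, clipped to the video length."""
    start = max(0, stamp_frame - seconds_before * fps)
    end = min(total_frames - 1, stamp_frame + seconds_after * fps)
    return start, end

# Example: a stamp at frame 4500 in a 9000-frame video yields a
# 301-frame episode (240 frames before + the stamp frame + 60 after).
start, end = episode_frame_range(4500, 9000)
assert end - start + 1 == 301
```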
We adopted a state-of-the-art human pose estimation network, the 2D Mask R-CNN [10], re-training it on our manually annotated children pose dataset. We then ran the re-trained pose estimator on each frame of every 10-second episode to identify the human keypoints in it. The pose estimator is capable of detecting multiple people in a single image and identifying 15 body keypoints for each detected person: upper-head, nose, upper-neck, left/right shoulders, left/right elbows, left/right wrists, left/right hips, left/right knees, and left/right ankles. The pose estimator extends the standard architecture of Mask R-CNN [11], as shown in Fig. 3.
ResNet is used as the base network to extract low-level feature maps from an input image. Next, a region proposal network (RPN) generates a number of regions of interest (RoIs) from these feature maps. RoI Align operations are then applied to refine the RoIs. The refined RoIs are fed into classification and regression networks that perform bounding box detection: for every input RoI, the regression network computes the coordinates of the upper left and lower right corners of one bounding box, while the classification network classifies the bounding box as “human” or “background”. In addition to Mask R-CNN's basic architecture, the pose estimator replaces the instance segmentation network with a keypoint estimation network, running in parallel to the classification and regression networks.
Table 1. Statistics of our NODA video dataset.

                              Set#1   Set#2   Set#3   Set#4   Total
  No. of Video Clips            235     101     177      42     555
  Total No. of Time Stamps     4465    2048    1545     291    8349
  No. of Typical Stamps        1749     681     529     125    3084
  No. of Atypical Stamps       2716    1367    1016     166    5265
Fig. 4. An illustration of our updated version of the PoTion Representation, first introduced in [9]. We use only three colors, whose intensity varies over time, to denote different parts of the body: red represents all keypoints above the shoulders, including the nose, top of the head, and neck; green represents the shoulders, elbows, and wrists; and the keypoints below the abdomen, including the hips, knees, and ankles, are represented in blue.
The keypoint estimation network predicts the coordinates of the 15 keypoints for every detected person, and the pose of a person can be visualized by connecting these 15 keypoints, as shown at the output of the keypoint estimation network in Fig. 3. After the pose estimation step, in order to fill in the keypoints missed due to estimation failures, we apply a non-linear state estimation technique. Details of the re-training and state estimation steps are provided in Section 3.
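For readers who want to reproduce the per-frame inference step, a minimal sketch using torchvision's off-the-shelf keypoint R-CNN is shown below. Note the assumptions: this pre-trained model predicts the 17 COCO keypoints rather than the 15-keypoint configuration described here, it is not the Detect-and-Track variant of [10], and it has not been re-trained on children; it only illustrates how pose keypoints are extracted frame by frame.

```python
import torch
from torchvision.models.detection import keypointrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

# Off-the-shelf COCO keypoint R-CNN (17 keypoints), used here only as an
# illustration of the per-frame inference step described in the text.
model = keypointrcnn_resnet50_fpn(pretrained=True).eval()

@torch.no_grad()
def detect_people(frame_rgb, score_threshold=0.7):
    """Run keypoint R-CNN on one RGB frame (H x W x 3, uint8) and return the
    keypoints of detections classified as 'person' above the score threshold."""
    out = model([to_tensor(frame_rgb)])[0]
    keep = out["scores"] > score_threshold
    return out["keypoints"][keep]  # shape (num_people, 17, 3): x, y, visibility
```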
2.3. PoTion Representation
After filling in the missing keypoints, we transform the pose information in each episode into the Pose Motion (PoTion) Representation, a representation that gracefully encodes the movements of human keypoints [9]. In this work, we changed the colorization step of the original PoTion Representation so that the final output is one RGB image for all keypoints instead of one RGB image per keypoint. As illustrated in Fig. 4, a Gaussian kernel is first placed around each keypoint to obtain a keypoint heatmap. The Gaussian kernel has a fixed size of 12×12 with a variance of 1.0. Next, we colorize every heatmap according to the body part it belongs to and the relative time of the frame within the episode. The main idea is to colorize the heatmaps of the upper-head, nose, and upper-neck in red, the left/right shoulders, elbows, and wrists in green, and the left/right hips, knees, and ankles in blue, while the color intensity goes from weak to strong from the first frame to the last frame of the episode. The last step is to aggregate these colorized heatmaps over time to obtain the trajectories of all keypoints as a single RGB image, which is resized to 256×256×3.
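A minimal NumPy sketch of this modified colorization is given below. The kernel size, image size, and the three color groups follow the text; the linear intensity ramp and the per-pixel maximum used for temporal aggregation are our assumptions (the original PoTion [9] describes several aggregation schemes).

```python
import numpy as np

# Keypoint indices grouped as in the text: red = head, green = arms, blue = legs
# (index order of the 15 keypoints is assumed).
GROUPS = {0: [0, 1, 2], 1: [3, 4, 5, 6, 7, 8], 2: [9, 10, 11, 12, 13, 14]}

def potion_image(episode_keypoints, height=256, width=256, sigma=1.0, ksize=12):
    """episode_keypoints: (T, 15, 2) array of (x, y) pixel coordinates already
    scaled to the output resolution. Returns one (H, W, 3) PoTion image."""
    T = len(episode_keypoints)
    img = np.zeros((height, width, 3), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for t, pose in enumerate(episode_keypoints):
        weight = (t + 1) / T  # intensity grows from first to last frame (assumed linear)
        for channel, kp_ids in GROUPS.items():
            for k in kp_ids:
                x, y = pose[k]
                # Gaussian heatmap around the keypoint, truncated to a 12x12 window.
                window = (np.abs(xs - x) <= ksize / 2) & (np.abs(ys - y) <= ksize / 2)
                heat = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2)) * window
                # Temporal aggregation (per-pixel maximum is an assumption here).
                img[..., channel] = np.maximum(img[..., channel], weight * heat)
    return img
```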
Fig. 5. Architecture of the CNN classifier. It consists of a convolution layer of 64 filters with size 3×3, stride of 1, and same padding; a max-pooling layer with 2×2 filters and stride of 2; followed by a flatten layer and a fully-connected layer with 64 neurons; and a sigmoid operation to produce the labels.
One of the markers that physicians paid attention to while tagging a video is the child's interaction with surrounding people. In order to properly capture the interaction between the child and other people, we include the pose information of all detected people in a frame in the PoTion Representation, instead of only the pose of the child.
2.4. Behavior Classification Network
After obtaining the PoTion Representations, we train a simple convolutional neural network (CNN) to classify them as either typical or atypical behavior. The architecture of the employed CNN is shown in Fig. 5. The CNN consists of a convolution layer and a max-pooling layer, followed by a flatten layer and a fully-connected layer, and then a sigmoid function that produces the class label. We split the whole dataset into 80% training and 20% test. We used the Adam optimizer with a fixed learning rate of 0.01. In order to prevent our model from over-fitting, we adopted early stopping as well as the well-known 5-fold cross-validation during training. Multiple performance metrics, including accuracy, precision, recall, and AUC, were used to evaluate the classification performance. AUC is an important metric in binary classification tasks: it is the area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate and reflects the ability of a model to distinguish the positive and negative classes. AUC takes values between 0 and 1, and the higher the AUC, the better the model performs in the classification task. All modeling results are evaluated in Section 4.
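A Keras-style sketch of this classifier is given below, matching Fig. 5 (64 filters of 3×3, 2×2 max pooling, flatten, 64-unit dense layer, sigmoid) and the stated optimizer and learning rate. The activation functions of the intermediate layers and the final 1-unit output layer are assumptions where the paper is silent.

```python
import tensorflow as tf

def build_potion_classifier():
    """Sketch of the binary behavior classifier in Fig. 5; the ReLU activations
    and the final 1-unit sigmoid output layer are assumptions."""
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 3, strides=1, padding="same",
                               activation="relu", input_shape=(256, 256, 3)),
        tf.keras.layers.MaxPooling2D(pool_size=2, strides=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # typical vs. atypical
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
                  loss="binary_crossentropy",
                  metrics=["accuracy",
                           tf.keras.metrics.Precision(),
                           tf.keras.metrics.Recall(),
                           tf.keras.metrics.AUC()])
    return model

# Training sketch with early stopping on a held-out split:
# model = build_potion_classifier()
# model.fit(train_potions, train_labels, validation_split=0.2,
#           callbacks=[tf.keras.callbacks.EarlyStopping(patience=5)])
```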
3. DETAILS ON ALGORITHM IMPROVEMENTS
The PoTion Representation encodes the movements of human keypoints into colorized trajectories, visualizing the behavior of a child within a short time period. However, these trajectories may be non-consecutive if the pose estimator fails to detect the keypoints in some of the input frames. In this section, we address this issue in two ways: by re-training the human pose estimator on the target dataset, and by applying interpolation techniques to predict the locations of missing keypoints.
3.1. Re-training Pose Estimation Network
The Mask R-CNN pose estimator is initialized from ImageNet weights, pre-trained on the COCO keypoint detection task, and fine-tuned on the PoseTrack dataset [10]. Although Mask R-CNN has achieved state-of-the-art pose estimation performance on these datasets, it provides poor results on our video dataset, where the subjects are children rather than adults. The reason is that children have different body proportions and skeleton structures compared to adults, while very few children are present in the training sets of the Mask R-CNN. As a result, the pose estimator tends to ignore the presence of children by classifying the bounding boxes containing children as “background”.
In order to improve the performance of the pose estimator, we took 985 sample frames containing children from our video dataset and manually annotated human keypoints on these frames. We developed an improved version of the semi-automatic annotation toolbox provided in [12]. With the help of this toolbox, we can obtain the coordinates of the 15 keypoints for each person in an image by clicking on the corresponding body parts. The toolbox also supports bounding box annotation, from which we obtain the upper left and lower right coordinates of the head and body bounding boxes. We converted the keypoint and bounding box coordinates to the JSON format of the PoseTrack dataset [13] and then re-trained the keypoint estimation network of the pose estimator using these annotated frames. The effectiveness of the re-training step is evaluated in Section 4.
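To make the conversion step concrete, a rough sketch of serializing one frame's manual annotations is shown below. The field names here are simplified and hypothetical; the actual PoseTrack JSON schema [13] uses its own keys (e.g., per-keypoint visibility flags) and should be consulted for the exact format.

```python
import json

def to_annotation_record(image_path, people):
    """Serialize manual annotations for one frame.

    `people` is a list of dicts with hypothetical keys:
      'keypoints' -> list of 15 (x, y) tuples in image coordinates
      'head_bbox' -> (x1, y1, x2, y2)
      'body_bbox' -> (x1, y1, x2, y2)
    This is only a simplified sketch, not the exact PoseTrack schema.
    """
    return {
        "image": image_path,
        "annorect": [
            {
                "keypoints": [[float(x), float(y)] for x, y in p["keypoints"]],
                "head_bbox": list(p["head_bbox"]),
                "body_bbox": list(p["body_bbox"]),
            }
            for p in people
        ],
    }

# json.dump([to_annotation_record("frame_0001.jpg", people), ...],
#           open("annotations.json", "w"))
```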
3.2. Non-linear Interpolation using Particle Filter
After re-training, the pose estimation network became more robust in identifying children and their poses. However, there may still be several frames in an episode that were not annotated with a pose due to various reasons, including blurring, shading, and other image quality issues. In order to obtain a complete set of keypoint trajectories, we implement a particle filter as a non-linear interpolation approach to predict the coordinates of the missing keypoints in these frames [14]. The particle filter is a sequential Monte Carlo method based on point mass (or “particle”) representations of probability densities. It generalizes the classical Kalman filtering methods and can be applied to any state-space model [14]. Consider a single episode as an example. We count the number of frames that are not annotated (i.e., no people were detected in these frames) in the episode and denote it as M, while the total number of frames in the episode is denoted as N. If M is more than 90% of N, we discard the episode completely; otherwise, we apply the particle filter interpolation.
The particle filter interpolation is initialized by generating
a set of Gaussian distributed particles over the whole frame.
Fig. 6. The implementation of our interpolation step: (a) groundtruth trajectory of one keypoint across 301 frames, (b) the output trajectory of the keypoint from the pose estimator, (c) the output of linear interpolation used to fill in the missing keypoints, and (d) the output after the particle filter interpolation.
In our case, we considered 300 particles, a value chosen as a trade-off between speed and accuracy. We assigned four landmarks at the four corners of the frame. The distances of a keypoint $i$ from these landmarks are taken as the measurements $Z$ for each iteration, such that

$Z = \sqrt{(y_L - y_i)^2 + (x_L - x_i)^2}$,

where $x_L$ and $y_L$ are the coordinates of one of the landmarks and $x_i$ and $y_i$ are the coordinates of keypoint $i$.

We use the combination of the distance $D$ and orientation $O$ of keypoint $i$ between one frame $j$ and the following frame $j+1$ as the state variables during each iteration:

$D = \sqrt{(y_i^{j+1} - y_i^{j})^2 + (x_i^{j+1} - x_i^{j})^2}$   (1)

$O = \arctan\left(\dfrac{y_i^{j+1} - y_i^{j}}{x_i^{j+1} - x_i^{j}}\right)$   (2)
The whole particle filter process consists of prediction and update steps applied repeatedly over the N iterations, with Z, D, and O updated after every iteration. After the initialization step, we update the positions of the particles according to the state variables (D and O). We then compare the distance between each particle and the landmarks with Z, and assign a weight to each particle based on this comparison, giving more weight to particles that are closer to the keypoint. This iterative process continues until we obtain the positions of the particles in the last frame. At the end, we take the mean of each particle distribution as the expected value of the corresponding keypoint.
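A minimal sketch of such an interpolation loop for a single keypoint is given below. Several simplifications are assumptions on our part: the prediction step diffuses particles with Gaussian noise instead of propagating an explicit (D, O) motion state, the noise scales are arbitrary, and resampling uses the standard multinomial scheme [14]; only the landmark-distance measurement and the particle-mean estimate follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_track(observed_xy, frame_shape, n_particles=300,
                          motion_noise=3.0, meas_noise=5.0):
    """Impute missing (NaN) keypoint coordinates across an episode.

    observed_xy: (N, 2) array of (x, y); rows with NaN mark frames where the
    pose estimator failed. Returns an (N, 2) array with every frame filled in.
    """
    h, w = frame_shape
    landmarks = np.array([[0, 0], [0, w], [h, 0], [h, w]], dtype=float)  # corners (y, x)
    particles = np.column_stack([rng.uniform(0, w, n_particles),
                                 rng.uniform(0, h, n_particles)])        # (x, y)
    out = np.empty_like(observed_xy)
    for t, obs in enumerate(observed_xy):
        # Prediction: diffuse particles (simplified motion model, see note above).
        particles += rng.normal(0, motion_noise, particles.shape)
        if not np.any(np.isnan(obs)):
            # Measurement Z: distances from the observed keypoint to the 4 landmarks.
            z = np.linalg.norm(landmarks - obs[::-1], axis=1)
            d = np.linalg.norm(landmarks[None, :, :] - particles[:, None, ::-1], axis=2)
            # Weight particles by how well their landmark distances match Z, then resample.
            weights = np.exp(-np.sum((d - z) ** 2, axis=1) / (2 * meas_noise ** 2)) + 1e-12
            weights /= weights.sum()
            particles = particles[rng.choice(n_particles, n_particles, p=weights)]
        # Estimate: keep the observation when present, otherwise the particle mean.
        out[t] = obs if not np.any(np.isnan(obs)) else particles.mean(axis=0)
    return out
```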
The output of the particle filter interpolation step is an episode without missing keypoints, ready to be used as the input to the PoTion step. The effect of the particle filter interpolation is shown in Fig. 6, where we compare the keypoint trajectories under three different methods, namely no interpolation, linear interpolation, and our particle filter interpolation, against the groundtruth trajectory. The trajectory imputed by the particle filter interpolation is the closest to the groundtruth. We also compare the performance of the behavior classification model under these three interpolation methods in Section 4.
Table 2. Percentage (%) of annotated frames after applying pose estimation, before and after re-training.

              Set#1   Set#2   Set#3   Set#4
  Before       80.6    83.6    80.0    87.5
  After        94.6    96.7    94.6    96.3
4. EXPERIMENTAL RESULTS
In this section, we evaluate the performance of our typical vs. atypical behavior detection method on our video dataset under different model configurations. We first evaluated the effectiveness of the re-training step by running the Mask R-CNN pose estimator, with and without re-training, on the set#1 to set#4 videos, and then compared the percentage of frames that were successfully annotated with a human pose. As can be seen from Table 2, re-training the pose estimator results in an increase of roughly 9 to 14 percentage points in the fraction of annotated frames.
Table 3. Performance evaluation of the behavior classification network under pre-processing conditions of no missing pose imputation (None), linear interpolation (Linear), and particle filter imputation.

  Interpolation      Accuracy   Precision   Recall   AUC
  None                   0.72        0.72     0.93   0.64
  Linear                 0.72        0.72     0.92   0.66
  Particle Filter        0.72        0.72     0.93   0.66
We evaluated the effect of the particle filter interpolation by comparing it with two other imputation settings: no keypoint interpolation (None) and linear interpolation (Linear). Table 3 compares the performance metrics of our classification model on the test data under these pre-processing conditions; all metrics in the table are obtained after the pose estimator is re-trained. The classifier benefits from interpolation, particularly in terms of AUC, and the particle filter interpolation achieves the best overall performance.
We also compared the accuracy of ASD vs. normal behavior classification between our proposed pose estimation-based approach and a conventional video classification approach that takes RGB images instead of pose information as input. We implemented a modified version of the single-frame approach proposed in [15]: we split each episode into frames, labeled these frames as typical or atypical according to the class of the episode, and fine-tuned an Inception V3 network as a classifier using these labeled frames. Then, for each episode, we predicted the classes of all its frames using this classifier and performed a majority vote to obtain the class of the episode. The performance of this modified single-frame approach and of our proposed pose estimation-based approach are compared in Table 4. As the results show, our approach outperforms the conventional video classification approach.
Table 4. Comparison of the modified single-frame approach and our proposed pose estimation-based approach.

  Approach                       Accuracy   Precision   Recall   AUC
  Modified single-frame [15]         0.66        0.70     0.77   0.62
  Our pose estimation-based          0.72        0.72     0.93   0.66
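For reference, a rough sketch of this single-frame baseline is shown below: per-frame predictions from a fine-tuned classifier are combined by majority voting over the episode. The classification head, frozen layers, and label convention are assumptions, since the text does not specify these fine-tuning details.

```python
import numpy as np
import tensorflow as tf

def build_single_frame_classifier():
    """Fine-tune Inception V3 for per-frame typical/atypical prediction.
    The 1-unit sigmoid head added on top of the backbone is an assumption."""
    base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet",
                                             input_shape=(299, 299, 3), pooling="avg")
    head = tf.keras.layers.Dense(1, activation="sigmoid")(base.output)
    return tf.keras.Model(base.input, head)

def classify_episode(model, frames):
    """Majority vote over per-frame predictions for one episode.
    `frames` are assumed already resized to 299x299 and preprocessed."""
    probs = model.predict(np.stack(frames), verbose=0).ravel()
    return int((probs > 0.5).mean() > 0.5)  # 1 = atypical, 0 = typical (assumed convention)
```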
5. CONCLUSION AND FUTURE WORK
In this paper, we presented a behavior classification approach that uses low-level latent information, in the form of body pose over time, to categorize a given video clip as typical (normal) or atypical (ASD) behavior. We started by re-training a state-of-the-art pose estimator on our video dataset to obtain accurate keypoint estimates for children; this was done using a semi-automatic annotation tool, which enables researchers to create their own annotated datasets. We then used particle filter interpolation to fill in the missing keypoints after pose estimation. Using the extracted keypoints, we transformed the approximately 301 frames of each episode into a PoTion representation, forming a single RGB image that captures the changes in 15 whole-body keypoints across the 10 seconds of an episode. We used the PoTion images as inputs to our binary classifier for the final behavior classification, and achieved better performance compared to a conventional video classification method.
Our primary objective in developing this approach was to use the movements of body keypoints as an indicator of a person's behavior. The present approach can be improved by combining RGB and audio modalities in order to obtain more accurate results. As future work, we plan to develop a new type of PoTion Representation that works with 3D trajectories and adds more spatial information to the classification pipeline.
6. REFERENCES
[1] American Psychiatric Association, Diagnostic and Statistical Manual of Mental Disorders (5th ed.), American Psychiatric Association, 2013.
[2] Jon Baio, “Morbidity and mortality weekly report. Surveillance summaries,” Epidemiology Program Office, Centers for Disease Control and Prevention, vol. 63, no. 2, 2014.
[3] Jordan Hashemi, Thiago Vallin Spina, Mariano Tepper, Amy Esler, Vassilios Morellas, Nikolaos Papanikolopoulos, and Guillermo Sapiro, “A computer vision approach for the assessment of autism-related behavioral markers,” 2012 IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL), pp. 1–7, 2012.
[4] Michael L Spezio, Ralph Adolphs, Robert SE Hurley, and Joseph Piven, “Abnormal use of facial information in high-functioning autism,” Journal of Autism and Developmental Disorders, vol. 37, pp. 929–939, 2007.
[5] James Rehg, “Behavior imaging: Using computer vision to study autism,” MVA, vol. 11, pp. 14–21, 2011.
[6] David Deriso, Joshua Susskind, Lauren Krieger, and Marian Bartlett, “Emotion mirror: a novel intervention for autism based on real-time expression recognition,” European Conference on Computer Vision, pp. 671–674, 2012.
[7] Guido Pusiol, Andre Esteva, Scott S Hall, Michael Frank, Arnold Milstein, and Li Fei-Fei, “Vision-based classification of developmental disorders using eye-movements,” International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 317–325, 2016.
[8] “Behavior Imaging – Health & Education Assessment Technology,” https://behaviorimaging.com/, Accessed: 2019.
[9] Vasileios Choutas, Philippe Weinzaepfel, Jérôme Revaud, and Cordelia Schmid, “PoTion: Pose motion representation for action recognition,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7024–7033, 2018.
[10] Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, and Du Tran, “Detect-and-track: Efficient pose estimation in videos,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 350–359, 2018.
[11] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick, “Mask R-CNN,” 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, 2017.
[12] Shuangjun Liu and Sarah Ostadabbas, “A semi-supervised data augmentation approach using 3D graphical engines,” European Conference on Computer Vision, pp. 395–408, 2018.
[13] Mykhaylo Andriluka, Umar Iqbal, Eldar Insafutdinov, Leonid Pishchulin, Anton Milan, Juergen Gall, and Bernt Schiele, “PoseTrack: A benchmark for human pose estimation and tracking,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5167–5176, 2018.
[14] M Sanjeev Arulampalam, Simon Maskell, Neil Gordon, and Tim Clapp, “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking,” IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.
[15] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei, “Large-scale video classification with convolutional neural networks,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732, 2014.
The early detection of developmental disorders is key to child outcome, allowing interventions to be initiated that promote development and improve prognosis. Research on autism spectrum disorder (ASD) suggests behavioral markers can be observed late in the first year of life. Many of these studies involved extensive frame-by-frame video observation and analysis of a child's natural behavior. Although non-intrusive, these methods are extremely time-intensive and require a high level of observer training; thus, they are impractical for clinical purposes. Diagnostic measures for ASD are available for infants but are only accurate when used by specialists experienced in early diagnosis. This work is a first milestone in a long-term multidisciplinary project that aims at helping clinicians and general practitioners accomplish this early detection/measurement task automatically. We focus on providing computer vision tools to measure and identify ASD behavioral markers based on components of the Autism Observation Scale for Infants (AOSI). In particular, we develop algorithms to measure three critical AOSI activities that assess visual attention. We augment these AOSI activities with an additional test that analyzes asymmetrical patterns in unsupported gait. The first set of algorithms involves assessing head motion by facial feature tracking, while the gait analysis relies on joint foreground segmentation and 2D body pose estimation in video. We show results that provide insightful knowledge to augment the clinician's behavioral observations obtained from real in-clinic assessments.