Multi-Level Feature Fusion for Robust Pose-Based
Gait Recognition using RNN
Md Mahedi Hasan
Institute of Information and Communication Technology
Bangladesh University of Engineering and Technology
Dhaka, Bangladesh
mahedi0803@gmail.com
Hossen Asiful Mustafa
Institute of Information and Communication Technology
Bangladesh University of Engineering and Technology
Dhaka, Bangladesh
hossen_mustafa@iict.buet.ac.bd
Abstract—Recognizing individual people from gait is still a challenging problem in computer vision research due to the presence of various covariate factors such as varying view angle, change in clothing, walking speed, and load carriage. Most of the earlier works are based on human silhouettes, which have proven to be efficient for recognition but are not invariant to changes in illumination and clothing. In this research, to address this problem, we present a simple yet effective approach for robust gait recognition using a recurrent neural network (RNN). Our RNN with GRU architecture is very powerful in capturing the temporal dynamics of the human body pose sequence and performing recognition. We also design a low-dimensional gait feature descriptor based on the 2D coordinates of human pose information, which proves to be not only invariant to various covariate factors but also effective in representing the dynamics of various gait patterns. The experimental results on the challenging CASIA A and CASIA B gait datasets demonstrate that the proposed method achieves state-of-the-art performance on both single-view and cross-view gait recognition, which proves the effectiveness of our method.
Keywords—gait recognition; biometrics; visual surveillance; RNN
I. INTRODUCTION
Biometrics refers to the automatic identification or authentication of people by analyzing their physiological and behavioral characteristics. Physiological biometrics relates to the shape of body parts such as the face, fingerprints, hand geometry, iris, and retina, which are not subject to change due to aging. It is currently used as the most stable means of authenticating and identifying people in a reliable way. However, for efficient and accurate authentication, these traits require cooperation from the subject along with a comprehensively controlled environmental setup. Hence, these traits are not useful in surveillance systems. Behavioral biometrics, such as signatures, gestures, gait, and voice, relates to a person's behavior. However, these traits are more prone to change depending on factors such as aging, injuries, or even mood.
Gait recognition is a behavioral biometric modality that identifies a person based on that person's walking posture. In contrast to other biometrics such as face and fingerprint, it is a non-invasive technique for identifying an individual and is hard to copy. A unique advantage of gait as a biometric is that it offers recognition at a distance and from low-resolution images; consequently, the gait biometric signature is now considered the only likely identification technology suitable for access control, covert video surveillance, criminal investigation, and forensic analysis. Moreover, the method is not vulnerable to spoofing attacks and signature forgery.
Due to the advantages of gait recognition, the past two decades have witnessed significant improvements in its algorithms. However, there still exist many challenges that need to be addressed for robust gait recognition. It has been observed that the performance of gait recognition is highly affected by different intraclass variations in people's appearance, such as clothing and carrying variation, and in the environment, such as variations in illumination, walking surface, and view angle. These factors can drastically reduce the performance of gait recognition. Traditional appearance-based methods in gait recognition are sensitive to these covariate factors since the extraction of human silhouettes is affected by changes in lighting. Moreover, when the shape and appearance of the human body change substantially, the performance of appearance-based methods severely degrades. Therefore, these methods are not completely robust toward these covariate changes.
In contrast, model-based gait recognition exploits features based on the shape of human body parts and the dynamics of the motion of each of these parts. The salient advantage of the model-based approach is that, as opposed to silhouette-based approaches, it can efficiently handle many covariate changes such as view angle, body appearance, and shape, and therefore shows robustness toward these variations. However, due to the heavy computational cost required to model the human body accurately, these methods have not been as popular as appearance-based ones.
Modern deep learning-based algorithms have recently gained increasing popularity while achieving outstanding performance in many computer vision tasks such as person re-identification, pose estimation, and action recognition. Furthermore, advancements in human body pose estimation can significantly assist in accurately modeling the different body parts required for model-based gait recognition. On the other hand, recurrent neural networks (RNNs) have also achieved promising performance in many sequence labeling tasks. The reason behind their effectiveness for sequence-based tasks lies in their ability to capture long-range dependencies in a temporal context from a sequence. RNNs have been successfully employed to achieve state-of-the-art results in many vision-based tasks like human emotion detection and action recognition.
In this work, we propose a model-based gait recognition method in which we consider human 2D pose data as our effective gait features. Body pose has been shown not to depend on body appearance and shape, and it is invariant to changes in clothing and carrying conditions. Additionally, as gait can be considered a time series of walking postures, body pose information has a powerful capacity to capture the temporal pattern of gait. Therefore, the proposed method is less affected by the variation of covariate factors. It is also worth mentioning that, in this work, we did not use 3D pose data as our gait feature: firstly, computing 3D poses is computationally expensive, and secondly, most 3D pose estimation algorithms that work from 2D RGB images require multiple views, and hence multiple cameras, rendering the technique unsuitable for surveillance. Moreover, recovering 3D pose from a single RGB image is an ill-posed problem and often causes large pose estimation errors.
Compared to other gait covariates, view is the most important factor severely affecting gait recognition performance. To handle view variation efficiently, gait algorithms have generally been studied under three experimental setups: single-view, multi-view, and cross-view. In single-view gait recognition, both probe and gallery gaits are kept within the same view angle, whereas in cross-view gait recognition, the probe and gallery gaits are kept in different views; in multi-view gait recognition, multiple views of gallery gaits are combined to recognize a probe gait under a specific view.
Thus, the key to our proposed method is to develop a pose-
based deep recurrent neural network for robust gait recognition
by modeling the temporal dynamics associated with human gait.
Most of the descriptors proposed in the literature for gait
recognition often lead to a high dimensional feature space, which
can be computationally expensive to map. In this work, we
designed a lower dimensional spatio-temporal feature descriptor
from 2D pose estimation for improved performance at a reduced
computational cost. Our gait descriptor is a concatenation of four
different kinds of features which are robust to view variation. We
demonstrate the effectiveness of our proposed method through
extensive experiments on two public benchmark datasets: the
CASIA A and CASIA B gait datasets [1]. Our method achieved
state-of-the-art performance on these two challenging gait
datasets in both single-view and cross-view recognition,
providing better results as compared to other methods proposed
in the literature.
The main contributions of our work are summarized as
follows:
• We propose a novel RNN network with GRU
architecture and devise several strategies to effectively
train the network for robust gait recognition.
• The proposed pose-based RNN network achieves the best results on two challenging benchmark datasets, CASIA A and CASIA B, outperforming other prevailing methods in single-view gait recognition by a significant margin.
• We consider the 2D coordinates of body pose to design a novel gait feature descriptor which is invariant to covariate factors and achieves performance comparable to methods which require calculating the gait energy image (GEI) or expensive 3D poses for their gait descriptors.
The remaining part of this paper is organized as follows: Section II reviews the existing literature on gait recognition, while Section III briefly describes our proposed network along with the steps regarding data collection, preprocessing, and network architecture for gait recognition. Experiments and evaluation are presented in Section IV. Finally, we summarize our results in Section V.
II. RELATED WORKS
Over the last two decades, several methods have been studied to develop a robust gait recognition system [2]. However, robust recognition is still challenging due to the presence of large intraclass variations in a person's gait, which substantially degrade performance. In this section, we briefly discuss the literature on the two categories of existing gait recognition techniques: appearance-based and model-based methods. Next, we review some of the recent deep learning-based gait recognition approaches which are closely related to our work.
A. Appearance-based Methods
Most of the previous work following this approach [3-5] used human silhouette masks as the main source of information and extracted features that show how these masks change. The most popular gait representation employed in such work is the gait energy image (GEI) [3], a binary mask computed by aligning and averaging the silhouettes over the complete gait cycle. Though there are many alternatives to GEI, e.g., the gait entropy image (GEnI) [4] and the gait flow image (GFI) [5], GEI has been considered the most stable gait feature due to its insensitivity to incidental silhouette errors. It can achieve good performance under controlled and cooperative environments but does not show robustness when the view angle and clothing condition change.
In order to reduce the drastic change in the shape of the GEI, Huang et al. [6] fused two new gait representations, the shifted energy image and the gait structural profile, to increase robustness to some classes of structural variations. However, the performance of this method is not good enough due to the loss of temporal information while calculating the GEI. In [7], GaitSet was proposed, where a gait is regarded as a set of independent frames rather than a template or sequence. Though it handles cross-view conditions very well, it is not good enough at handling cross-carrying and cross-clothing conditions.
B. Model-based Methods
Model-based methods [8-11] are based on the extraction and modeling of the human body structure as well as the local movement patterns of its parts. Therefore, this approach is often built with a structural model and a motion model to capture both the static and the dynamic information of gait. For example, in [8], Yam et al. developed an automated model-based approach to recognize people by their walking as well as running gait by analyzing the leg motion. They used the biomechanics of human locomotion and coupled oscillators, and employed a bilateral symmetric and an analytical model to successfully extract the leg motion. Ariyanto et al. [9] employed a structural model including articulated cylinders, fitting the 3D volumetric subject data at each joint to model the lower legs. In [10], the authors presented a model-based approach in which they captured the discriminatory features of gait by analyzing the leg and arm movements. For recognition, they used a K-nearest neighbor classifier and the Fourier components of the joint angles.
Model-based approaches are generally invariant to various intraclass variations such as clothing, carrying, and view angle variations. However, the main drawback of this approach is the extraction process of body parameters such as height, knee, and torso, which is computationally expensive and highly dependent on the quality of the video.
C. Deep Learning for Gait Recognition
Due to their powerful feature learning abilities, convolutional neural networks (CNNs) have achieved great success in object recognition tasks in recent years. Several CNN-based gait recognition methods [12-17] have been proposed which can automatically learn robust gait features from the given training samples. Additionally, using CNNs, we can now execute feature extraction and perform recognition within a single framework using the training samples. Wu et al. [12] performed cross-view gait recognition by developing a three-convolutional-layer network using the subject's GEI as input. Shiraga et al. [13] designed an eight-layer CNN, GEINet, which consists of two sequential triplets of convolution, pooling, and normalization layers followed by two fully connected layers, for large-scale gait recognition on the OU-ISIR database.
In [14], Wolf et al. used 3D convolutions for multi-view gait recognition by capturing spatio-temporal features from raw images and optical flow information. A Siamese neural network-based gait recognition system was developed in [15], where the GEI was fed as input. In [16], Yu et al. used generative adversarial nets to design a feature extractor in order to learn invariant features. In [17], they further improved the GAN-based method by adopting a multi-loss strategy to optimize the network, increasing the interclass distance and reducing the intraclass distance at the same time.
D. Pose-based Gait Recognition
In recent years, there has been a huge interest in the study of
deep learning-based approaches for the task of real-time pose
estimation from images and video. The task of pose estimation mainly involves localizing the keypoints of the human figure to estimate the locations of different body parts. It can broadly be classified into two categories: single-person and multi-person pose estimation. To recognize multi-person poses, Cao et al. [18]
developed a deep CNN-based regression method to estimate the
association between anatomical parts in the image. Their bottom-
up method achieved state-of-the-art performance on multiple
benchmark datasets. In this work, we employed their pretrained
model to get an accurate 2D pose estimation on our experimental
dataset.
With the advent of the pose-estimation algorithms in
computer vision, the recognition of human gait based on pose
information has received much more attention [11, 19, 20] due
to its effective representation of gait features and robustness
toward covariate condition variations. Feng et al. [11] used the human body joint heatmap to describe each frame. They fed the joint heatmaps of consecutive frames into a long short-term memory (LSTM) network. Their gait features are the hidden activation values of the last timestep. In [19], Liao et al. constructed a pose-based temporal-spatial network (PTSN) to extract the spatial-temporal features of gait from 2D human pose information. The authors in [20] employed 3D pose estimation in their PoseGait network to extract spatial-temporal gait features and achieved better performance compared with 2D pose estimation.
Again, some of the most successful approaches for human
action recognition employ RNNs [21, 22] to effectively model
the temporal sequences of human skeleton data. Song et al. [21]
proposed an end-to-end spatial and temporal attention model
with LSTM for human action recognition from skeleton data. In
[22], Du et al. proposed an end-to-end hierarchical RNN network
for skeleton-based action recognition. They divided the human
skeleton into five different parts and then separately fed them
into five sub-networks.
Our approach to gait recognition is similar to these
approaches. In this study, we have proposed a simple RNN
architecture that effectively models the discriminative gait
features in a temporal domain.
III. PROPOSED METHOD
In this section, we discuss the proposed framework and its main components in detail.
A. Overview
The workflow of our proposed network is illustrated in Figure 1. Several strategies have been adopted to design and train the network for robust recognition. Firstly, we designed a novel spatio-temporal gait feature descriptor based on the 2D human poses estimated from raw video frames using the improved OpenPose [18] algorithm. Thereafter, the gait descriptors were fed into a 2-layer BiGRU network which models the descriptors to recognize the subject ID.
B. Forming Spatio-Temporal Features
1) 2D Body Joints
Not every joint of the human body plays a significant role in the gait pattern, so not all joints improve gait recognition accuracy; some joints even degrade it. Therefore, among the 25 body joints estimated by the OpenPose algorithm, we selected those joints which have a rich and discriminative gait representation capacity. Cunado et al. [23] used a human leg-based model, as they found that the motion of the human leg contains the most important features for gait recognition. In our study, we found that the knee, along with the joints located in the feet, shows more robustness than any other body joints because they do not change while people walk in a coat or carry bags. On the other hand, some joints, e.g., the hip joints, appear wider in a coat than under normal conditions. Again, in some gait videos, subjects put their hands into their coat pockets, which they cannot do in normal walking; this situation significantly changes the joint coordinates. Moreover, joints above the hip do not have any significant impact on the gait pattern. Hence, we do not consider the hip or any body joints above it.
Figure 1. Overview of the proposed framework for gait recognition. 2D human poses were first extracted from raw video frames using the improved OpenPose [18] algorithm. Four different types of spatial-temporal features were then extracted to form a 50-dimensional feature vector. Thereafter, a sequence of timesteps, each 28 frames long, was formed and fed into a temporal network. The temporal network identified the subject by modeling the gait features.
Consequently, in our work, as shown in Figure 2(a), we selected 6 body joints (RKnee, RAnkle, RBigToe, LKnee, LAnkle, LBigToe) to form our effective pose features. Thus, we have a 12-dimensional pose feature vector, f_pose, for a single frame:

f_pose = (x_1, y_1, x_2, y_2, …, x_6, y_6)    (1)

where (x_i, y_i) are the normalized 2D coordinates of the ith selected joint.
Figure 2. Extraction process of the different features of the proposed method. a) 6 effective joints were selected out of the 25 body joints estimated by the pose estimation algorithm [18]; the selected joints form a 12-dimensional pose vector. b) 5 angular trajectories from the lower limbs were considered to form a joint-angle feature vector. c) A total of 8 body joints were selected to obtain the temporal displacement feature vector. d) 7 body parts were taken to form a limb length feature vector.
It is necessary to normalize the pose sequence data with regard to the subject's position in the frame, size, and walking speed to get improved performance. In different gait datasets, as people walk past the fixed camera, the size of the subject's body alters because the distance between the subject and the camera changes. In our study, to find the origin of the coordinate system (J_c) for each subject, we took the average of the right, left, and middle hip joints. Again, to normalize the bodies of different subjects to a fixed size, we took h, the Euclidean distance from the hip to the neck joint, as the unit length. Equation (2) describes the normalization procedure of the joint coordinates:

J_c = (J_LHip + J_RHip + J_MHip) / 3
h = ‖ J_c − J_Neck ‖_2    (2)
J_i^N = (J_i − J_c) / h

Here, J_i^N is the new coordinate of the ith joint J_i of a particular pose.
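For illustration, a minimal NumPy sketch of this normalization, assuming the pose is a (25, 2) array indexed by OpenPose joint IDs (the index constants below are assumed for illustration), might look like:

```python
import numpy as np

# Assumed OpenPose BODY_25 joint indices (illustrative only)
NECK, MHIP, RHIP, LHIP = 1, 8, 9, 12

def normalize_pose(joints):
    """Normalize a (25, 2) array of 2D joint coordinates per Eq. (2)."""
    jc = (joints[LHIP] + joints[RHIP] + joints[MHIP]) / 3.0   # origin of the coordinate system
    h = np.linalg.norm(jc - joints[NECK])                     # hip-to-neck distance as unit length
    return (joints - jc) / h
```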
Figure 3. Examples of 2D human pose estimation by [18] from RGB images of the CASIA dataset (left). The detected 25 human body joints with their descriptions are shown (right).
TABLE I. LIST OF SELECTED JOINT-ANGLE TRAJECTORIES WITH CORRESPONDING BODY JOINT SETS USED TO FORM THE JOINT ANGULAR FEATURE VECTOR

| Angular Trajectory | Body joint set |
|---|---|
| Hip trajectory | 10, 8, 13 |
| Right knee trajectory | 11, 10, 9 |
| Left knee trajectory | 14, 13, 12 |
| Right ankle trajectory | 22, 11, 10 |
| Left ankle trajectory | 19, 14, 13 |
2) Joint Angular Trajectory
The dynamics of human gait motion can be expressed by the temporal information of joint angles. Hence, discriminative gait features can be found by considering the change in the joint-angle trajectories of the lower limbs [24]. Therefore, in this study, we formulated a 15-dimensional feature vector f_trajectory by considering five lower-limb joint-angle trajectories using the following equations:

θ = arccos( ((J_1 − J_2) · (J_3 − J_2)) / (‖J_1 − J_2‖_2 ‖J_3 − J_2‖_2) )    (3)

As shown in Figure 2(b), J_1, J_2, and J_3 are the joints that form a set of angular trajectory (θ), with J_2 as the vertex. In this work, we considered five sets of angular trajectories from the lower limb of the human body. Table I lists the selected angular trajectories with their corresponding body joints. For each trajectory, we took the angle together with its sine and cosine as gait features:

f_trajectory = (θ_1, sin θ_1, cos θ_1, …, θ_5, sin θ_5, cos θ_5)    (4)
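A minimal sketch of this computation follows; taking the angle, its sine, and its cosine per trajectory is our reading of the 15-dimensional layout (three values for each of the five trajectories), so treat the per-trajectory triple as an assumption:

```python
import numpy as np

# Joint triplets from Table I; the vertex of each angle is the middle joint
TRAJECTORIES = [(10, 8, 13), (11, 10, 9), (14, 13, 12), (22, 11, 10), (19, 14, 13)]

def joint_angle_features(joints):
    """Compute a 15-dim joint-angle feature vector from a (25, 2) pose array."""
    feats = []
    for j1, j2, j3 in TRAJECTORIES:
        v1, v2 = joints[j1] - joints[j2], joints[j3] - joints[j2]
        cos_t = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
        theta = np.arccos(np.clip(cos_t, -1.0, 1.0))      # Eq. (3)
        feats += [theta, np.sin(theta), np.cos(theta)]     # assumed per-trajectory triple
    return np.array(feats)
```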
3) Temporal Displacement
Our third type of feature is a simple descriptor that preserves temporal information. It stores the local motion features of gait by keeping the displacement information between two adjacent frames of the pose sequence. The displacement of each coordinate of a joint is normalized by the total length of the displacements of all joints for that particular frame. Let t and (t+1) be two adjacent frames. The displacement information of the coordinates of any joint at frame t is the normalized difference between the corresponding coordinates:

d_i^t = (J_i^{t+1} − J_i^t) / Σ_j ‖ J_j^{t+1} − J_j^t ‖_2    (5)

Here, J_i^t is the 2D coordinate of the ith body joint at the tth frame of the video and d_i^t is the displacement of the coordinates of the ith joint at the tth frame. As shown in Figure 2(c), we selected 8 joints (Neck, MHip, RKnee, RAnkle, RBigToe, LKnee, LAnkle, LBigToe) to get a 16-dimensional feature vector, f_displacement.
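A minimal sketch of this displacement descriptor, assuming the same (25, 2) pose arrays and hypothetical BODY_25 indices for the eight selected joints:

```python
import numpy as np

# Assumed BODY_25 indices: Neck, MHip, RKnee, RAnkle, RBigToe, LKnee, LAnkle, LBigToe
DISP_JOINTS = [1, 8, 10, 11, 22, 13, 14, 19]

def displacement_features(joints_t, joints_t1):
    """16-dim normalized displacement between two adjacent (25, 2) poses, per Eq. (5)."""
    diff = joints_t1[DISP_JOINTS] - joints_t[DISP_JOINTS]     # (8, 2) raw displacements
    total = np.sum(np.linalg.norm(diff, axis=1)) + 1e-8       # total displacement length
    return (diff / total).ravel()                             # flatten to 16 values
```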
4) Body Part Length Features
Static gait parameters, for example the lengths of body parts calculated from the joint positions, are also important for gait recognition [24, 25]. They form a spatial gait feature vector, which makes them robust against covariates such as carrying and clothing variation. In this work, we took seven body parts (Figure 2(d)), namely the lengths of the two legs, two feet, and two thighs and the width of the shoulders, which formed a 7-dimensional spatial feature vector f_body-part.
5) Fusion of Features
A lot of research has been done on fusing multiple features to obtain improved performance [20, 24]. Different types of fusion methods have been proposed in the literature, such as feature-level fusion, representation-level fusion, and score-level fusion. In feature-level fusion, multiple features of the same frame are concatenated before being fed into a final network; in representation-level fusion, each feature vector is first fed into a network and the resulting global representations are then concatenated to train a final classifier. In score-level fusion, each feature vector is separately fed into the final network, which predicts a classification score; the scores from the multiple classifiers are then fused using an arithmetic mean.
In this experiment, we found that feature-level fusion produced better recognition results than the other fusion techniques or the individual feature sets.
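As an illustration of feature-level fusion (the variant we adopt), a minimal sketch that concatenates the four per-frame descriptors into the 50-dimensional vector could be:

```python
import numpy as np

def fuse_frame_features(f_pose, f_trajectory, f_displacement, f_body_part):
    """Feature-level fusion: concatenate the 12 + 15 + 16 + 7 = 50 per-frame values."""
    p = np.concatenate([f_pose, f_trajectory, f_displacement, f_body_part])
    assert p.shape == (50,)
    return p
```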
C. Feature Preprocessing
From the 2D pose estimation algorithm [18], we obtained 25 body joints for each frame (Figure 3). We took several preprocessing steps to address the problem of missing data due to occlusion. The main strategies are:
• If the origin of the coordinate system cannot be calculated due to missing hip joints, the frame is rejected.
• If more than one body joint is missing between the knee and ankle joints of both legs, the frame is rejected because it carries too little information.
• In other cases, if an individual joint was not located in the frame, a position of (0.0, 0.0) was assigned to that joint.
These strategies are simple, require no extra computation, and proved effective in addressing the missing data problem; a minimal sketch is given below. Thus, in this work, we designed a 50-dimensional spatio-temporal gait feature vector p from the raw 2D pose estimation of each frame.
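The sketch of these filtering rules below assumes missing joints are marked as NaN and uses hypothetical BODY_25 index constants; the exact counting rule for the lower-limb joints is our interpretation of the second strategy:

```python
import numpy as np

HIPS = [8, 9, 12]                 # MHip, RHip, LHip (assumed indices)
KNEE_TO_ANKLE = [10, 11, 13, 14]  # RKnee, RAnkle, LKnee, LAnkle (assumed indices)

def filter_frame(joints):
    """Return cleaned (25, 2) joints, or None if the frame should be rejected."""
    missing = np.isnan(joints).any(axis=1)
    if missing[HIPS].any():              # origin of the coordinate system cannot be computed
        return None
    if missing[KNEE_TO_ANKLE].sum() > 1: # too little lower-limb information
        return None
    joints = joints.copy()
    joints[missing] = 0.0                # fill remaining missing joints with (0.0, 0.0)
    return joints
```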
We split a gait video into 28-frame segments. Each 28-frame segment forms a timestep, which can be described by Equation (6). Here, p is the 50-dimensional pose vector for each frame, T is the feature matrix for each timestep, N is the total number of timesteps, and V is the sequence of features for a gait video.

T = [p_1, p_2, …, p_28],   V = {T_1, T_2, …, T_N}    (6)

In the CASIA dataset, gait videos of different subjects have varying numbers of timesteps. The number of timesteps in each gait video depends on the total number of frames in which a person is detected. Due to the position of the camera, some angles (0°, 18°, 36°) have more person-detected frames than other angles (72°, 90°, 108°). Therefore, the total number of timesteps in a gait video differs between subjects and viewing angles. This varying number of timesteps makes our training dataset unbalanced. Again, in the CASIA B dataset, not all subjects have all gait videos; some gait videos are missing. To solve this problem, we developed our own balanced training set by making each subject's pose sequence have a fixed number of timesteps. We first found the subject with the maximum number of timesteps for a particular gait angle and then augmented the other subjects' timesteps to that specific length by overlapping their sequences.
D. Data Augmentation
The performance of deep neural networks is strongly correlated with the amount of available training data. Although CASIA is among the largest gait datasets [1], the standard experimental setup of this dataset (see Table IV) allows us to train on only the four normal walking sequences of each subject. Therefore, we need to augment our training data to obtain a stable model.
One way to increase the amount of training data is to overlap the video clips. Thus, we split the input video into overlapping sequences of video clips. For every 28-frame clip, we overlapped 24 frames of the previous clip, i.e., an overlapping rate of almost 85.7%. For example, a particular gait video of 100 frames would be split into the clips (1-28), (5-32), (9-36), ..., up to frames (73-100).
In addition to the above technique, we further augmented our training data by adding another gait sequence (i.e., a 25% increment) obtained by applying Gaussian noise to a given normal walking sequence:

J̃_i = J_i + (ε_x, ε_y)    (7)

Here, ε_x and ε_y are two random real numbers generated from a normal distribution with zero mean and unit standard deviation. We apply this noising to the raw joint positions of a training pose sequence.
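A minimal sketch of both augmentations; the stride of 4 follows the 28/24 overlap described above, while drawing independent noise per joint coordinate is our reading of Eq. (7):

```python
import numpy as np

def overlapping_clips(frames, clip_len=28, stride=4):
    """Split a sequence of per-frame feature vectors into overlapping 28-frame clips."""
    return [frames[s:s + clip_len]
            for s in range(0, len(frames) - clip_len + 1, stride)]

def add_pose_noise(joints, rng=np.random.default_rng()):
    """Add zero-mean, unit-variance Gaussian noise to raw 2D joint positions (Eq. 7)."""
    return joints + rng.standard_normal(joints.shape)
```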
Figure 4. Proposed RNN architecture for robust gait recognition. It consists of two BiGRU [26] layers, each with 80 GRU cells, together with batch normalization and an output softmax layer. The network was fed with a 50-dimensional spatio-temporal feature vector obtained from 2D pose estimation. The input layer was followed by a batch normalization layer [27]. The output of the recurrent layers was also batch normalized to standardize the activations and finally fed into an output softmax layer. For the output layer, the number of output neurons equals the number of subjects.
TABLE II. HYPERPARAMETER SETTINGS OF OUR PROPOSED NETWORK.

| Hyper-parameter | Value |
|---|---|
| Optimizer | Adam [28] |
| Objective function | Fusion of softmax and center loss |
| Epochs | 450 |
| Mini-batch size | 256 |
| Initial learning rate | 1 × 10⁻³ |
E. Network Architecture
In this research, we experimented with different RNN architectures such as Gated Recurrent Units (GRUs), Long Short-Term Memory units (LSTMs), and Bidirectional Gated Recurrent Units (BiGRUs) [26]. Firstly, we designed the proposed network employing each of these architectures with one recurrent layer and then searched for the optimum recurrent unit size between 50 and 150. Thereafter, we increased the capacity of the network by adding a second and a third layer of hidden units. Finally, we found that, among the different RNN architectures, a 2-layer BiGRU with 80 hidden units performs best.
After the input and the second recurrent layer, we placed a batch normalization (BN) [27] layer. Finally, a fully connected layer with softmax activation was used to predict the subject classes. Figure 4 illustrates the architecture of the proposed network.
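A minimal Keras sketch of this architecture under our reading of Figure 4 (timestep length 28, feature dimension 50); any layer arguments beyond those stated in the text are assumptions:

```python
import tensorflow as tf

def build_model(num_subjects, timestep_len=28, feat_dim=50, units=80):
    """2-layer BiGRU classifier with batch normalization, following Figure 4 (sketch)."""
    inputs = tf.keras.Input(shape=(timestep_len, feat_dim))
    x = tf.keras.layers.BatchNormalization()(inputs)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(units, return_sequences=True))(x)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(units))(x)
    x = tf.keras.layers.BatchNormalization()(x)
    outputs = tf.keras.layers.Dense(num_subjects, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```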
F. Training
Training the RNN allows us to learn the parameters from the sequences. We employed the Adam [28] optimization algorithm with β₁ = 0.9 and β₂ = 0.999. We tried several learning rates in our experiments and found that the best initial learning rate is 1 × 10⁻³. We also reduced the learning rate by a constant factor whenever it hit a plateau; reducing the learning rate allows the optimizer to escape plateaus in the loss surface. Table II summarizes all the hyperparameter settings of our network.
The proposed network was trained with a batch size of 256 for 450 epochs. Our network showed some overfitting, mostly due to the high learning capacity of the network relative to the data. This overfitting problem was addressed by adding a batch normalization layer.
We also tried adding a dropout layer during training, but it did not help reduce the overfitting problem; moreover, it degraded gait recognition performance. Hence, we omitted it.
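Continuing the previous sketch, a training setup consistent with Table II might look like the following; the plateau-based reduction is expressed with a standard Keras callback, and the reduction factor, patience, and dummy data are assumptions (this sketch also uses plain cross-entropy; the full objective adds the center loss discussed next):

```python
import numpy as np
import tensorflow as tf

# Placeholder data shaped like our timesteps (stand-ins for the real CASIA features)
train_x = np.random.rand(1024, 28, 50).astype("float32")
train_y = np.random.randint(0, 62, size=1024)

model = build_model(num_subjects=62)   # from the previous sketch
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.999),
    loss="sparse_categorical_crossentropy", metrics=["accuracy"])

reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="loss", factor=0.5, patience=10)
model.fit(train_x, train_y, batch_size=256, epochs=450, callbacks=[reduce_lr])
```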
G. Loss Functions
In this work, we found that, due to the influence of various covariate factors, the intraclass distance for one subject is sometimes larger than the interclass distance. If we only use the cross-entropy loss as our objective function, the resulting learned features may contain large intraclass variations. Therefore, to effectively reduce the intraclass distance, we used the center loss introduced by Wen et al. [29] for the face recognition task. As training progresses, the center loss learns a center for the features of each class, and the distances between the features and their corresponding class centers are minimized simultaneously. However, using only the center loss may drive the learned features and centers close to zero due to the very small value of the center loss. Hence, with the fusion of the softmax loss (L_s) and the center loss (L_c), we can achieve discriminative feature learning by increasing interclass dispersion and compacting the intraclass distance as much as possible:

L_s = − Σ_{i=1..m} log( exp(W_{y_i}^T x_i + b_{y_i}) / Σ_{j=1..n} exp(W_j^T x_i + b_j) )
L_c = (1/2) Σ_{i=1..m} ‖ x_i − c_{y_i} ‖_2²    (8)
L = L_s + λ L_c + β ‖W‖_2

Equation (8) describes the calculation of the total loss L of our network, where x_i denotes the learned features of the ith pose sequence, which belongs to class y_i, and c_{y_i} denotes the y_i-th class center of the learned pose features. W and b are the weights and bias term of the last fully connected layer. The batch size and the class number are m and n, respectively. λ, a scalar, is set to 0.01 to balance the two loss functions. ‖W‖_2 refers to the kernel regularizer over the network parameters, with the weight decay coefficient β set to 0.0005 for the experiment.
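A minimal TensorFlow sketch of the combined objective, assuming the features are the activations feeding the final softmax layer and the class centers are kept in a trainable variable (all tensor shapes are illustrative):

```python
import tensorflow as tf

def total_loss(features, logits, labels, centers, lam=0.01):
    """Softmax cross-entropy plus lambda-weighted center loss (Eq. 8, sketch).

    features: (m, d) activations before the softmax layer
    logits:   (m, n) unnormalized class scores
    labels:   (m,) integer class IDs
    centers:  (n, d) tf.Variable holding one center per class
    """
    ls = tf.reduce_sum(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    lc = 0.5 * tf.reduce_sum(tf.square(features - tf.gather(centers, labels)))
    return ls + lam * lc
```

In practice the class centers are updated each mini-batch as in [29], and the ‖W‖_2 weight decay term is supplied separately, e.g., through the layers' kernel regularizers.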
H. Post-processing
During training, our proposed network considers each of these video clips as a separate video (see Figure 5). For a given video, the prediction of our model is a sequence of class probabilities, one for each timestep, i.e., for each 28-frame clip.
However, at test time we actually need the subject ID for the complete gait video. Therefore, we used a majority voting scheme to process this output and predict the subject ID. In this scheme, the subject that receives the highest number of votes over all timesteps of a gait video is taken as the predicted class.

Figure 5. Output prediction scheme of our proposed temporal network. Each input clip was considered as a separate video, and a sequence of class probabilities was predicted at the output. A majority voting scheme was used to process the output and predict the subject ID.

Let s be the set of n subjects. For a particular timestep t of a gait video, the input feature matrix T_t yields an n-dimensional output vector y_t:

y_t = ( P(s_1 | T_t), P(s_2 | T_t), …, P(s_n | T_t) )    (9)

Here, P(s_i | T_t) is the probability that the input feature matrix T_t belongs to class s_i. We assign the output class ŝ_t of timestep t to the subject s_i with the maximum probability. As each gait video is divided into a series of timesteps (see Equation 6), we obtain the subject ID using the majority voting scheme described by the following equation:

ŝ_t = argmax_{s_i} P(s_i | T_t),   ŝ = argmax_{s_i} Σ_{t=1..N} 1[ŝ_t = s_i]    (10)

Here, N is the total number of timesteps into which a gait video is split and ŝ is the final predicted class.
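A minimal NumPy sketch of this voting scheme:

```python
import numpy as np

def predict_subject(clip_probs):
    """Majority voting over per-timestep class probabilities.

    clip_probs: (N, n) array, one row of class probabilities per 28-frame clip.
    Returns the index of the winning subject class.
    """
    votes = np.argmax(clip_probs, axis=1)   # per-timestep predicted class (Eq. 9)
    return np.bincount(votes).argmax()      # class with the most votes (Eq. 10)
```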
IV. EXPERIMENTAL EVALUATION
This section briefly discusses the datasets we used to train and evaluate our model and the results our proposed algorithm achieved under different experimental setups. Since RGB video frames are required to estimate pose, we could not evaluate our method on datasets that consist only of silhouette sequences.
A. Datasets
The success of deep learning-based methods greatly depends on a vast amount of labeled training data. Unfortunately, few existing gait databases contain a large number of subjects as well as a variety of covariate factors. Some of the publicly available gait databases are the CASIA gait dataset [1], the TUM GAID dataset [30], the OU-ISIR multi-view large population dataset (OU-MVLP) [31], and the USF HumanID dataset [32].

Figure 6. Sample video frames of the CASIA A and CASIA B datasets. At the top, some sample images from the CASIA A dataset are shown, where subjects walk along a straight line at 3 different view angles; at the bottom, the CASIA B dataset is shown with its 11 view angles.
In the USF HumanID gait dataset, there are 122 subjects walking outdoors on two different surfaces along an elliptical path under two different time, viewpoint, clothing, shoe, and carrying conditions. However, not all subjects were filmed under all conditions. The TUM GAID dataset is another large dataset for gait recognition, consisting of 305 subjects, each with 10 videos. But this dataset is not suitable for multi-view gait recognition, as all the videos were recorded from the side view. The largest dataset available for gait recognition is the OU-ISIR multi-view large population dataset (OU-MVLP). It contains 10,307 subjects captured from 14 viewing angles ranging over 0°-90° and 180°-270°. Only two sequences are provided per subject, one for the gallery and the other for the probe. However, this dataset is distributed only as a set of silhouette sequences, making it unsuitable for our approach.
In this study, we used the CASIA dataset (both CASIA A and CASIA B), which is one of the largest multi-view gait databases. The CASIA A dataset contains a total of 20 subjects walking in an outdoor environment, whereas the CASIA B dataset includes a total of 124 subjects walking in an indoor environment. In the CASIA A gait dataset, each subject walks along a straight line at 3 different view angles: lateral (0°), oblique (45°), and frontal (90°). For each viewing angle, every subject has four gait sequences, two of which have the same walking direction while the other two have the opposite direction. In the CASIA B dataset, there are 10 walking sequences for each subject captured from 11 view angles: 6 sequences of normal walking ('nm'), 2 sequences of walking in a coat ('cl'), and 2 sequences of walking with a bag ('bg') on the shoulder. Hence, this dataset separately considers three variations in people's walking, namely viewing angle, clothing, and carrying conditions. The view angles of the cameras range from 0° to 180°. Figure 6 illustrates some sample video frames of the CASIA dataset.
B. Experimental Results on CASIA A Dataset
Since the CASIA A dataset contains only 20 subjects, each of which has only four gait sequences at three different angles, we trained three models, one per gait angle, with 20 output neurons in the final softmax layer of our proposed temporal network. To evaluate the performance of our proposed method on the CASIA A dataset, we used the leave-one-out cross-validation rule, i.e., one sequence was set aside for testing and the remainder was used for training the network for each view angle. We compare our results with five other prevailing state-of-the-art gait recognition methods: Wang [33], Goffredo [34], Liu [35], Lima [36], and Kusakunniran [37] (see Figure 7). Table III shows that the proposed method achieves a higher average correct class recognition rate (CCR), i.e., 100.0%, compared to the other methods.
C. Experimental Results on CASIA B Dataset
1) Experimental Setup
We designed two experimental setups (A and B) on the CASIA B dataset for evaluation. Experiment setup A is for evaluating the performance of the proposed method in single-view gait recognition.
Figure 7. Comparison of the correct class recognition rate (CCR) at different viewing angles between the proposed method and other prevailing gait recognition methods in the literature on the CASIA A dataset. Our method achieves a 100% class recognition rate at all viewing angles, which proves the efficacy of the proposed method.
TABLE III. COMPARISON AMONG DIFFERENT STATE-OF-THE-ART GAIT RECOGNITION METHODS WITHOUT VIEW VARIATION AT ALL THREE VIEW ANGLES OF THE CASIA A DATASET. THE PROPOSED METHOD ACHIEVES A HIGHER AVERAGE RECOGNITION RATE OF 100.0% AND OUTPERFORMS THE OTHER STATE-OF-THE-ART METHODS BY A LARGE MARGIN.

| Methods | 0° | 45° | 90° | Mean |
|---|---|---|---|---|
| Wang [33] | 88.75 | 87.50 | 90.00 | 88.75 |
| Goffredo [34] | 100.0 | 97.50 | 91.00 | 96.16 |
| Liu [35] | 85.00 | 87.50 | 95.00 | 89.10 |
| Lima [36] | 92.50 | 97.50 | 98.75 | 96.25 |
| Kusakunniran [37] | 100.0 | 100.0 | 98.75 | 99.58 |
| Proposed | 100.0 | 100.0 | 100.0 | 100.0 |
TABLE IV. EXPERIMENTAL SETUP FOR THE CASIA B DATASET. THE DATASET IS DIVIDED INTO TWO DIFFERENT SETUPS TO ORGANIZE TWO DIFFERENT TYPES OF EXPERIMENT. THE EVALUATION SET IS SUBDIVIDED INTO A GALLERY SET AND A PROBE SET. THE GALLERY SET CONSISTS OF THE FIRST 4 NORMAL WALKING SEQUENCES OF EACH SUBJECT AND THE PROBE SET CONTAINS THE REST OF THE WALKING SEQUENCES.

| Setup | Training IDs | Training Total | Evaluation IDs | Evaluation Total | Gallery Sequences | Probe Sequences |
|---|---|---|---|---|---|---|
| A | 1-62 | 62 | 63-124 | 62 | nm01-nm04 | nm05-nm06, bg01-bg02, cl01-cl02 |
| B | 1-74 | 74 | 75-124 | 50 | nm01-nm04 | nm05-nm06, bg01-bg02, cl01-cl02 |
TABLE V. CORRECT CLASS RECOGNITION RATE (%) OF THE PROPOSED METHOD ON ALL THREE PROBE SETS OF THE CASIA B DATASET. EACH ROW REPRESENTS A SPECIFIC VIEW OF THE GALLERY AND PROBE SET. THE PROBE SET OF NORMAL WALKING (PROBENM) ACHIEVES A 99.41% AVERAGE RECOGNITION RATE, WHILE THE PROBEBG AND PROBECL SETS ACHIEVE 97.80% AND 82.82% AVERAGE RECOGNITION RATES, RESPECTIVELY.

| g | ProbeNM | ProbeBG | ProbeCL |
|---|---|---|---|
| 0° | 100 | 100 | 81.52 |
| 18° | 100 | 100 | 82.11 |
| 36° | 100 | 100 | 83.58 |
| 54° | 100 | 100 | 85.48 |
| 72° | 100 | 98.39 | 84.46 |
| 90° | 98.39 | 96.77 | 83.72 |
| 108° | 100 | 96.77 | 83.28 |
| 126° | 100 | 98.39 | 84.16 |
| 144° | 100 | 98.39 | 83.58 |
| 162° | 98.39 | 95.16 | 80.65 |
| 180° | 96.77 | 91.93 | 78.45 |
| Avg. | 99.41 | 97.80 | 82.82 |
Figure 8. Comparison of CCR among different gait recognition algorithms on the CASIA B dataset without view variation. The results show that the proposed method outperforms the other state-of-the-art methods on all three probe sets of the CASIA B dataset, achieving 99.41% CCR in normal walking and 97.80% and 82.82% CCR under the two covariate conditions ProbeBG and ProbeCL, respectively. This outcome verifies the robustness of the proposed method against variations in carrying and clothing conditions.
To investigate the robustness to view variation, comparison results of the proposed method against other state-of-the-art methods under different view variations are also reported. Experiment setup B is designed for evaluating the cross-view recognition performance.
TABLE VI. COMPARISON BETWEEN THE PROPOSED METHOD AND OTHER STATE-OF-THE-ART GAIT RECOGNITION METHODS ON THE CASIA B DATASET WITHOUT VIEW VARIATION. THE PROPOSED METHOD OUTPERFORMS THE OTHER METHODS ON ALL THREE PROBE SETS OF THE CASIA B DATASET. AS THE PROPOSED METHOD DOES NOT DEPEND ON ANY BODY JOINT HIGHER THAN THE KNEE, IT SHOWS ROBUSTNESS TOWARD THESE COVARIATE FACTORS. IT ALSO ACHIEVES A HIGHER AVERAGE CORRECT CLASS RECOGNITION RATE (CCR) OF 93.34%, OUTPERFORMING THE OTHER METHODS BY A SIGNIFICANT MARGIN.

| Methods | ProbeNM | ProbeBG | ProbeCL | Average |
|---|---|---|---|---|
| Liao et al. [19] | 96.92 | 85.78 | 68.11 | 83.60 |
| Yu et al. [38] | 97.58 | 72.14 | 45.45 | 71.72 |
| Yu et al. [17] | 98.24 | 76.25 | 42.89 | 72.46 |
| Liao et al. [20] | 96.63 | 71.26 | 54.18 | 74.02 |
| Proposed | 99.41 | 97.80 | 82.82 | 93.34 |
For setup A, as shown in Table IV, we divided the dataset into two groups, where the first group, consisting of 62 subjects, was used to train the network. The second group contains the rest of the subjects and is used for evaluating the performance of the model. For setup B, the training and evaluation sets contain 74 and 50 subjects, respectively. In the evaluation set for both setups, 4 normal walking sequences of each subject are put into the gallery set and the remaining 6 walking sequences form three probe sets (ProbeNM, ProbeBG, ProbeCL). ProbeNM consists of the 2 other normal walking sequences, whereas ProbeBG and ProbeCL consist of the two sequences of the subject carrying a bag and wearing a coat, respectively.
2) Results on Single-View Gait Recognition without View Variation
The experimental results of single-view gait recognition on all three probe sets of the CASIA B dataset without view variation are presented in Table V. We achieved high average recognition rates of 97.80% and 82.82% on the ProbeBG and ProbeCL probe sets, respectively. This performance proves the robustness of our proposed method toward both carrying and clothing covariate conditions. We also achieved a high average class recognition rate of 99.41% in the normal walking condition.
3) Comparison with the State-of-the-art Methods without View Variation
We compare our experimental results with other state-of-the-art methods such as GaitGANv2 [17], PTSN [19], PoseGait [20], and Yu et al. [38], as shown in Figure 8. The experimental setup for all these methods was setup A (see Table IV). Table VI reports that the CCR of the proposed method outperforms all other methods in all three covariate conditions of the CASIA B dataset; our method achieved an average CCR of 93.34%, an improvement of approximately 10% over PTSN.
4) Results on Single-View Gait Recognition with View Variation
The performance of the proposed method on single-view gait recognition with view variation is shown in Table VII. Here, for a specific gallery (g) angle, the average CCR (%) over all eleven probe angles is reported; our method achieved average CCRs of 62.69%, 47.23%, and 33.46% for ProbeNM, ProbeBG, and ProbeCL, respectively.
TABLE VII. THE AVERAGE RECOGNITION RATES FOR ALL THREE PROBE SETS OF THE CASIA B DATASET. EACH ROW REPRESENTS THE AVERAGE VALUE OVER ALL ELEVEN PROBE ANGLES AT A SPECIFIC GALLERY ANGLE (g) FOR ALL THREE PROBE SETS.

| g | ProbeNM | ProbeBG | ProbeCL |
|---|---|---|---|
| 0° | 61.73 | 45.01 | 32.40 |
| 18° | 63.64 | 47.80 | 32.99 |
| 36° | 67.30 | 48.97 | 34.46 |
| 54° | 68.33 | 50.15 | 37.24 |
| 72° | 68.33 | 50.44 | 39.00 |
| 90° | 66.42 | 49.12 | 36.36 |
| 108° | 64.22 | 48.39 | 34.75 |
| 126° | 62.02 | 47.07 | 32.40 |
| 144° | 58.80 | 47.51 | 31.82 |
| 162° | 56.45 | 44.13 | 29.77 |
| 180° | 52.35 | 40.91 | 26.83 |
| Avg. | 62.69 | 47.23 | 33.46 |
TABLE VIII. COMPARISON AMONG DIFFERENT STATE-OF-THE-ART METHODS FOR GAIT RECOGNITION WITH VIEW VARIATION ON ALL THREE PROBE SETS OF THE CASIA B DATASET. EACH ROW REPRESENTS THE AVERAGE OVER ALL GALLERY VIEWS' AVERAGE RECOGNITION RATES. SIMILAR TO THE FIRST EXPERIMENT, THE PROPOSED METHOD ACHIEVES HIGHER ACCURACY ON THE TWO COVARIATE PROBE SETS (PROBEBG, PROBECL) AND COMPARABLE PERFORMANCE IN NORMAL WALKING WITH RESPECT TO THE OTHER METHODS.

| Methods | ProbeNM | ProbeBG | ProbeCL |
|---|---|---|---|
| Yu et al. [38] | 62.82 | 40.38 | 26.05 |
| Yu et al. [17] | 66.34 | 46.17 | 25.91 |
| Liao et al. [20] | 63.78 | 42.52 | 31.98 |
| Proposed | 62.69 | 47.23 | 33.46 |
Figure 9. Comparison with different state-of-the-art methods for gait recognition with view variation on all three probe sets of the CASIA B dataset. Here, the value reported for each algorithm is the average over all gallery views' average CCR. The proposed method outperforms the other state-of-the-art methods, achieving 47.23% and 33.46% under the two covariate conditions ProbeBG and ProbeCL, respectively.
5) Comparison with the State-of-the-art Methods with View
Variation
To better illustrate the robustness of our gait recognition method to view variation, the proposed method has been compared to three other state-of-the-art methods: GaitGANv2 [17], PoseGait [20], and Yu et al. [38]. It can be observed from the comparison in Figure 9 and Table VIII that the proposed method outperforms the others under different covariate conditions and achieves comparable performance in normal walking.
Since, to recognize gait, we consider features based on effective body joints, our method is less affected by variations in covariate conditions than other appearance-based methods or model-based methods that build their gait descriptors on ineffective features. That is why our method proves to be less sensitive to view variation and performs better under carrying-bag and clothing conditions.
6) Comparison with the State-of-the-art Methods on Cross-
view Gait Recognition
The gait recognition scheme in which the gallery and probe sets are matched across two different views is commonly known as cross-view gait recognition. To show the effectiveness of our method in cross-view recognition, we compare the proposed method with three other state-of-the-art methods, namely CNN [12], CMCC [39], and GEI-SVR [40], under the same experimental setup. The probe angles selected for comparison were 0°, 54°, 90°, and 126°.
Although the proposed method uses only one model to handle any view angle variation, it achieves comparable performance with other prevailing state-of-the-art methods in the literature which were specially designed and trained for cross-view gait recognition. From Table IX, it can be seen that CNN [12] achieves the highest recognition rates when the view variation is large, owing to its use of supervised information from all gallery angles during training.
TABLE IX. COMPARISON OF OUR PROPOSED METHOD WITH THE PREVIOUS BEST RESULTS OF CROSS-VIEW GAIT RECOGNITION AT DIFFERENT PROBE ANGLES OF THE CASIA B DATASET BY CCR (%). THE NETWORK WAS TRAINED ACCORDING TO EXPERIMENTAL SETUP B TO MATCH THE SETUP OF THE OTHER METHODS.

| Probe View | Gallery View | CNN | CMCC | GEI-SVR | Proposed |
|---|---|---|---|---|---|
| 0° | 18° | 95.0 | 85.0 | 84.0 | 97.0 |
| 0° | 36° | 73.5 | 47.0 | 45.0 | 80.0 |
| 54° | 18° | 91.5 | 65.0 | 64.0 | 83.0 |
| 54° | 36° | 98.5 | 97.0 | 95.0 | 100.0 |
| 54° | 72° | 98.5 | 95.0 | 93.0 | 100.0 |
| 54° | 90° | 93.0 | 63.0 | 59.0 | 83.0 |
| 90° | 54° | - | 66.0 | 63.0 | 84.0 |
| 90° | 72° | 99.5 | 96.0 | 95.0 | 96.0 |
| 90° | 108° | 99.5 | 95.0 | 95.0 | 95.0 |
| 90° | 126° | - | 68.0 | 65.0 | 71.0 |
| 126° | 90° | 92.0 | 78.0 | 78.0 | 76.0 |
| 126° | 108° | 99.0 | 98.0 | 98.0 | 92.0 |
| 126° | 144° | 97.0 | 98.0 | 98.0 | 96.0 |
| 126° | 162° | 83.0 | 75.0 | 74.0 | 77.0 |
The comparison in Table IX also illustrates that the proposed method performs better when the view variation is small. The reason for not achieving better performance at large view variations is that the model was trained with only one viewing angle.
V. CONCLUSION
In this paper, a novel feature extraction technique based on 2D human pose estimation was proposed to find effective gait features for view-invariant gait recognition that is robust to covariate factors. We also presented a novel RNN architecture which is much simpler, more efficient, and computationally less expensive than other state-of-the-art architectures in the literature. We considered human pose information as the gait features for our network because it not only has a rich gait representation capacity but also shows robustness toward variations in carrying and clothing conditions. Experimental results on the challenging CASIA A and CASIA B gait datasets clearly show that the method proposed in this paper outperforms the existing state-of-the-art methods in the literature.
In the future, a more accurate pose estimation algorithm could greatly improve the cross-view recognition rate, especially at large view variations, further boosting our performance toward state-of-the-art results there as well. Using a larger dataset containing thousands of subjects would also help us develop a more stable network suitable for practical applications like real-time surveillance.
REFERENCES
[1] S. Yu, D. Tan, T. Tan, “A framework for evaluating the effect of view
angle, clothing and carrying condition on gait recognition,” 18th Int. Conf.
on Pattern Recognition, Hong Kong, China, 2006, pp. 441–444.
[2] I. Rida, N. Almaadeed, and S. Almaadeed, "Robust gait recognition: a comprehensive survey," IET Biometrics, vol. 8, no. 1, pp. 14–28, Jan. 2019.
[3] J. Han, and B. Bhanu, “Individual recognition using gait energy image, ”
IEEE Trans. on Pattern Anal. and Mach. Intell., vol. 28, no. 2 , pp. 316–
322, Feb. 2006.
[4] K. Bashir, T. Xiang, and S. Gong, “Gait recognition using gait entropy
image,” 3rd Int. Conf. on Imaging for Crime Detection and Prevention,
London, UK, 2009.
[5] T.H.W. Lam, K.H. Cheung, and J.N.K. Liu, “Gait flow image: A
silhouette-based gait representation for human identification,” Pattern
Recognition, vol. 44, no. 4, pp. 973–987, Apr. 2011.
[6] X. Huang, and N.V. Boulgouris, "Gait recognition with shifted energy
image and structural feature extraction", IEEE Trans. Image Process., vol.
21, no. 4, pp. 2256–2268, Apr. 2012.
[7] H. Chao, Y. He, J. Zhang, and J. Feng, "Gaitset: regarding gait as a set for
cross-view gait recognition," The Thirty-Third AAAI Conference on
Artificial Intelligence, 2019.
[8] C. Yam, M. S. Nixon, and J. N. Carter, “Automated person recognition
by walking and running via model-based approaches,” Pattern
Recognition, vol. 37, no. 5, pp. 1057–1072, 2004.
[9] G. Ariyanto, and M.S. Nixon, “Model-based 3d gait biometrics,” Int. Joint
Conf. on Biometrics. Washington DC, USA, 2011. pp. 1–7.
[10] F. Tafazzoli, and R. Safabakhsh, “Model-based human gait recognition
using leg and arm movements,” Engineering Appl. of Art. Intell., vol. 23,
no. 8, pp. 1237–1246, Dec. 2010.
[11] Y. Feng, Y. Li, and J. Luo, “Learning effective gait features using LSTM,”
23rd Int. Conf. on Pattern Recognition, Cancun, Mexico, 2016. pp. 325–
330.
[12] Z. Wu, Y. Huang, L. Wang, X. Wang, and T. Tan, “A comprehensive study
on cross-view gait based human identification with deep CNNs,” IEEE
Trans. on Pattern Anal. Mach. Intell., vol. 39, no. 2, pp. 209–226, Feb.
2017.
[13] K. Shiraga, Y. Makihara, D. Muramatsu, T. Echigo, and Y. Yagi,
“GEINet: view-invariant gait recognition using a convolutional neural
network,” Int. Conf. on Biometrics, Halmstad, Sweden, 2016.
[14] T. Wolf, M. Babaee, and G. Rigoll, “Multi-view gait recognition using 3D
convolutional neural networks,” IEEE Int. Conf. on Image Processing,
Phoenix, AZ, USA, 2016, pp. 4165–4169.
[15] C. Zhang, W. Liu, H. Ma, and H. Fu, “Siamese neural network based gait
recognition for human identification,” IEEE Int. Conf. On Acoustics,
Speech and Signal Processing, Shanghai, China, 2016, pp. 2832–2836.
[16] S. Yu, H. Chen, E.B.G. Reyes, and N. Poh, “GaitGAN: invariant gait
feature extraction using generative adversarial networks,” IEEE Conf. on
Computer Vision and Pattern Recognition Workshops, Honolulu, HI,
USA, 2017, pp. 532–539
[17] S. Yu, R. Liao, W. An, H. Chen, E. B. Garcia, Y. Huang, and N. Poh,
“GaitGANv2: Invariant gait feature extraction using generative
adversarial networks,” Pattern Recognition, vol. 87, pp. 179–189, Mar.
2019.
[18] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. A. Sheikh, “OpenPose:
realtime multi-person 2D pose estimation using part affinity fields,” IEEE
Trans. on Pattern Anal. and Mach. Intell. 2019.
[19] R. Liao, C. Cao, E.B., Garcia, S. Yu, and Y. Huang, “Pose-based temporal-
spatial network (PTSN) for gait recognition with carrying and clothing
variations,” Chinese Conf. on Biometric Recognition, 2017, pp. 474–483.
[20] R. Liao, S. Yu, , W. An, H. Chen, and Y. Huang, “A model-based gait
recognition method with body pose and human prior knowledge,” Pattern
Recognition, vol. 98, Feb. 2019.
[21] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu, “An end-to-end spatio-
temporal attention model for human action recognition from skeleton
data,” Thirty-First AAAI Conf. on Art. Intell., California, USA, 2017, pp.
4263 – 4270.
[22] Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for
skeleton based action recognition,” IEEE Conf. on Computer Vision and
Pattern Recognition, Boston, MA, USA, 2015, pp. 1110–1118
[23] D. Cunado, M.S. Nixon, J.N. Carter, “Using gait as a biometric, via phase-
weighted magnitude spectra,” Int. Conf. on Audio and Video Based
Biometric Person Authentication, Berlin, Heidelberg, 1997. pp. 93–102.
[24] L. Wang, H. Ning, T. Tan, and W. Hu, “Fusion of static and dynamic body
biometrics for gait recognition,” IEEE Trans. Circuits Syst. Video
Technol., vol. 14, no. 2, pp.149–158, Mar. 2004.
[25] R.M. Araujo, G. Graña , and V. Andersson, “Towards skeleton biometric
identification using the Microsoft Kinect sensor,” ACM Symposium on
Applied Computing, Coimbra, Portugal, 2013, pp. 21-26.
[26] M. Schuster, and K.K. Paliwal, “Bidirectional recurrent neural networks,”
IEEE Trans. on Signal Processing, vol. 45, no. 11, pp. 2673–2681, Nov.
1997.
[27] S. Ioffe, C. Szegedy, “Batch normalization: Accelerating deep network
training by reducing internal covariate shift,” 32nd Int. Conf. on Machine
Learning, Lille, France, 2015. pp. 448–456 .
[28] D.P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 3rd Int. Conf. on Learning Representations, San Diego, 2015.
[29] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning
approach for deep face recognition,” European Conf. on Computer Vision,
2016. pp. 499–515.
[30] M. Hofmann, J. Geiger, S. Bachmann, B. Schuller, and G. Rigoll, “The
TUM gait from audio, image and depth (gaid) database: Multimodal
recognition of subjects and traits,” Journal of Visual Communication and
Image Representation, vol. 25, no. 1, pp. 195–206, Jan. 2014.
[31] T. Noriko, Y. Makihara, D. Muramatsu, T. Echigo, and Y. Yagi, “Multi-
view large population gait dataset and its performance evaluation for
cross-view gait recognition,” IPSJ Transaction. on Computer Vision and
Applications, vol. 10, no. 1, pp. 4, Feb. 2018.
[32] S. Sarkar, P.J. Phillips, Z. Liu, I.R. Vega, P. Grother, and K.W. Bowyer
“The humanID gait challenge problem: data sets, performance, and
analysis,” IEEE Trans on Pattern Anal. and Mach. Intell., vol. 27, no. 2,
pp. 162–177, Feb. 2005.
[33] L. Wang, T. Tieniu, W. Hu, and H. Ning, “Automatic gait recognition
based on statistical shape analysis,” IEEE Trans. on Image Process., vol.
12, no. 9, pp. 1120 –1131, Sep. 2003.
[34] M. Goffredo, J.N. Carter, and M.S. Nixon, “Front-view gait recognition,”
IEEE Second International Conference on Biometrics: Theory,
Applications and Systems, Arlington, VA, USA, 2008. pp. 1– 6.
[35] D. Liu, M. Ye, X. Li, F. Zhang, and L. Lin, “Memory-based gait
recognition,” British Machine Vision Conference, 2016, pp. 82.1–82.12.
[36] V. C. de Lima, and W. R. Schwartz, "Gait recognition using pose
estimation and signal processing," Iberoamerican Congress on Pattern
Recognition, 2019, pp. 719–728.
[37] W. Kusakunniran, Q. Wu, H. Li, and J. Zhang, "Automatic gait recognition
using weighted binary pattern on video," Sixth IEEE International
Conference on Advanced Video and Signal Based Surveillance, Genova,
Italy, 2009, pp. 49–54.
[38] S. Yu, H. Chen, Q. Wang, L. Shen, and Y. Huang, “Invariant feature
extraction for gait recognition using only one uniform model,”
Neurocomputing vol. 239, pp. 81–93, May 2017.
[39] W. Kusakunniran, Q. Wu, J. Zhang, H. Li, and L. Wang, “Recognizing
gaits across views through correlated motion co-clustering,” IEEE Trans.
on Image Process., vol. 23, no. 2, pp. 696 – 709, Feb. 2014.
[40] W. Kusakunniran, Q. Wu, J. Zhang, and H. Li, “Support vector regression
for multi-view gait recognition based on local motion feature selection,”
in Proc. IEEE Int. Conf. CVPR, San Francisco, CA, USA, Jun. 2010, pp.
974–981.