American Sign Language Recognition and Training Method with Recurrent Neural Network

C.K.M. LEE a, Kam K.H. NG b, Chun-Hsien CHEN c,*, H.C.W. LAU d, S.Y. CHUNG a, Tiffany TSOI a

a Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University, Hung Hom, Hong Kong, China
b Interdisciplinary Division of Aeronautical and Aviation Engineering, The Hong Kong Polytechnic University, Hung Hom, Hong Kong SAR, China
c School of Mechanical and Aerospace Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore
d School of Business, The University of Western Sydney, Australia

* Corresponding author. Address: School of Mechanical and Aerospace Engineering, Nanyang Technological University, North Spine (N3), Level 2, 50 Nanyang Avenue, Singapore 639798, Singapore. Tel.: +65 6790 4888; fax: +65 6792 4062.

Email addresses: ckm.lee@polyu.edu.hk (C.K.M. LEE), kam.kh.ng@polyu.edu.hk (Kam K.H. NG), MCHchen@ntu.edu.sg (Chun-Hsien CHEN), H.Lau@westernsydney.edu.au (H.C.W. LAU), grace.sy.chung@polyu.edu.hk (S.Y. CHUNG), tiffanytsy.tsoi@connect.polyu.hk (Tiffany TSOI)

Acknowledgment
The authors would like to express their gratitude and appreciation to the anonymous reviewers, the editor-in-chief and editors of the journal for providing valuable comments for the continuing improvement of this article. The research was supported by The Hong Kong Polytechnic University, Hong Kong, Nanyang Technological University, Singapore, and The University of Western Sydney, Australia.

Declarations of interest: none
Abstract

Though American sign language (ASL) has gained recognition from American society, few ASL applications have been developed for educational purposes, and applications designed with real-time sign recognition systems are also lacking. The leap motion controller facilitates real-time and accurate recognition of ASL signs, which opens an opportunity to design a learning application with an embedded real-time sign recognition system that seeks to improve the effectiveness of ASL learning. This project proposes an ASL learning application prototype: a whack-a-mole game with a real-time sign recognition system embedded. Since both static and dynamic signs (J, Z) exist in the ASL alphabet, a long short-term memory recurrent neural network combined with a k-nearest-neighbour method is adopted as the classification method, because it can handle sequences of input. Characteristics such as the sphere radius, angles between fingers and distances between finger positions are extracted as input for the classification model. The model is trained with 2600 samples, 100 samples for each alphabet. The experimental results reveal that recognition of the 26 ASL alphabets yields an average accuracy rate of 99.44% and 91.82% in 5-fold cross-validation with the use of the leap motion controller.

Keywords: American Sign Language; Leap motion controller; Learning Application; Sign Recognition System
1. Introduction

1.1. Problem description

Sign languages are natural languages that have developed through contact among the hearing impaired rather than being invented by any formal system (Napier & Leeson, 2016). They differ from spoken languages in primarily two ways. First, sign languages are natural and mature languages that are "articulated in a visual-spatial modality", unlike spoken ones, which are presented in an "oral-aural modality". Second, Napier and Leeson (2016) pointed out that sign languages employ two hands, facial muscles, the body and head, and sometimes also involve vocalisation. They are neither universal nor mutually intelligible (Beal-Alvarez, 2014). In other words, a sign language developed in one region is not applicable in other regions and contains region-specific varieties that require special methods/techniques of acquisition. Currently, 141 types of sign languages exist worldwide (Liddell & Johnson, 1989).

American sign language (ASL) is the most widely used sign language among the deaf in the United States and English-speaking regions of Canada (Napier, Leigh, & Nann, 2007). Though increasing recognition of ASL has boosted confidence among the hearing impaired, the limited resources available have created social and cultural issues within hearing impaired communities, despite the amount of linguistic research carried out in the field (Marschark & Spencer, 2010). In the United States, hearing impaired and hard-of-hearing students can choose between attending residential schools (catering only to students who are hearing impaired or hard-of-hearing) or public schools. As the integration of the hearing impaired with peers without hearing impairment is emphasised, an increasing number of hearing impaired students are enrolling in public schools. However, in most cases they are placed in environments without adequate teaching support (Marschark & Spencer, 2010).

To create an inclusive environment for hearing and hearing impaired students in public schools, promoting ASL among the hearing public would be effective. With the implementation of ASL in schools, hearing teachers and students can communicate through both linguistic and non-linguistic means, which helps create an interactive environment for hearing impaired and hard-of-hearing students and thus enhances the effectiveness of academic learning. Furthermore, the promotion of ASL helps achieve the inclusion of the hearing impaired in society by boosting learning motivation with educational applications. Being a feasible and economical solution, the leap motion controller is commonly used as a device for sign recognition systems (Arsalan, Kim, Owais, & Park, 2020; Elboushaki, Hannane, Afdel, & Koutti, 2020). However, there exists a research gap in the adoption of the leap motion controller for sign education purposes. A predominant portion of the research only examines the viability of different sign recognition models with the leap motion controller and does not extend the model into an educational application that aids sign language learning and promotes sign languages. Only Parreño, Celi, Quevedo, Rivas, and Andaluz (2017) have proposed a didactic game prototype, for Ecuadorian signs. Therefore, there is a paucity of research focusing on the development of educational applications for ASL with the leap motion controller and investigating the effectiveness of such applications in improving sign learning.
1.2. Contributions of the research

This research seeks to design an ASL learning application based on game-learning and to develop a real-time sign recognition system with the leap motion controller for use in the application. The sign recognition work starts with identifying and extracting ASL sign features and subsequently developing a suitable algorithm for the recognition system. After applying the algorithm and training the network architecture, the system gains the capacity to recognise and classify ASL signs into 26 alphabets. The classification using the extracted features is processed by a long short-term memory recurrent neural network (LSTM-RNN) with a k-nearest neighbour (kNN) method. Finally, the system is integrated into the game environment of the ASL learning application. This application is expected to promote ASL among the hearing impaired and the non-hearing impaired, motivating them to learn ASL through the entertainment and engagement provided by the game environment and further helping the hearing impaired to better integrate into society. Furthermore, it encourages and promotes the use of ASL as a second language that is worth acquiring.

The contributions of the research can be summarised as follows:
• The proposed LSTM-RNN with kNN method can recognise 26 alphabets with a recognition rate of 99.44% accuracy and 91.82% in 5-fold cross-validation using the leap motion controller. The proposed method outperforms other well-known algorithms in the literature.
• The leap motion controller is a sensor based on monochromatic IR cameras and three infrared LEDs that tracks the 3D motion of hand gestures, including the palm centre, fingertip positions, sphere radius and finger bone positions, for every 200 frames collected. Given that these data are available from the leap motion controller, we can further extract features for the classification of ASL, which is the application in our study.
• The programming flow of the proposed model was designed as a learning-based program. A game module and a recognition module run in real time. We aim at promoting ASL in a learning-based environment as our application.
1.3. The organisation of the paper

The rest of this article is organised as follows. Section 2 describes the literature review and Section 3 illustrates the proposed framework for the ASL learning application, including the game module and the real-time recognition system. Section 4 presents the validation results and analyses the performance of the proposed recognition system. Section 5 summarises the research, including the conclusion, research contributions, limitations and future development.
2. Literature review

2.1. Learning application

In terms of educational technology, knowledge acquisition in students can be improved through the fusion of academic activities with interactive, collaborative and immersive technologies (Souza, 2015). Notably, several studies have proposed new approaches that stimulate the mastering of sign language and knowledge acquisition by promoting motivation and excitement in pedagogical activities. Parreño, et al. (2017) suggested that an intelligent game-based sign learning system is more effective in the improvement of sign language skills. Pontes, Duarte, and Pinheiro (2018) have also proposed an educational digital game with a modular software architecture that acts as a motivator in the Brazilian Sign Language learning process. Notably, modular software architectures allow adjustments to accommodate other sign languages (Rastgoo, Kiani, & Escalera, 2020). Furthermore, it is suggested that engagement is ensured when students concentrate on and enjoy sign learning via the game, which eventually improves learning performance among students (Kamnardsiri, Hongsit, Khuwuthyakorn, & Wongta, 2017). In summary, educational games are proven to be effective tools in learning sign languages, supported by the engagement, motivation and entertainment they provide.
2.2. The comparison of sign recognition methods

Past research has suggested several methods for the recognition of ASL, including the use of motion gloves, the Kinect sensor, image processing with cameras and leap motion controllers. Oz and Leu (2011) developed an artificial neural network model to track the 3D motion for 50 ASL words. Motion gloves for ASL recognition are more expensive, impose higher restrictions in terms of hand anatomy and are less comfortable for users compared to vision-based methods. Moreover, they are time-consuming and may result in imprecise calibrations caused by the wear and tear from repeated use of the gloves (Huenerfauth & Lu, 2010; Luzanin & Plancak, 2014; Oz & Leu, 2007). Due to sign complexities, constant finger occlusions, high interclass similarities and significant intraclass variations, the recognition of ASL signs still remains a challenging task for Kinect sensors used in isolation (Sun, Zhang, Bao, Xu, & Mei, 2013; Tao, Leu, & Yin, 2018). Furthermore, the calibration of the sensory data is also important. Several studies have focused on the measurement of angular positions to predict motion gestures (Fujiwara, Santos, & Suzuki, 2014). Tubaiz, Shanableh, and Assaleh (2015) and Aly, Aly, and Almotairi (2019) suggested that an ASL recognition system can be developed in a user-dependent mode and proposed a modified kNN approach. Readers can refer to the review article on sensory gloves for sign language recognition (Ahmed, Zaidan, Zaidan, Salih, & Lakulu, 2018). Sensing boards and wearable applications for ASL recognition have also been extensively studied in the literature (B. G. Lee & Lee, 2018; Paudyal, Lee, Banerjee, & Gupta, 2019; Jian Wu & Jafari, 2017; J. Wu, Sun, & Jafari, 2016; J. Wu, Tian, Sun, Estevez, & Jafari, 2015).

Among all vision-based sign recognition methods, image processing is a low-cost, widely accessible and effective option (Ciaramello & Hemami, 2011; Starner, Weaver, & Pentland, 1998); however, it requires long computations to recognise the hand and fingers, which results in a long interval before the recognition result is projected (Khelil, et al., 2016). Furthermore, skin colour and lighting conditions are critical factors that severely affect and hinder data accuracy (Bheda & Radpour, 2017). In contrast, the palm-sized leap motion controller is a more economical and portable solution than the motion gloves or Kinect sensors discussed above (Chuan, Regina, & Guardino, 2014). Fast processing, robustness and a smaller memory requirement are additional advantages of the leap motion controller (Naglot & Kulkarni, 2016). However, the controller has an inconsistent sampling frequency and requires post-processing to reduce its effect on real-time recognition systems (Guna, Jakus, Pogačnik, Tomažič, & Sodnik, 2014). The comparison of glove-based and vision-based methods for gesture recognition applications is shown in Table 1.
Table 1
Comparison between glove-based and vision-based methods

Factors        Motion Gloves    Vision-based Methods
User comfort   Less             High
Portability    Lower            Higher
Cost           Higher           Lower
Hand Anatomy   Low              High
Calibration    Critical         Not Critical
2.3. Structure and recognition framework of the leap motion controller

The controller, comprised of infrared cameras and optical sensors, is used for sensing hand and finger movements in 3D space. According to the sensor's coordinate system, the position and speed of the palm and fingers can be recognised with infrared imaging (Khelil, et al., 2016). The controller employs a right-handed Cartesian coordinate system, with the XYZ axes intersecting at the centre of the sensor, as shown in Fig. 1. The controller can be programmed through the leap motion application programming interface (API), and the positioning and speed data mentioned above can be obtained through the API.

Fig. 1. Orientation of leap motion controller

A general sign recognition system with the leap motion controller consists of the following essential steps: data acquisition, feature extraction, classification and validation. Basically, a general recognition process starts with a sign being captured by the leap motion controller, after which the data is sent for pre-processing. In the data acquisition stage, hand palm data and finger data can be acquired from the API. For feature extraction, different studies have defined and extracted features for sign recognition and proposed numerous methods to compute feature vectors for further processing (Chong & Lee, 2018; Chuan, et al., 2014; Khelil, et al., 2016). Furthermore, the classification and validation techniques used in the literature on sign recognition systems with the leap motion controller are compared and the results are shown in Table 2.
Table 2
Comparison of sign recognition systems with leap motion controller on classification and validation techniques

Ref.                                                 Number of Gestures                            Classifier   Accuracy (%)
(Danilo Avola, Bernardi, Cinque, Foresti,
 & Massaroni, 2018)                                  30 ASL gestures (12 dynamic and 18 static)    RNN          96.41
(Chong & Lee, 2018)                                  26 ASL gestures (A-Z)                         DNN          93.81
                                                     36 ASL gestures (A-Z, 0-9)                                 88.79
(Chuan, et al., 2014)                                26 ASL gestures (A-Z)                         SVM          79.83
(Du, Liu, Feng, Chen, & Wu, 2017)                    10 selected gestures                          SVM          83.36
(Khelil, et al., 2016)                               10 ASL gestures (0-9)                         SVM          91.30
It is observed that the support vector machine (SVM) has been a popular classification method in sign recognition systems with leap motion over the years, whereas the use of neural networks is a newer classification approach (H. Lee, Li, Rai, & Chattopadhyay, 2020; Valente & Maldonado, 2020). Moreover, different types of cross-validation techniques are used in model validation as well. The neural network, including the deep neural network (DNN), is a type of deep learning and is commonly used for classification or regression with success in different areas (Akyol, 2020; Zhong, et al., 2020). The predominant reason for neural networks outperforming SVMs is the former's ability to learn important features from any data structure and to handle multiclass classification with a single network structure (Rojas, 1996). The artificial neural network is the most commonly used type of neural network, while the recurrent neural network (RNN) is one of its categories, whose connections between nodes form a directed graph along temporal sequences (Asghari, Leung, & Hsu, 2020; Jeong, et al., 2019; Liu, Yu, Yu, Chen, & Wu, 2020; Rojas, 1996). It demonstrates a temporal dynamic behaviour, which implies that the function is time dependent. However, a classic RNN is not able to handle a long time frame. Long short-term memory (LSTM) is a special type of RNN that addresses the limitations of the classic RNN (Hochreiter & Schmidhuber, 1997). LSTM is effective in learning long-term dependencies. It is suggested that constant error backpropagation within internal states contributes to its ability to bridge long time lags (Hochreiter & Schmidhuber, 1997). Noise, continuous values and distributed representations can be handled effectively by LSTM.
3. Methodology

The system conceptual framework is shown in Fig. 2 and consists of two running modules: the game module and the real-time sign recognition system. The proposed learning application is fundamentally a special Whack-A-Mole game. Rather than mouse-clicking, a question pertaining to ASL signs has to be accurately answered in order to strike the mole. Each mole comes up randomly from one of 7 holes holding a stick, on which 1 of the 26 English alphabets is randomly printed. In the meantime, the appropriate hand configuration for the corresponding ASL alphabet is shown in the upper left-hand corner as a hint. Users have to make the ASL sign over the leap motion controller, and real-time sign recognition then takes place. The real-time sign recognition system is comprised of three phases: data acquisition, feature extraction and classification. First, data acquisition happens with data that is directly extracted from the leap motion API. Next, some data has to be further processed into features. Following this, the structured data can be input into the pre-trained classification model for real-time recognition. Gestures are classified into 1 of the 26 classes. If the classification result matches what is on the stick, the accuracy is shown on the game interface. The mole is struck and a point is added only if the accuracy rate is 80% or above. Otherwise, a miss is recorded. The time limit for each question is half a minute and each trial of the game ends after 5 questions, which means that the steps in the conceptual framework are gone through 5 times. Fig. 3 illustrates a scene in the game when a question is answered correctly through the leap motion controller.

The designed programming flow is shown in Fig. 4 and primarily consists of two scripts running synchronously, i.e. the Real-time Recorder and Gaming scripts. When the application is initialised, the Real-time Recorder first creates a file in the CSV format and initialises the real-time listener. The real-time listener continuously collects data from the leap motion API. The sign language includes static and dynamic signs. Furthermore, the leap motion controller is sensitive to hand gesture motion, and slight motion changes may be captured. Therefore, 30 features extracted over 200 frames are considered to accommodate hand gesture motion changes and the static and dynamic nature of ASL signs. Every 200 collected frames are passed to the RNN classifier for classification. The classification results are sent back to the Real-time Recorder and saved into the CSV file. Meanwhile, the Gaming script runs synchronously. When a mole comes up, it continuously takes the latest classification result from the CSV file to determine whether the mole is hit and to show the accuracy score.
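A minimal sketch of this Real-time Recorder loop is given below, assuming a hypothetical get_frame_features() helper that wraps the leap motion API calls and returns the 30 features of the current frame, and assuming the trained classifier is stored under the file name asl_lstm_knn.h5; the Gaming script is represented only by the CSV hand-off described above.

# Sketch of the Real-time Recorder loop: collect 200-frame windows of the
# 30 features, classify each window, and append the result to a CSV file that
# the Gaming script polls. get_frame_features() is a hypothetical placeholder
# for the leap motion API calls and feature extraction described in Section 3.1.
import csv
import numpy as np
from tensorflow.keras.models import load_model

ALPHABET = [chr(c) for c in range(ord('A'), ord('Z') + 1)]
model = load_model('asl_lstm_knn.h5')          # pre-trained classifier (assumed file name)

def recorder_loop(get_frame_features, csv_path='recognition_log.csv', window=200):
    buffer = []
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['letter', 'confidence'])
        while True:                              # runs for the duration of a game trial
            buffer.append(get_frame_features())  # 30 features for the current frame
            if len(buffer) == window:
                x = np.asarray(buffer, dtype=np.float32)[np.newaxis]  # shape (1, 200, 30)
                probs = model.predict(x, verbose=0)[0]
                letter = ALPHABET[int(np.argmax(probs))]
                writer.writerow([letter, float(probs.max())])
                f.flush()                        # Gaming script reads the latest row
                buffer = []                      # start the next 200-frame window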
Fig. 2. Conceptual framework of the game modular based ASL recognition

Fig. 3. Questions answered correctly in the application developed

Fig. 4. Designed programming flow
3.1. Data acquisition for ASL recognition using the leap motion controller

A general recognition process starts with a sign being captured by the leap motion controller; subsequently, the data is sent for pre-processing. Hand palm data, hand sphere radius and finger data are acquired, as demonstrated in Fig. 5.

Hand palm data includes the unit vector of the palm, the position of the palm centre, the velocity of the palm and the palm normal (Naidu & Ghotkar, 2016). In addition, the hand palm sphere radius, grab strength and pinch strength can be obtained. The hand palm sphere radius measures a sphere that matches the curvature of the hand. The line connecting the red dots in Fig. 6 illustrates the diameter of the sphere; hence, half of it is the radius. The grab strength refers to the strength of a grab hand pose, where the value 0 represents an open hand and the value 1 represents a grab hand pose. Similarly, the pinch strength lies between 0 and 1, where 0 means an open hand is detected and 1 means a pinch hand pose is recognised. Pinching can be done with the thumb and any other finger.

Fig. 5. Palm centre and fingertip position

Fig. 6. Sphere radius

Fig. 7. Finger bone positions

The finger data carries the direction and length of each finger, the tip velocity and the positions of the joints, as shown in Fig. 7. Other than the fingertip positions, the positions of the joints between the distal, intermediate, proximal and metacarpal bones can be obtained (Khelil, et al., 2016).
We referred to the feature extraction methods for the leap motion controller proposed by Chong and Lee (2018). The extracted features described below are used to characterise palm flexion, hand movement, the relation between the palm and the fingertips, as well as the relation between fingertips.

The standard deviation of the palm position (S) can be calculated using Equation (1), where $P_i$ represents the position of the palm centre in frame $i$, $\bar{P}$ is the mean palm centre position and $N$ denotes the size of the dataset.

$S = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(P_i - \bar{P}\right)^2}$   (1)

The palm sphere radius (R) can be computed as shown in Equation (2), where $F_i$ represents the positions of the fingertips, $i = 1, \ldots, 5$ denote the thumb, index, middle, ring and little fingers respectively, and $\bar{F}$ is the centroid of the fingertip positions.

$R = \frac{1}{5}\sum_{i=1}^{5}\left\lVert F_i - \bar{F} \right\rVert$   (2)

The angles between 2 adjacent fingers (A) can be calculated with Equation (3), where $D_i$ denotes the direction vector of finger $i$. Note that the angle between the thumb and the little finger is excluded due to the inclusiveness of palm curliness, which is already captured by R.

$A_i = \arccos\left(\frac{D_i \cdot D_{i+1}}{\lVert D_i \rVert \, \lVert D_{i+1} \rVert}\right), \quad i = 1, \ldots, 4$   (3)

The distance between all the fingers (L), taken 2 at a time in a total of 10 pairs, is computed according to Equation (4), where $F_i$ and $F_j$ represent the fingertip positions 1 to 5, with $i < j$.

$L_{ij} = \left\lVert F_i - F_j \right\rVert, \quad 1 \le i < j \le 5$   (4)
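For illustration, a sketch of these feature computations with NumPy is given below; the palm positions, fingertip positions and fingertip direction vectors are assumed to be given as arrays, and the sphere fit in Eq. (2) is simplified to the mean fingertip-to-centroid distance.

# Sketch of the feature computations in Eqs. (1)-(4) using NumPy.
# palm: (N, 3) palm-centre positions over a window; tips: (5, 3) fingertip
# positions; dirs: (5, 3) fingertip direction vectors (thumb..little).
import numpy as np
from itertools import combinations

def palm_std(palm):                      # Eq. (1): std. dev. of palm position
    return np.sqrt(np.mean(np.sum((palm - palm.mean(axis=0)) ** 2, axis=1)))

def sphere_radius(tips):                 # Eq. (2): simplified sphere fit to fingertips
    centre = tips.mean(axis=0)
    return np.mean(np.linalg.norm(tips - centre, axis=1))

def adjacent_angles(dirs):               # Eq. (3): thumb-index, ..., ring-little (thumb-little excluded)
    angles = []
    for i in range(4):
        cos = dirs[i] @ dirs[i + 1] / (np.linalg.norm(dirs[i]) * np.linalg.norm(dirs[i + 1]))
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return np.array(angles)

def pairwise_distances(tips):            # Eq. (4): 10 fingertip-pair distances
    return np.array([np.linalg.norm(tips[i] - tips[j]) for i, j in combinations(range(5), 2)])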
3.2. Training of the sign recognition model by feature extraction

Real-time sign recognition requires a pre-trained classification model. First, data samples should be taken as input for the training of the model. Thus, model training commences by collecting raw data from the leap motion API. Since ASL signs are characterised by relative positions and angles between the palm and fingers, both palm and finger data are vital. Thus, the data in Table 3 were collected for the proposed work. The front and rear views of ASL on the leap motion controller are presented in Fig. 8 and Fig. 9, respectively.

Table 3
Data extracted in proposed work

Data                              Details
(a) Position of palm centre       X, Y and Z coordinates of the palm centre are extracted as 3 separate data.
(b) Unit vector of palm normal    A vector pointing perpendicular to the palm direction
(c) Sphere radius                 The radius of the sphere that matches the curvature of the hand
(d) Grab strength                 Strength of being a grab hand pose [0, 1]
(e) Pinch strength                Strength of being a pinch hand pose [0, 1]
(f) Fingertip positions           Positions of the thumb, index, middle, ring and little fingertips
(g) Fingertip directions          Directions of the thumb, index, middle, ring and little fingertips, extracted in radians

Fig. 8. Front view of American sign language on leap motion

Fig. 9. Rear view of American sign language on leap motion
For feature extraction, some raw data, such as (a), (c), (d) and (e), were directly used as features. The others were further processed into features. Finally, 30 features were generated, as shown in Table 4. A total of 2600 data samples were collected, with 100 samples for each of the 26 alphabets used for training the model. Each sample is constituted of 200 frames of these 30 features. Only right-hand samples were collected. Since the frame rate varies with the computing resources available and the activities performed, approximately 110 frames per second were collected in this work under our computing resources and environment. Subsequently, the 2600 data samples were stacked into a file in npy format of size (2600, 200, 30). A set of labels was also created for identifying the data samples' classes; this is an npy file of size (2600, 26).
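A sketch of assembling these arrays is shown below; the samples variable and the output file names are assumptions for illustration only.

# Sketch of assembling the training arrays described above: 2600 samples,
# each of 200 frames x 30 features, plus one-hot labels for the 26 classes.
# samples is assumed to be a list of (letter_index, frames) pairs where
# frames is a (200, 30) array captured from the leap motion controller.
import numpy as np

def build_dataset(samples, n_classes=26, out_x='asl_x.npy', out_y='asl_y.npy'):
    x = np.stack([frames for _, frames in samples]).astype(np.float32)   # (2600, 200, 30)
    y = np.zeros((len(samples), n_classes), dtype=np.float32)            # (2600, 26)
    for row, (label, _) in enumerate(samples):
        y[row, label] = 1.0                                              # one-hot class label
    np.save(out_x, x)
    np.save(out_y, y)
    return x, y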
Table 4
Features extracted for model training

1   Palm centre position X
2   Palm centre position Y
3   Palm centre position Z
4   Sphere radius (mm)
5   Grab strength [0,1]
6   Pinch strength [0,1]
7   Distance between palm centre position and thumb tip position
8   Distance between palm centre position and index tip position
9   Distance between palm centre position and middle tip position
10  Distance between palm centre position and ring tip position
11  Distance between palm centre position and little tip position
12  Angle between thumb normal and thumb tip direction (radian)
13  Angle between thumb normal and index tip direction (radian)
14  Angle between thumb normal and middle tip direction (radian)
15  Angle between thumb normal and ring tip direction (radian)
16  Angle between thumb normal and little tip direction (radian)
17  Distance between thumb tip position and index tip position
18  Distance between thumb tip position and middle tip position
19  Distance between thumb tip position and ring tip position
20  Distance between thumb tip position and little tip position
21  Distance between index tip position and middle tip position
22  Distance between index tip position and ring tip position
23  Distance between index tip position and little tip position
24  Distance between middle tip position and ring tip position
25  Distance between middle tip position and little tip position
26  Distance between ring tip position and little tip position
27  Angle between thumb tip direction and index tip direction (radian)
28  Angle between index tip direction and middle tip direction (radian)
29  Angle between middle tip direction and ring tip direction (radian)
30  Angle between ring tip direction and little tip direction (radian)
The proposed model consists of 3 layers after the input layer, as shown in Fig. 10.

Fig. 10. Proposed classifier model
First, the LSTM layer, constituted of 28 neurons, is selected due to its capability for handling data over a long time frame. For the algorithmic structure of LSTM, readers can refer to the work by Goyal, Pandey, and Jain (2018). Three parameters are to be determined: the batch size, the number of epochs and the number of units for LSTM. The batch size refers to the number of samples used for each training step. A larger batch size tends to result in a model with lower accuracy, while a smaller batch size requires much more training time, which would not be efficient enough. The number of epochs represents the number of passes over the entire dataset. After each epoch, an evaluation is made and the weights in the neural network are updated. With more epochs trained, the model should become more accurate. However, a model trained with too many epochs would overfit. Overfitting appears when the model predicts data in an unnecessarily complicated way; in other words, it fits known data well yet is less successful in fitting subsequent data than a simpler model. The number of units in the LSTM refers to the dimensionality of the LSTM output space. It can also be seen as the number of neurons in the layer. It is hard to determine whether a larger or smaller number of units would be better, as every model with different features is optimised by a differing number of units.
10
The final step before model training would be the selection of model parameters. Three parameters are to be
11
determined: batch size, number of epochs and number of units for LSTM. Batch size refers to the number of
12
samples for training each time, whereas the number of epochs represents the number of passes over the entire
13
dataset. For units in LSTM, it refers to the dimensionality of LSTM output space. It can also be considered as
14
the number of neurons in the layer. To determine the most effective parameters, “gridsearchCV” function from
15
“sklearn” library in Python was used. It is observed that the units of LSTM, batch size and number of epochs
16
are selected between 28 and 30, 32 and 64, 30 and 40 respectively. Table 5 shows a model grid created after
17
applying function (5).
18
19


(5)
20
Table 5
21
Model grid for selection of parameters
22
Units: 28
Units: 30
Batch size
Batch size
Number of epochs
32
64
32
64
30
0.094
0.098
0.091
0.077
40
0.135
0.100
0.120
0.101
23
The grid indicates that 28 units, a batch size of 32 and 40 epochs would be the best parameters for optimising model performance within the searched ranges. The number of epochs was subsequently increased to 80 for the final model to further improve the accuracy (see Section 4). The selected model parameters were then input, and finally the model was trained and output in h5 format for use in real-time sign recognition.
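A sketch of the parameter search is given below. It uses plain Python loops with a held-out validation split rather than sklearn's GridSearchCV wrapper, and assumes a build_model(units) factory that returns a compiled Keras model (such a factory is sketched after Algorithm 1 below).

# Sketch of the grid search over LSTM units, batch size and number of epochs.
# Plain loops and a held-out validation split are used instead of sklearn's
# GridSearchCV wrapper; build_model(units) is assumed to return a compiled
# Keras model like the one sketched after Algorithm 1.
from sklearn.model_selection import train_test_split

def grid_search(x, y, build_model):
    x_tr, x_val, y_tr, y_val = train_test_split(x, y, test_size=0.2, random_state=0)
    results = {}
    for units in (28, 30):
        for batch_size in (32, 64):
            for epochs in (30, 40):
                model = build_model(units)
                model.fit(x_tr, y_tr, batch_size=batch_size, epochs=epochs, verbose=0)
                _, acc = model.evaluate(x_val, y_val, verbose=0)
                results[(units, batch_size, epochs)] = acc
    best = max(results, key=results.get)          # (units, batch_size, epochs) with highest accuracy
    return best, results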
After selecting the above parameters, the loss function should be selected for compiling the model to optimise its performance. Categorical cross-entropy, a multi-class logarithmic loss, is selected. The proposed model was created based on the training set, and categorical cross-entropy was measured on the test set to evaluate the accuracy of the model's predictions. Cross-entropy, used as an alternative to the squared error, is an error measure intended for networks whose outputs represent independent hypotheses and whose node activations represent the probability of each hypothesis being true. In this case, the output vector is a probability distribution, and cross-entropy is used as an indication of the distance between what the network predicts for the distribution and the "actual answer" for the distribution. The equation for categorical cross-entropy in Keras is given below (Gulli & Pal, 2017),

$\mathrm{CE} = -\sum_{i} t_i \log\left(p_i\right)$   (6)

where $t_i$ is the target and $p_i$ refers to the prediction for class $i$.

Another parameter to be selected in compiling the model is the optimiser. The selected optimiser, Adam, is a gradient-based optimisation method for stochastic objective functions. It works on the basis of lower-order moment estimation. Unlike classical stochastic gradient descent, which maintains a single learning rate for all weight adjustments during the entire training process, Adam adapts the learning rate for each parameter by estimating the first and second moments of the gradient (Kingma & Ba, 2014). Kingma and Ba (2014) also suggested that Adam combines the advantages of the Adaptive Gradient Algorithm and Root Mean Square Propagation: the Adaptive Gradient Algorithm handles sparse gradient problems well, while Root Mean Square Propagation does well on non-stationary problems, and Adam possesses both advantages. Adam is the most appropriate choice of optimiser for the proposed model for the following reasons: it is computationally efficient and hence has a low memory requirement; it is well-suited for handling problems with large amounts of data; and it is capable of managing dynamic objectives as well as problems with a lot of noise.
Second, the Lambda layer in the middle is a k-means clustering layer. The algorithm proposed by Vassilvitskii (2007) assigns N data points to 1 of the K clusters; the pseudo-code of the k-means clustering algorithm is shown in Algorithm 1. k-means clustering is chosen for the second layer since it is an efficient clustering method for handling multi-class classification. With supervised and unsupervised learning in the same model, the model can exploit the advantages of both. Furthermore, the k-means clustering compresses the 200 frames to obtain the centre points of the features extracted for model training listed in Table 4. This can accommodate different hand sizes and motion changes within the 200 frames, especially in the relative coordinates between the fingertip, distal, intermediate, proximal and metacarpal bones.
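As a stand-alone illustration of this compression step, the sketch below uses scikit-learn's KMeans to reduce one 200-frame sample to k cluster centres; the value of k is an assumption, as it is not stated here.

# Sketch of compressing one (200, 30) sample into k cluster centres with
# k-means, as a stand-alone illustration of the Lambda-layer idea; k is an
# assumed value, not taken from the paper.
from sklearn.cluster import KMeans

def compress_frames(frames, k=4, seed=0):
    """frames: (200, 30) array of per-frame features -> (k, 30) cluster centres."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(frames)
    return km.cluster_centers_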
Third, the final layer before the output of the result is a Dense layer, which is a classic fully connected layer. A softmax function, the multiclass generalisation of logistic regression, is used as the output function of the network. The log-odds ratios calculated become the probabilities of each class in the multiclass classification. The Dense layer is selected as the final layer to transform the group predictions into class probabilities for output.
Algorithm 1
Algorithm of k-means clustering (Aly, et al., 2019)

1. Randomly choose k initial centres C = {c_1, ..., c_k}
2. repeat
3.   For each i in {1, ..., k}, set the cluster S_i to be the set of points in X that are closer to c_i than to c_j for any j ≠ i.  {Assignment step}
4.   For each i in {1, ..., k}, set c_i to be the centre of mass of all points in S_i: c_i = (1/|S_i|) Σ_{x in S_i} x.  {Means step}
5. until C does not change
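A minimal sketch of the three-layer classifier is given below. Because an in-graph k-means step is not part of the standard Keras layer set, the Lambda layer in this sketch uses temporal mean pooling as a simplified stand-in for the k-means compression described above; the loss and optimiser follow the choices discussed in this section, and the saved file name is an assumption.

# Sketch of the proposed classifier: LSTM (28 units) -> Lambda layer -> Dense
# softmax over the 26 classes, compiled with categorical cross-entropy and Adam.
# The Lambda layer below uses temporal mean pooling as a simplified stand-in
# for the k-means compression of the 200 frames described in the text.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(units=28, frames=200, features=30, n_classes=26):
    model = models.Sequential([
        layers.Input(shape=(frames, features)),
        layers.LSTM(units, return_sequences=True),              # frame-wise LSTM outputs
        layers.Lambda(lambda t: tf.reduce_mean(t, axis=1)),     # stand-in for k-means compression
        layers.Dense(n_classes, activation='softmax'),          # class probabilities
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# Example: train with the parameters selected in Table 5 (80 epochs for the final model)
# model = build_model(units=28)
# history = model.fit(x, y, batch_size=32, epochs=80, validation_split=0.2)
# model.save('asl_lstm_knn.h5')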
3.3. Model validation

Cross-validation, a method that separates the dataset into S folds, is selected. Since the data in the proposed model are neither scarce nor expensive to extract, general 5-fold cross-validation was used, with 80% and 20% of the dataset used for training and validation respectively in each trial (Refaeilzadeh, Tang, & Liu, 2009). First, the dataset was divided into 5 groups (folds), and a total of 5 trials was conducted. For each trial, one of the folds was assigned as the testing set, while the rest were assigned as the training sets. Subsequently, the model was trained with the training sets and validation took place on the testing set. For validation in each trial, the overall accuracy and a confusion matrix for the 26 classes were extracted. The 26-class confusion matrix is further transformed into another matrix containing the true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), as reported in Table 7.

The TP, TN, FP and FN calculated for each class can be used for generating the accuracy (ACC), sensitivity (Se) and specificity (Sp) for each class. Accuracy refers to the ability of the model to correctly identify instances. Sensitivity is the proportion of "real" positives that are accurately identified as positives, while specificity is the proportion of "real" negatives that are correctly identified as negatives by the model. The equations of accuracy, sensitivity and specificity are expressed in terms of TP, TN, FP and FN as follows. TP, TN, FP and FN can also be used for generating the Matthews correlation coefficient (MCC) (Boughorbel, Jarray, & El-Anbari, 2017), the Fowlkes-Mallows index (FM) (Campello, 2007) and the Bookmaker informedness (BM) (Fluss, Faraggi, & Reiser, 2005) to establish the statistical significance for each class. MCC is used for measuring the agreement between the observed and predicted binary classifications (Boughorbel, et al., 2017). FM is used for measuring the similarity between the observed and predicted binary classifications (Campello, 2007). BM is used for estimating the probability of an informed decision (Fluss, et al., 2005).
$ACC = \frac{TP + TN}{TP + TN + FP + FN}$   (7)

$Se = \frac{TP}{TP + FN}$   (8)

$Sp = \frac{TN}{TN + FP}$   (9)

$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$   (10)

$FM = \sqrt{\frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}}$   (11)

$BM = Se + Sp - 1$   (12)
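A sketch of the 5-fold procedure and of the per-class metrics in Eqs. (7)-(12), computed from the 26-class confusion matrix, is given below; it assumes the build_model factory sketched in Section 3.2.

# Sketch of 5-fold cross-validation and per-class metrics (Eqs. (7)-(12)),
# derived from the 26-class confusion matrix; assumes the build_model factory
# sketched in Section 3.2.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

def per_class_metrics(cm):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = cm.sum() - tp - fp - fn
    acc = (tp + tn) / (tp + tn + fp + fn)                                  # Eq. (7)
    se = tp / (tp + fn)                                                    # Eq. (8)
    sp = tn / (tn + fp)                                                    # Eq. (9)
    mcc = (tp * tn - fp * fn) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))  # Eq. (10)
    fm = np.sqrt((tp / (tp + fp)) * (tp / (tp + fn)))                      # Eq. (11)
    bm = se + sp - 1.0                                                     # Eq. (12)
    return acc, se, sp, mcc, fm, bm

def cross_validate(x, y, build_model, folds=5, epochs=80, batch_size=32):
    scores = []
    for train_idx, test_idx in KFold(n_splits=folds, shuffle=True, random_state=0).split(x):
        model = build_model()
        model.fit(x[train_idx], y[train_idx], epochs=epochs, batch_size=batch_size, verbose=0)
        pred = np.argmax(model.predict(x[test_idx], verbose=0), axis=1)
        true = np.argmax(y[test_idx], axis=1)
        cm = confusion_matrix(true, pred, labels=np.arange(26))
        scores.append((np.mean(pred == true), per_class_metrics(cm)))      # overall accuracy + per-class metrics
    return scores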
3.4. Dataset and experimental environment

Since there are no public datasets available for ASL training under a gaming environment, we recruited 100 participants to train the algorithms. The 63 females and 37 males recruited were aged 20 to 30 years, and all participants declared that they were right-handed. The dataset is composed of data for the 26 alphabets, with 100 samples for each alphabet collected from the 100 participants; therefore, 2600 samples covering the 26 alphabets were obtained. As the gaming environment targets ASL learning, none of the 100 participants had received any formal ASL training beforehand. Before the data collection, an ASL-experienced person presented the correct ASL hand gesture to each participant several times. Once participants could reproduce the correct ASL hand gestures for the 26 alphabets after this learning stage, they performed their ASL signs while the leap motion controller collected their hand gesture data.
4. Results and discussion

With cross-validation, the comprehensive performance of the model can be evaluated before it is deployed as the real-time sign recognition module of the game. In this section, 5-fold cross-validation was performed and the overall accuracy of the model is estimated to be 91.8%, averaged over the 5 trials. The results are shown in Table 6.

Table 6
Model accuracy

Accuracy (%)   Trial 1   Trial 2   Trial 3   Trial 4   Trial 5   AVG     STDEV
               92.88     92.88     91.15     90.96     91.15     91.80   0.99
Meanwhile, 26-class confusion matrices for the 5 trials were generated and further transformed into matrices of TP, TN, FP and FN, from which accuracy, sensitivity and specificity were calculated. To analyse the results accurately, the average over the 5 trials was taken for the accuracy, sensitivity and specificity of each alphabet, as shown in Table 7. The per-class accuracy and specificity of the model were calculated to be over 98%, which implies that the model has a high probability of correctly identifying negative results in each of the 26 classes; the proportion of accurately identified instances is correspondingly high. Sensitivity attains over 80% for most classes, except for the alphabet signs M, N and S, for which the model has relatively poor chances of identifying positive results. We also compare the results with other well-known methods in ASL classification, including LSTM, SVM and RNN. Readers can refer to the algorithmic structures of LSTM (D. Avola, Bernardi, Cinque, Foresti, & Massaroni, 2019), SVM (Chong & Lee, 2018) and RNN (Danilo Avola, et al., 2018). All the algorithms in the numerical experiments achieve better accuracy results in the classes F, K, V, W and Y. In predicting the other classes, the proposed method outperforms LSTM, SVM and RNN. The average accuracy of the proposed method, LSTM, SVM and RNN is 99.44%, 98.36%, 97.23% and 96.83%, respectively. The proposed method obtained a fairly good statistical prediction of the two-class classification, a greater similarity between the observed and predicted binary classifications and a higher probability of estimating an informed decision compared to LSTM, SVM and RNN. The statistical significance is presented in Table 8. Therefore, we can conclude that the proposed method outperforms LSTM, SVM and RNN in the numerical analysis.
Table 7
Average accuracy, sensitivity and specificity for 26 classes (all values in %)

         Proposed method           LSTM                      SVM                       RNN
Class    ACC     Se      Sp        ACC     Se      Sp        ACC     Se      Sp        ACC     Se      Sp
A        99.92   100.00  99.92     97.96   83.00   98.56     98.35   80.00   99.08     98.19   84.00   98.76
B        99.96   100.00  99.96     97.96   64.00   99.32     97.42   78.00   98.20     97.12   66.00   98.36
C        99.73   95.00   99.92     97.85   88.00   98.24     97.46   62.00   98.88     97.08   64.00   98.40
D        99.42   89.00   99.84     97.85   56.00   99.52     97.58   76.00   98.44     96.96   70.00   98.04
E        99.85   99.00   99.88     97.73   85.00   98.24     95.73   63.00   97.04     95.19   69.00   96.24
F        100.00  100.00  100.00    100.00  100.00  100.00    96.38   58.00   97.92     96.00   40.00   98.24
G        99.77   100.00  99.76     97.73   56.00   99.40     96.69   40.00   98.96     96.62   56.00   98.24
H        99.81   98.00   99.88     97.85   83.00   98.44     97.54   76.00   98.40     96.19   45.00   98.24
I        99.89   97.00   100.00    97.85   61.00   99.32     96.96   63.00   98.32     96.62   56.00   98.24
J        99.89   100.00  99.88     97.88   81.00   98.56     97.04   50.00   98.92     96.88   56.00   98.52
K        100.00  100.00  100.00    100.00  100.00  100.00    96.00   68.00   97.12     96.38   57.00   97.96
L        99.39   93.00   99.64     97.77   88.00   98.16     97.00   50.00   98.88     96.65   48.00   98.60
M        98.35   71.00   99.44     97.77   54.00   99.52     96.77   48.00   98.72     97.00   49.00   98.92
N        98.08   68.00   99.28     97.88   64.00   99.24     97.15   45.00   99.24     96.38   38.00   98.72
O        99.27   92.00   99.56     97.88   83.00   98.48     97.31   86.00   97.76     96.73   73.00   97.68
P        99.85   98.00   99.92     97.88   62.00   99.32     97.35   73.00   98.32     96.73   57.00   98.32
Q        99.35   88.00   99.80     97.85   82.00   98.48     97.15   48.00   99.12     97.19   62.00   98.60
R        98.77   69.00   99.96     97.85   62.00   99.28     97.12   80.00   97.80     96.19   71.00   97.20
S        98.50   78.00   99.32     97.88   81.00   98.56     97.23   55.00   98.92     96.38   43.00   98.52
T        98.08   87.00   98.52     97.88   64.00   99.24     97.85   65.00   99.16     96.58   51.00   98.40
U        98.69   98.00   98.72     98.04   74.00   99.00     98.00   80.00   98.72     96.81   48.00   98.76
V        100.00  100.00  100.00    100.00  100.00  100.00    97.19   67.00   98.40     96.27   54.00   97.96
W        100.00  100.00  100.00    100.00  100.00  100.00    97.46   59.00   99.00     96.35   54.00   98.04
X        99.54   99.00   99.56     98.04   75.00   98.96     97.31   62.00   98.72     98.00   69.00   99.16
Y        100.00  100.00  100.00    100.00  100.00  100.00    97.27   63.00   98.64     98.50   79.00   99.28
Z        99.92   99.00   99.96     100.00  100.00  100.00    98.62   68.00   99.84     98.69   71.00   99.80
Avg      99.44   93.00   99.72     98.36   78.69   99.15     97.23   63.96   98.56     96.83   58.85   98.35
StdEv    0.64    10.26   0.39      0.92    15.76   0.63      0.62    12.41   0.64      0.78    12.10   0.67
Table 8
Statistical significance for 26 classes (all values in %)

         Proposed method           LSTM                      SVM                       RNN
Class    MCC     FM      BM        MCC     FM      BM        MCC     FM      BM        MCC     FM      BM
A        98.97   99.00   96.12     75.05   76.09   81.56     77.97   78.83   79.08     77.41   78.33   82.76
B        99.48   99.50   96.12     70.09   71.11   63.32     69.03   70.33   76.20     62.31   63.80   64.36
C        96.32   96.46   91.08     75.55   76.59   86.24     64.05   65.35   60.88     61.24   62.76   62.40
D        92.07   92.37   85.07     66.90   67.91   55.52     69.62   70.87   74.44     62.61   64.17   68.04
E        97.70   97.78   95.34     73.72   74.84   83.24     51.68   53.82   60.04     51.76   54.04   65.24
F        100.00  100.00  96.15     100.00  100.00  100.00    53.42   55.30   55.92     41.59   43.64   38.24
G        97.22   97.33   95.62     65.37   66.46   55.40     47.63   49.24   38.96     54.24   56.00   54.24
H        97.34   97.44   94.09     74.06   75.14   81.44     69.30   70.56   74.40     45.73   47.70   43.24
I        98.35   98.41   93.19     68.00   69.07   60.32     59.90   61.48   61.32     54.24   56.00   54.24
J        99.25   99.32   91.54     73.80   74.88   79.56     55.49   56.98   48.92     56.46   58.07   54.52
K        100.00  100.00  96.15     100.00  100.00  100.00    55.48   57.47   65.12     52.97   54.85   54.96
L        91.60   91.92   88.74     74.94   76.02   86.16     55.10   56.61   48.88     50.98   52.69   46.60
M        76.17   77.01   66.62     65.44   66.47   53.52     52.03   53.67   46.72     54.71   56.21   47.92
N        72.35   73.33   63.46     69.18   70.25   63.24     54.91   56.25   44.24     43.63   45.42   36.72
O        90.27   90.65   87.73     74.39   75.45   81.48     70.89   72.17   83.76     62.14   63.78   70.68
P        97.81   97.89   94.16     68.70   69.76   61.32     66.71   68.07   71.32     55.59   57.29   55.32
Q        90.83   91.16   83.88     73.76   74.86   80.48     55.98   57.37   47.12     61.49   62.95   60.60
R        81.93   82.47   65.12     68.24   69.32   61.28     67.43   68.85   77.80     57.91   59.79   68.20
S        79.25   80.03   73.50     73.80   74.88   79.56     59.33   60.74   53.92     46.24   48.08   41.52
T        76.98   77.93   81.64     69.18   70.25   63.24     68.99   70.09   64.16     51.69   53.46   49.40
U        86.21   86.83   92.76     73.35   74.37   73.00     74.56   75.59   78.72     52.39   54.00   46.76
V        100.00  100.00  96.15     100.00  100.00  100.00    63.31   64.77   65.40     50.76   52.70   51.96
W        100.00  100.00  96.15     100.00  100.00  100.00    63.08   64.37   58.00     51.31   53.21   52.04
X        93.77   94.00   94.92     73.61   74.63   73.96     62.55   63.95   60.72     71.70   72.73   68.16
Y        100.00  100.00  96.15     100.00  100.00  100.00    62.55   63.97   61.64     79.43   80.21   78.28
Z        99.03   99.07   94.88     100.00  100.00  100.00    79.51   80.14   67.84     80.83   81.44   70.80
Avg      92.80   93.07   88.71     77.97   78.78   77.84     62.71   64.11   62.52     57.36   58.97   57.20
On the other hand, model accuracy was assumed to be significantly below expectation with 40 epochs of training, and thus 80 epochs of training was selected. To evaluate the suitability of this selection, model accuracy over epochs was plotted, as shown in Fig. 11. As can be observed, the accuracy of the model increases with the number of epochs, and the curve for the testing set eventually flattens between the 70th and 80th epochs. Correspondingly, the model loss decreases significantly in the first 20 epochs; the decrease subsequently narrows but continues, as shown in Fig. 12. The loss curve for the testing set eventually flattens just before the 80th epoch. Thus, training the model for 80 epochs is shown to be close to optimal.

Fig. 11. Model accuracy over epochs

Fig. 12. Model loss over epochs
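For reference, curves such as those in Figs. 11 and 12 can be produced from the Keras training history as sketched below; the exact metric key name ('accuracy' versus 'acc') depends on the Keras version.

# Sketch of plotting training/validation accuracy and loss over epochs from a
# Keras History object, as in Figs. 11 and 12; the 'accuracy' key name depends
# on the Keras version ('acc' in older releases).
import matplotlib.pyplot as plt

def plot_history(history):
    for key, title in (('accuracy', 'Model accuracy over epochs'),
                       ('loss', 'Model loss over epochs')):
        plt.figure()
        plt.plot(history.history[key], label='train')
        plt.plot(history.history['val_' + key], label='test')
        plt.xlabel('Epoch')
        plt.ylabel(key)
        plt.title(title)
        plt.legend()
    plt.show()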
The proposed work using an RNN with 26 alphabets is compared with other literature proposing sign recognition systems with the leap motion controller. First, it is observed that the proposed work has generally stronger performance than approaches that previously employed SVM. Compared to other models employing neural networks, this work has slightly higher accuracy. It specifically outperforms those that employed SVM as their classification method; this can probably be attributed to the neural network's greater ability to handle large datasets.

In this research, we considered the leap motion controller for ASL recognition. Compared to the image processing approach, the leap motion controller offers quick hand gesture detection and captures changes of hand gestures in real time with less computational power. Image processing using conventional cameras may require a high-level computer specification. In contrast, the leap motion controller does not require a high-level computer specification; most hand gestures and motion are detected using the infrared LEDs and cameras and output to the computer unit for secondary processing. The primary restriction of the leap motion controller is that the exposed region is seen from a frog's-eye (bottom-up) view, as the leap motion controller must be placed on a surface. One may consider an integrated approach using the leap motion controller and conventional cameras from different angles to achieve better classification accuracy using agent-based modelling.
5. Concluding remarks

Sign recognition in real-life applications is challenging due to the requirements of accuracy, robustness and efficiency. This project explored the viability of a real-time sign recognition system embedded in an ASL learning application. The proposed system involves the classification of 26 ASL alphabets with 30 selected features for the training of the model. The RNN model is selected since the dynamic signs J and Z require the processing of sequences of input. The overall accuracy of the model in the proposed work is 91.8%, which sufficiently indicates the reliability of the approach for American Sign Language recognition. Moreover, the leap motion controller is a feasible and accurate device for ASL sign recognition. A significant amount of previous research has proposed sign recognition systems that utilise the leap motion controller; however, very few of them have further developed these systems into educational applications. This work fills this research gap and can subsequently open up more opportunities for teaching other sign languages as well. Furthermore, the learning application can help promote ASL with its attractiveness in interaction and entertainment. In particular, the use of the application in sign instruction in schools is expected to enhance the learning motivation of hearing students in ASL and stimulate communication between hearing and hearing impaired/hard-of-hearing students. Several suggestions are made regarding potential areas of research. A more mature application model can be produced by collecting samples from ASL users and developing more features for training the model in order to accurately classify the signs M, N and S, thereby addressing the low sensitivity of these 3 alphabets caused by thumb features.

The study has several limitations. First, the position, angle and number of users of the leap motion controller will affect the accuracy of the model. The leap motion controller can detect several hand gestures, but the proposed method is restricted to recognising only one hand gesture. The leap motion controller must be kept flat in order to recognise ASL. Second, the present prototype only considers and is trained with samples from the right hand. The samples are expected to be extended to include the left hand, so that the application can also be utilised by left-handed users.

Several future works are presented to foster relevant studies in ASL recognition. First, readers may consider modifications of the algorithmic structure, such as different types of softmax function and different classifiers for ASL recognition. Second, the current method is limited to the leap motion controller. Readers may realise other ASL recognition methods, including image processing, video processing and deep learning approaches. An integrated approach with the leap motion controller could achieve better computational accuracy using agent-based modelling. Third, the contactless approaches using hand gesture and motion detection can also be extended to other expert system and engineering applications in interaction design.
References

Ahmed, M. A., Zaidan, B. B., Zaidan, A. A., Salih, M. M., & Lakulu, M. M. b. (2018). A review on systems-based sensory gloves for sign language recognition state of the art between 2007 and 2017. Sensors, 18, 2208.
Akyol, K. (2020). Comparing of deep neural networks and extreme learning machines based on growing and pruning approach. Expert Systems with Applications, 140, 112875.
Aly, W., Aly, S., & Almotairi, S. (2019). User-Independent American Sign Language Alphabet Recognition Based on Depth Image and PCANet Features. IEEE Access, 7, 123138-123150.
Arsalan, M., Kim, D. S., Owais, M., & Park, K. R. (2020). OR-Skip-Net: Outer residual skip network for skin segmentation in non-ideal situations. Expert Systems with Applications, 141, 112922.
Asghari, V., Leung, Y. F., & Hsu, S.-C. (2020). Deep neural network based framework for complex correlations in engineering metrics. Advanced Engineering Informatics, 44, 101058.
Avola, D., Bernardi, M., Cinque, L., Foresti, G. L., & Massaroni, C. (2018). Exploiting recurrent neural networks and leap motion controller for the recognition of sign language and semaphoric hand gestures. IEEE Transactions on Multimedia, 21, 234-245.
Avola, D., Bernardi, M., Cinque, L., Foresti, G. L., & Massaroni, C. (2019). Exploiting Recurrent Neural Networks and Leap Motion Controller for the Recognition of Sign Language and Semaphoric Hand Gestures. IEEE Transactions on Multimedia, 21, 234-245.
Beal-Alvarez, J. S. (2014). Deaf students' receptive and expressive American Sign Language skills: Comparisons and relations. Journal of Deaf Studies and Deaf Education, 19, 508-529.
Bheda, V., & Radpour, D. (2017). Using deep convolutional networks for gesture recognition in American sign language. arXiv preprint arXiv:1710.06836.
Boughorbel, S., Jarray, F., & El-Anbari, M. (2017). Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PloS one, 12, e0177678.
Campello, R. J. G. B. (2007). A fuzzy extension of the Rand index and other related indexes for clustering and classification assessment. Pattern Recognition Letters, 28, 833-841.
Chong, T.-W., & Lee, B.-G. (2018). American sign language recognition using leap motion controller with machine learning approach. Sensors, 18, 3554.
Chuan, C.-H., Regina, E., & Guardino, C. (2014). American sign language recognition using leap motion sensor. In 2014 13th International Conference on Machine Learning and Applications (pp. 541-544): IEEE.
Ciaramello, F. M., & Hemami, S. S. (2011). A Computational Intelligibility Model for Assessment and Compression of American Sign Language Video. IEEE Transactions on Image Processing, 20, 3014-3027.
Du, Y., Liu, S., Feng, L., Chen, M., & Wu, J. (2017). Hand Gesture Recognition with Leap Motion. arXiv preprint arXiv:1711.04293.
Elboushaki, A., Hannane, R., Afdel, K., & Koutti, L. (2020). MultiD-CNN: A multi-dimensional feature learning approach based on deep convolutional networks for gesture recognition in RGB-D image sequences. Expert Systems with Applications, 139, 112829.
Fluss, R., Faraggi, D., & Reiser, B. (2005). Estimation of the Youden Index and its Associated Cutoff Point. Biometrical Journal, 47, 458-472.
Fujiwara, E., Santos, M. F. M. d., & Suzuki, C. K. (2014). Flexible Optical Fiber Bending Transducer for Application in Glove-Based Sensors. IEEE Sensors Journal, 14, 3631-3636.
Goyal, P., Pandey, S., & Jain, K. (2018). Deep Learning for Natural Language Processing: Creating Neural Networks with Python: Apress.
Gulli, A., & Pal, S. (2017). Deep learning with Keras: implement neural networks with Keras on Theano and TensorFlow: Packt Publishing.
Guna, J., Jakus, G., Pogačnik, M., Tomažič, S., & Sodnik, J. (2014). An analysis of the precision and reliability of the leap motion sensor and its suitability for static and dynamic tracking. Sensors, 14, 3702-3720.
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9, 1735-1780.
Huenerfauth, M., & Lu, P. (2010). Accurate and accessible motion-capture glove calibration for sign language data collection. ACM Transactions on Accessible Computing (TACCESS), 3, 2.
Jeong, S., Ferguson, M., Hou, R., Lynch, J. P., Sohn, H., & Law, K. H. (2019). Sensor data reconstruction using bidirectional recurrent neural network with application to bridge monitoring. Advanced Engineering Informatics, 42, 100991.
Kamnardsiri, T., Hongsit, L.-o., Khuwuthyakorn, P., & Wongta, N. (2017). The Effectiveness of the Game-Based Learning System for the Improvement of American Sign Language Using Kinect. Electronic Journal of e-Learning, 15, 283-296.
Khelil, B., Amiri, H., Chen, T., Kammüller, F., Nemli, I., & Probst, C. (2016). Hand gesture recognition using leap motion controller for recognition of Arabic sign language. Lect. Notes Comput. Sci.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Lee, B. G., & Lee, S. M. (2018). Smart Wearable Hand Device for Sign Language Interpretation System With Sensors Fusion. IEEE Sensors Journal, 18, 1224-1232.
Lee, H., Li, G., Rai, A., & Chattopadhyay, A. (2020). Real-time anomaly detection framework using a support vector regression for the safety monitoring of commercial aircraft. Advanced Engineering Informatics, 44, 101071.
Liddell, S. K., & Johnson, R. E. (1989). American sign language: The phonological base. Sign Language Studies, 64, 195-277.
Liu, H., Yu, C., Yu, C., Chen, C., & Wu, H. (2020). A novel axle temperature forecasting method based on decomposition, reinforcement learning optimization and neural network. Advanced Engineering Informatics, 44, 101089.
Luzanin, O., & Plancak, M. (2014). Hand gesture recognition using low-budget data glove and cluster-trained probabilistic neural network. Assembly Automation, 34, 94-105.
Marschark, M., & Spencer, P. E. (2010). The Oxford Handbook of Deaf Studies, Language, and Education (Vol. 2): Oxford University Press.
Naglot, D., & Kulkarni, M. (2016). Real time sign language recognition using the leap motion controller. In 2016 International Conference on Inventive Computation Technologies (ICICT) (Vol. 3, pp. 1-5): IEEE.
Naidu, C., & Ghotkar, A. (2016). Hand Gesture Recognition Using Leap Motion Controller. International Journal of Science and Research, 5.
Napier, J., & Leeson, L. (2016). Sign language in action. In Sign Language in Action (pp. 50-84): Springer.
Napier, J., Leigh, G., & Nann, S. (2007). Teaching sign language to hearing parents of deaf children: An action research process. Deafness & Education International, 9, 83-100.
Oz, C., & Leu, M. C. (2007). Linguistic properties based on American Sign Language isolated word recognition with artificial neural networks using a sensory glove and motion tracker. Neurocomputing, 70, 2891-2901.
Oz, C., & Leu, M. C. (2011). American Sign Language word recognition with a sensory glove using artificial neural networks. Engineering Applications of Artificial Intelligence, 24, 1204-1213.
Parreño, M. A., Celi, C. J., Quevedo, W. X., Rivas, D., & Andaluz, V. H. (2017). Teaching-learning of basic language of signs through didactic games. In Proceedings of the 2017 9th International Conference on Education Technology and Computers (pp. 46-51): ACM.
Paudyal, P., Lee, J., Banerjee, A., & Gupta, S. K. (2019). A Comparison of Techniques for Sign Language Alphabet Recognition Using Armband Wearables. ACM Transactions on Interactive Intelligent Systems (TiiS), 9, 14.
Pontes, H. P., Duarte, J. B. F., & Pinheiro, P. R. (2018). An educational game to teach numbers in Brazilian Sign Language
27
while having fun. Computers in Human Behavior.
28
Rastgoo, R., Kiani, K., & Escalera, S. (2020). Hand sign language recognition using multi-view hand skeleton. Expert
29
Systems with Applications, 150, 113336.
30
Refaeilzadeh, P., Tang, L., & Liu, H. (2009). Cross-Validation. In L. Liu & M. T. ÖZsu (Eds.), Encyclopedia of Database
31
Systems (pp. 532-538). Boston, MA: Springer US.
32
Rojas, R. (1996). Neural networks-a systematic introduction springer-verlag. New York.
33
Souza, A. M. (2015). As Tecnologias da Informação e da Comunicação (TIC) na educação para todos. Educação em Foco,
34
349-366.
35
Starner, T., Weaver, J., & Pentland, A. (1998). Real-time American sign language recognition using desk and wearable
36
computer based video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1371-1375.
37
Sun, C., Zhang, T., Bao, B., Xu, C., & Mei, T. (2013). Discriminative Exemplar Coding for Sign Language Recognition
38
With Kinect. IEEE Transactions on Cybernetics, 43, 1418-1428.
39
Tao, W., Leu, M. C., & Yin, Z. (2018). American Sign Language alphabet recognition using Convolutional Neural Networks
40
with multiview augmentation and inference fusion. Engineering Applications of Artificial Intelligence, 76, 202-
41
213.
42
Tubaiz, N., Shanableh, T., & Assaleh, K. (2015). Glove-Based Continuous Arabic Sign Language Recognition in User-
43
Dependent Mode. IEEE Transactions on Human-Machine Systems, 45, 526-533.
44
Valente, J. M., & Maldonado, S. (2020). SVR-FFS: A novel forward feature selection approach for high-frequency time
45
series forecasting using support vector regression. Expert Systems with Applications, 160, 113729.
46
Vassilvitskii, S. (2007). K-Means: algorithms, analyses, experiments: Stanford University.
47
Wu, J., & Jafari, R. (2017). Wearable Computers for Sign Language Recognition. In S. U. Khan, A. Y. Zomaya & A. Abbas
48
(Eds.), Handbook of Large-Scale Distributed Computing in Smart Healthcare (pp. 379-401). Cham: Springer
49
International Publishing.
50
Wu, J., Sun, L., & Jafari, R. (2016). A Wearable System for Recognizing American Sign Language in Real-Time Using
51
IMU and Surface EMG Sensors. IEEE Journal of Biomedical and Health Informatics, 20, 1281-1290.
52
Wu, J., Tian, Z., Sun, L., Estevez, L., & Jafari, R. (2015). Real-time American Sign Language Recognition using wrist-
53
worn motion and surface EMG sensors. In 2015 IEEE 12th International Conference on Wearable and
54
Implantable Body Sensor Networks (BSN) (pp. 1-6).
55
Zhong, B., Xing, X., Luo, H., Zhou, Q., Li, H., Rose, T., & Fang, W. (2020). Deep learning-based extraction of construction
56
procedural constraints from construction regulations. Advanced Engineering Informatics, 43, 101003.
57
58