International Journal of Research Publication and Reviews, Vol. 2, Issue 5 (2021), Pages 9-17
Journal homepage: www.ijrpr.com    ISSN 2582-7421
Real-time Vernacular Sign Language Recognition using MediaPipe and Machine Learning

Arpita Halder (a), Akshit Tayade (b)

(a) Undergraduate Student, Computer Science and Engineering Department, Budge-Budge Institute of Technology, India. arpitahalder739@gmail.com
(b) Undergraduate Student, Electronics and Telecommunication Department, K.J Somaiya College of Engineering, India. tayadeakshit28@yahoo.com

* Corresponding author. E-mail address: arpitahalder739@gmail.com
ABSTRACT

The deaf-mute community faces undeniable communication problems in daily life. Recent developments in artificial intelligence help tear down this communication barrier. The main purpose of this paper is to demonstrate a methodology that simplifies Sign Language Recognition using MediaPipe's open-source framework and a machine learning algorithm. The predictive model is lightweight and adaptable to smart devices. Multiple sign language datasets, such as American, Indian, Italian and Turkish, are used for training to analyze the capability of the framework. With an average accuracy of 99%, the proposed model is efficient, precise and robust. Real-time, accurate detection with the Support Vector Machine (SVM) algorithm, without any wearable sensors, makes the technology more comfortable and easier to use.

Keywords: Machine Learning, Sign Language Recognition, MediaPipe, Feature extraction, Hand gesture
1. Introduction
Sign language significantly facilitates communication in the deaf community. It is a language in which communication is based on visual sign patterns used to express one's thoughts and feelings. A communication gap arises when the deaf community wants to share its views with people who can hear and speak. Currently, the two communities mostly rely on human translators, which can be expensive and inconvenient. With developments in deep learning and computer vision, researchers have built various automatic sign language recognition methods that can interpret sign gestures in an understandable way. This narrows the communication gap between hearing-impaired and hearing people, and it empowers deaf-mute people to stand with equal opportunity and improve personal growth.
According to the report of the World Federation of the Deaf (WFD), over 5% of the world's population (360 million people) has hearing impairment, including 328 million adults and 32 million children, and approximately 300 sign languages are in use around the globe. Sign language recognition is a challenging task because the alphabets differ between sign languages; for instance, the American Sign Language (ASL) alphabet varies widely from the Indian or Italian Sign Language alphabets, so sign language varies from region to region. Moreover, articulation with a single hand as well as both hands is used to convey meaningful messages. Sign language can be expressed in a compressed form, where a single gesture is sufficient to describe a word, but it also has fingerspelling to describe each letter of a word using a sign corresponding to that particular letter. As many words are still not standardized in sign language dictionaries, fingerspelling is often used to manifest a word: there are still about 150,000 words in spoken English with no counterpart in ASL, and names of people, places, brands or titles have no standardized sign symbol. Besides, a user might not know the exact sign of a particular word, and in this scenario fingerspelling comes in handy and any word can easily be described.
Previous works included sensor-based Sign Language Recognition (SLR) systems, which were quite uncomfortable and restrictive for signers. Specialized hardware such as sensors [1], [2] was used, which was also an expensive option. Computer vision-based techniques, in contrast, use bare hands without any sensors or coloured gloves. Because only a single camera is needed, computer vision-based techniques are more cost-effective and more portable than sensor-based ones. In computer vision-based methods, the most common approaches for hand tracking are skin colour detection or background subtraction. Computer vision-based SLR systems often rely on feature extraction, for example boundary modelling, contours, segmentation of gestures and estimation of hand shapes. However, these solutions are not lightweight enough to run on real-time devices like mobile phones and are thus restricted to platforms equipped with robust processors. Moreover, the challenge of hand tracking remained persistent in all these techniques. To address this drawback, our proposed methodology uses Google's innovative, rapidly growing, open-source MediaPipe project together with a machine learning algorithm on top of this framework, yielding a faster, simpler, cost-effective, portable and easy-to-deploy pipeline that can be used as a sign language recognition system.
2. Related Works
Hand gesture recognition is a relatively difficult problem to address in the field of machine learning. Classification methods can be divided into supervised and unsupervised methods, and based on them an SLR system can recognize static or dynamic hand gestures. Murakami and Taguchi [3], in 1991, published the first research article using neural networks for sign language recognition. With the development of computer vision, numerous researchers came up with novel approaches to help the physically challenged community. Using coloured gloves, a real-time hand-tracking application was developed by Wang and Popovic [4]; the colour pattern of the gloves was recognized with the K-Nearest Neighbours (KNN) technique, but the system requires a continuous feed of hand streams. However, the Support Vector Machine (SVM) outperformed this algorithm in the research findings of Rekha et al. [5], Kurdyumov et al. [6], Tharwat et al. [7] and Baranwal and Nandi [8]. There are two types of sign language recognition: isolated sign recognition and continuous sentence recognition. Likewise, whole-sign-level modelling and subunit-sign-level modelling exist in SLR systems, and visual-descriptive and linguistic-oriented approaches lead to subunit-level sign modelling. Elakkiya et al. [9] combined SVM learning and a boosting algorithm to propose a framework for subunit recognition of alphabets; an accuracy of 97.6% was obtained, but the system fails to predict 26 alphabets. To extract features of 23 isolated Arabic sign language gestures, Ahmed and Aly [10] used a combination of PCA and local binary patterns; despite an accuracy of 99.97% in signer-dependent mode, the use of a threshold operator means the system fails to recognize constant grey-scale patterns in the signing area. In most of the initial attempts, a conventional convolutional network is used to detect hand gestures from image frames. R. Sharma et al. [11] used 80,000 individual numeric signs, with more than 500 pictures per sign, to train a machine learning model. Their methodology comprises a training database of pre-processed images for a hand-detection system and a gesture recognition system. Image pre-processing included feature extraction to normalize the input information before training the machine learning model: the images are converted to grayscale for better object contours while maintaining a standardized resolution, and then flattened into a smaller number of one-dimensional components. This feature extraction helps to extract certain properties of the pixel data and feed them to a CNN for easier training and more accurate prediction. Hand tracking in 2D and 3D space has been performed by W. Liu et al. [12]; they used skin saliency, where skin tones within a specific range were extracted for better feature extraction, and achieved a classification accuracy of around 98%.
It is evident from all these previous methods that recognizing hand gestures precisely and with high accuracy requires a large dataset and a complicated methodology with complex mathematical processing, and that pre-processing of images plays a vital role in the gesture-tracking process. Therefore, for our project, we used an open-source framework from Google known as MediaPipe, which is capable of detecting human body parts accurately.
3. Dataset
Table 1: Details of different sign language fingerspelling datasets used in this work
Database    Type        No. of images
American    Alphabets   156000
Indian      Alphabets   4972
Italian     Alphabets   12856
American    Numbers     1400
Turkey      Numbers     4124

(The "Image Samples" column of the original table contained example images and is omitted here.)
4. Architecture
Figure 1: Proposed architecture to detect hand gestures and predict sign language fingerspellings
4.1 Stage 1: Pre-processing of Images to Get Multi-hand Landmarks Using MediaPipe
MediaPipe is a framework that enables developers to build multi-modal (video, audio, any time-series data), cross-platform applied ML pipelines. MediaPipe has a large collection of human body detection and tracking models, trained on Google's massive and highly diverse datasets. As a skeleton of nodes and edges, or landmarks, they track key points on different parts of the body, and all coordinate points are three-dimensional and normalized. The models are built by Google developers using TensorFlow Lite, and the flow of information is easily adaptable and modifiable via graphs. MediaPipe pipelines are composed of nodes on a graph, generally specified in a pbtxt file; these nodes are connected to C++ files, and the base calculator class in MediaPipe is an expansion upon these files. This class receives contracts of media streams, such as a video stream, from other nodes in the graph and ensures that it is connected. Once the rest of the pipeline's nodes are connected, the class generates its own processed output data. Packet objects encapsulating many different types of information are used to send each stream of information to each calculator. Side packets can also be imposed into a graph, so that a calculator node can be supplied with auxiliary data such as constants or static properties. This simplified dataflow structure makes additions or modifications to the pipeline easy and the flow of data more precisely controllable.
The hand-tracking solution [13] has an ML pipeline at its backend consisting of two models working together: a) a Palm Detection Model and b) a Hand Landmark Model. The Palm Detection Model provides an accurately cropped palm image, which is then passed on to the landmark model. This process diminishes the need for the data augmentation (rotations, flipping, scaling) that is done in deep learning models and dedicates most of the computation to landmark localization. The traditional way is to detect the hand in the frame and then perform landmark localization over the current frame, but this ML pipeline tackles the challenge with a different strategy. Detecting hands is a complex procedure, since one has to perform image processing and thresholding and handle a variety of hand sizes, which is time-consuming. Instead of directly detecting the hand in the current frame, a palm detector is first trained to estimate bounding boxes around rigid objects like palms and fists, which is simpler than detecting hands with coupled fingers. Secondly, an encoder-decoder is used as an extractor for larger scene context.
Figure 2: 21 Hand Landmarks
After palm detection has been run over the whole image frame, the subsequent Hand Landmark model comes into the picture. This model precisely localizes 21 3D hand-knuckle coordinates (x, y, z) inside the detected hand regions. The model is so well trained and robust that it even maps coordinates to partially visible hands. Figure 2 shows the 21 landmark points detected by the Hand Landmark model.

With a functional palm and hand detection model running, the model is passed over our datasets of the various languages. Considering the American Sign Language dataset, we have the alphabets a to z. We pass the detection model over every alphabet folder containing images and perform hand detection, which yields the 21 landmark points shown in Figure 2. The obtained landmark points are then stored in a CSV file. A simultaneous elimination step is performed while extracting the landmark points: only the x and y coordinates detected by the Hand Landmark model are kept for training the ML model. Depending on the size of the dataset, landmark extraction takes around 10-15 minutes.
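The landmark-extraction step described above could look roughly as follows in Python. This is a hedged sketch rather than the authors' actual script: the dataset folder layout (one sub-folder per sign class), the file names and the single-hand assumption are illustrative; only the use of MediaPipe Hands, the 21 landmarks, the discarded z coordinate and the CSV output come from the paper.

import csv
import os

import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def extract_landmarks(dataset_dir, out_csv):
    # Iterate over dataset/<label>/*.jpg, run MediaPipe Hands on each image and
    # write one CSV row per detected hand: label, x0, y0, ..., x20, y20.
    with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands, \
         open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["label"] + [f"{axis}{i}" for i in range(21) for axis in ("x", "y")])
        for label in sorted(os.listdir(dataset_dir)):
            label_dir = os.path.join(dataset_dir, label)
            for name in os.listdir(label_dir):
                image = cv2.imread(os.path.join(label_dir, name))
                if image is None:
                    continue
                result = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
                if not result.multi_hand_landmarks:
                    continue  # undetected (e.g. blurry) hands are skipped; nulls are cleaned in Stage 2
                row = [label]
                for point in result.multi_hand_landmarks[0].landmark:
                    row += [point.x, point.y]  # normalized x and y; z is dropped as in the paper
                writer.writerow(row)

extract_landmarks("dataset/asl_alphabets", "asl_landmarks.csv")  # hypothetical paths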
4.2 Stage 2: Data Cleaning and Normalization
As only the x and y coordinates are taken from the detector in Stage 1, each image in the dataset is passed through Stage 1 to collect all the data points in one file. This file is then scanned with pandas library functions to check for null entries. Sometimes, due to a blurry image, the detector cannot detect the hand, which leads to a null entry in the dataset. It is necessary to clean these points, or they will bias the predictive model; rows containing null entries are located by their indexes and removed from the table. After the removal of unwanted points, we normalize the x and y coordinates to fit our system. The data file is then split into training and validation sets: 80% of the data is retained for training the model with various optimization and loss functions, whereas 20% is reserved for validating the model.
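A minimal sketch of this cleaning and splitting stage, assuming the CSV layout from the Stage 1 sketch above. The choice of MinMaxScaler for normalization and the stratified split are assumptions, since the paper only states that the coordinates are normalized and the data is split 80/20.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("asl_landmarks.csv")
df = df.dropna()                                   # drop null entries from undetected/blurry hands

X = df.drop(columns=["label"]).values              # 42 features: x0, y0, ..., x20, y20
y = df["label"].values

scaler = MinMaxScaler()
X = scaler.fit_transform(X)                        # normalize the x and y coordinates

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)  # 80% train / 20% validation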
4.3 Stage 3: Prediction Using a Machine Learning Algorithm
Predictive analysis of the different sign languages is performed using machine learning algorithms, and the Support Vector Machine (SVM) outperformed the other algorithms; the details of the analysis are discussed in Table 2 in the results section. SVM is effective in high-dimensional spaces and performs well when the number of samples is greater than the number of dimensions. SVMs are a family of supervised learning methods capable of classification, regression and outlier detection.
The following optimization problem is tackled by SVMs:

\min_{w, b, \zeta} \ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \zeta_i \qquad (1)

\text{subject to } y_i \left( w^T \phi(x_i) + b \right) \geq 1 - \zeta_i, \quad \zeta_i \geq 0, \ i = 1, \dots, n \qquad (2)

In equations (1) and (2), \zeta_i denotes the distance to the correct margin, with \zeta_i \geq 0, i = 1, \dots, n; C denotes a regularization parameter; w denotes the normal vector; \phi(x_i) denotes the transformed input space vector; b denotes a bias parameter; and y_i denotes the i-th target value.
The objective is to classify as many data points correctly as possible by maximizing the margin from the support vectors to the hyperplane while minimizing the term w^T w. The kernel function used is the RBF (radial basis function), which maps the input space into a higher-dimensional space without every data point being explicitly mapped. SVM works relatively well when there is a clear margin of separation between classes, so we used SVM to classify the multiple classes of sign language alphabets and numerals.
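Continuing the sketch, this classification stage could be implemented with scikit-learn's SVC. The hyper-parameters shown are library defaults, not values reported in the paper.

from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

clf = SVC(kernel="rbf", C=1.0, gamma="scale")      # RBF kernel as described above
clf.fit(X_train, y_train)

y_pred = clf.predict(X_val)
print("Validation accuracy:", accuracy_score(y_val, y_pred))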
4.4 Quantitative Analysis
To analyze the results for each of the datasets, we used performance metrics such as accuracy, precision, recall and F1 score. Accuracy is the number of correctly predicted data points out of all the data points; it can be calculated as the ratio of correct predictions to the total number of items, as shown in equation (3).

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (3)

\text{Precision} = \frac{TP}{TP + FP} \qquad (4)

\text{Recall} = \frac{TP}{TP + FN} \qquad (5)

Precision describes how accurate the model is: out of the predicted positives, how many are actually positive. Precision is a good measure when the cost of a false positive is high. Recall calculates how many of the actual positives the model captures by labelling them as positive; it is the metric to select when there is a high cost associated with false negatives. The mathematical formulations of precision and recall are given in equations (4) and (5), respectively.

F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (6)

The F-measure in equation (6) provides a way to combine precision and recall into a single measure that captures both properties and is used to handle imbalanced classification. The confusion matrix was also analyzed to better understand the types of errors made by the classifier; in a confusion matrix, the numbers of correct and incorrect predictions are summarized with count values and broken down by class.
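These metrics and the confusion matrix could be computed with scikit-learn as sketched below. The macro averaging over classes is an assumption, as the paper does not state which averaging scheme was used.

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

print("Accuracy :", accuracy_score(y_val, y_pred))
print("Precision:", precision_score(y_val, y_pred, average="macro"))
print("Recall   :", recall_score(y_val, y_pred, average="macro"))
print("F1 score :", f1_score(y_val, y_pred, average="macro"))
print(confusion_matrix(y_val, y_pred))             # per-class breakdown of correct/incorrect predictions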
5. Results and Discussion
K-fold cross-validation was performed on each dataset with ten folds. The average accuracy over the ten iterations for the different algorithms is shown in Table 2. It can be observed from these accuracies that SVM outperformed other machine learning algorithms such as KNN, Random Forest, Decision Tree and Naive Bayes, and also achieved higher accuracy than deep learning algorithms such as the Artificial Neural Network (ANN) and Multi-Layer Perceptron (MLP). A sketch of this comparison is given below.
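Only the algorithm names and the ten-fold protocol come from the paper; the particular scikit-learn estimators, their settings, and the MLP standing in for both neural models are assumptions.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "SVM": SVC(kernel="rbf"),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Naive Bayes": GaussianNB(),
    "MLP": MLPClassifier(max_iter=1000),           # stands in for the ANN/MLP rows of Table 2
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.4f}")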
Table 2: Average accuracy obtained using machine learning and deep learning algorithms.

Dataset             SVM      KNN      Random Forest  Decision Tree  Naive Bayes  ANN      MLP
ASL (alphabet)      99.15%   98.21%   98.57%         98.57%         53.74%       97.12%   94.69%
Indian (alphabet)   99.29%   98.87%   98.59%         98.59%         86.77%       94.79%   96.48%
Italian (alphabet)  98.19%   96.75%   97.83%         97.83%         77.19%       78.63%   72.14%
ASL (numbers)       99.18%   99.18%   97.56%         97.56%         96.74%       95.12%   97.56%
Turkey (numbers)    96.22%   93.08%   94.33%         94.33%         83.64%       93.71%   83.64%
SVM achieves the highest accuracy for each of the sign language datasets in the above table.
For exhaustive testing, each sign language image dataset is pre-processed to extract features using the MediaPipe framework and trained with a Support Vector Machine to classify gestures correctly. An accuracy of 99% is achieved for most of the datasets, which outperforms the present state of the art and classifies fingerspellings of sign languages precisely. The maximum accuracy of 99.29% is obtained for the Indian Sign Language alphabet, and the minimum accuracy of 96.22% for Turkish Sign Language number prediction from hand gestures. The testing performance for each dataset is summarized in Table 3, the confusion matrices are illustrated in Figure 3, and Figure 4 demonstrates real-time sign language detection.
Table 3: Performance analysis using the SVM algorithm on different datasets

Dataset name        Training Accuracy  Testing Accuracy  Precision  Recall
ASL (alphabet)      99.50%             99.15%            99.15%     99.15%
Indian (alphabet)   99.92%             99.29%            99.29%     99.29%
Italian (alphabet)  99.72%             98.19%            98.19%     98.19%
Turkey (numbers)    99.37%             96.22%            96.22%     96.22%
American (numbers)  98.77%             99.18%            99.18%     99.18%
The trained model is explicitly lightweight, which makes it appropriate for deployment in a mobile application. Real-time sign language detection makes the methodology fast, robust and adaptable, specifically for smart devices. MediaPipe's state-of-the-art pipeline makes feature extraction easy by breaking down and analyzing complex hand-tracking information, without the need to build a convolutional neural network from scratch. The proposed methodology uses minimal computational power and takes less time to train a model than other state-of-the-art approaches. Table 4 compares the performance of other works in the literature using machine learning or deep learning algorithms with ours.
Table 4: Comparison with other current methods (the highest accuracy for each sign language is given in the "Ours" rows).

Sign Language  Reference                     Type                   Number of classes  Method                  Accuracy
American       P. Das et al. [14]            Alphabets              26                 Deep CNN                94.3%
American       M. Taskiran et al. [15]       Alphabets and Numbers  36                 CNN                     98.05%
American       N. Saquib and A. Rahman [16]  Alphabets              24                 KNN                     96.14%
American       N. Saquib and A. Rahman [16]  Alphabets              24                 Random Forest           96.13%
American       N. Saquib and A. Rahman [16]  Alphabets              24                 ANN                     95.87%
American       N. Saquib and A. Rahman [16]  Alphabets              24                 SVM                     94.91%
American       Ours                          Alphabets              26                 SVM                     99.15%
American       Ours                          Numbers                10                 SVM                     99.18%
Indian         K. K. Dutta et al. [17]       Alphabets              24                 KNN                     94%-96%
Indian         M. Sharma et al. [18]         Numbers                10                 KNN and Neural Network  97.10%
Indian         J. L. Raheja et al. [19]      Alphabets              24                 SVM                     97.5%
Indian         Ours                          Alphabets              26                 SVM                     99.29%
Italian        L. Pigou et al. [20]          Alphabets              20                 CNN                     91.7%
Italian        Ours                          Alphabets              22                 SVM                     98.19%
Figure 3: Confusion matrices for a) American Sign Language (alphabets), b) American Sign Language (numbers), c) Indian Sign Language (alphabets), d) Turkey Sign Language (numbers), e) Italian Sign Language (alphabets)
Figure 4: Real-time American Sign Language recognition. American alphabets: 'S', 'U', 'I' and numbers: '1', '3', '9'
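The real-time demonstration in Figure 4 could be assembled roughly as follows: webcam frames are passed through MediaPipe Hands, the 21 (x, y) landmarks are scaled with the Stage 2 scaler and classified by the Stage 3 SVM, and the predicted sign is overlaid on the frame. The variable names (clf, scaler) follow the earlier sketches and are assumptions, not the authors' code.

import cv2
import mediapipe as mp
import numpy as np

mp_hands = mp.solutions.hands

cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.7) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.multi_hand_landmarks:
            lm = result.multi_hand_landmarks[0].landmark
            features = np.array([[v for p in lm for v in (p.x, p.y)]])
            features = scaler.transform(features)   # same scaler fitted in Stage 2
            sign = clf.predict(features)[0]         # SVM trained in Stage 3
            cv2.putText(frame, str(sign), (30, 60),
                        cv2.FONT_HERSHEY_SIMPLEX, 2, (0, 255, 0), 3)
        cv2.imshow("Sign Language Recognition", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
cap.release()
cv2.destroyAllWindows()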
6. Conclusion
With an average accuracy of 99% on most of the sign language datasets using MediaPipe's technology and machine learning, our proposed methodology shows that MediaPipe can be used efficiently as a tool to detect complex hand gestures precisely. Although sign language modelling using image-processing techniques has evolved over the past few years, the methods are complex, require high computational power and take a long time to train. From that perspective, this work provides new insights into the problem: the low computing requirements and the adaptability to smart devices make the model robust and cost-effective. Training and testing with various sign language datasets show that this framework can be adapted effectively to any regional sign language dataset with maximum accuracy. Faster real-time detection demonstrates the model's efficiency compared with the present state of the art. In the future, the work can be extended by introducing word-level detection of sign language from videos using MediaPipe and the best possible classification algorithms.
REFERENCES
[1] Shukor AZ, Miskon MF, Jamaluddin MH, Bin Ali F, Asyraf MF, Bin Bahar MB. 2015. A new data glove approach for Malaysian sign language detection. Procedia Comput Sci 76:60-67
[2] Almeida SG, Guimarães FG, Ramírez JA. 2014. Feature extraction in Brazilian sign language recognition based on phonological structure and using RGB-D sensors. Expert Syst Appl 41(16):7259-7271
[3] Murakami K, Taguchi H. 1991. Gesture recognition using recurrent neural networks. In: Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, pp 237-242. https://dl.acm.org/doi/pdf/10.1145/108844.108900
[4] Wang RY, Popović J. 2009. Real-time hand-tracking with a color glove. ACM Trans Graph 28(3):63
[5] Rekha J, Bhattacharya J, Majumder S. 2011. Hand gesture recognition for sign language: a new hybrid approach. In: International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV), pp 80-86
[6] Kurdyumov R, Ho P, Ng J. 2011. Sign language classification using webcam images, pp 1-4. http://cs229.stanford.edu/proj2011/KurdyumovHoNg-SignLanguageClassificationUsingWebcamImages.pdf
[7] Tharwat A, Gaber T, Hassanien AE, Shahin MK, Refaat B. 2015. SIFT-based Arabic sign language recognition system. In: Springer Afro-European Conference for Industrial Advancement, pp 359-370. https://doi.org/10.1007/978-3-319-13572-4_30
[8] Baranwal N, Nandi GC. 2017. An efficient gesture based humanoid learning using wavelet descriptor and MFCC techniques. Int J Mach Learn Cybern 8(4):1369-1388
[9] Elakkiya R, Selvamani K, Velumadhava Rao R, Kannan A. 2012. Fuzzy hand gesture recognition based human computer interface intelligent system. UACEE Int J Adv Comput Netw Secur 2(1):29-33 (ISSN 2250-3757)
[10] Ahmed AA, Aly S. 2014. Appearance-based Arabic sign language recognition using hidden Markov models. In: IEEE International Conference on Engineering and Technology (ICET), pp 1-6. https://doi.org/10.1109/ICEngTechnol.2014.7016804
[11] Sharma R, Khapra R, Dahiya N. June 2020. Sign Language Gesture Recognition, pp 14-19
[12] Liu W, Fan Y, Li Z, Zhang Z. January 2015. RGBD video based human hand trajectory tracking and gesture recognition system. Mathematical Problems in Engineering
[13] Zhang F, Bazarevsky V, Vakunov A, Tkachenka A, Sung G, Chang CL, Grundmann M. 2020. MediaPipe Hands: On-device Real-time Hand Tracking. arXiv preprint arXiv:2006.10214
[14] Das P, Ahmed T, Ali MF. June 2020. Static Hand Gesture Recognition for American Sign Language using Deep Convolutional Neural Network. In: 2020 IEEE Region 10 Symposium (TENSYMP), pp 1762-1765. IEEE
[15] Taskiran M, Killioglu M, Kahraman N. 2018. A Real-Time System for Recognition of American Sign Language by using Deep Learning. In: 2018 41st International Conference on Telecommunications and Signal Processing (TSP), Athens, Greece, pp 1-5. doi: 10.1109/TSP.2018.8441304
[16] Saquib N, Rahman A. 2020. Application of machine learning techniques for real-time sign language detection using wearable sensors. In: Proceedings of the 11th ACM Multimedia Systems Conference (MMSys '20), Association for Computing Machinery, New York, NY, USA, pp 178-189. https://doi.org/10.1145/3339825.3391869
[17] Dutta KK, Bellary SAS. September 2017. Machine learning techniques for Indian sign language recognition. In: 2017 International Conference on Current Trends in Computer, Electrical, Electronics and Communication (CTCEEC), pp 333-336. IEEE
[18] Sahoo A. 2014. Indian sign language recognition using neural networks and kNN classifiers. Journal of Engineering and Applied Sciences 9:1255-1259
[19] Raheja JL, Mishra A, Chaudary A. September 2016. Indian Sign Language Recognition Using SVM. Pattern Recognition and Image Analysis 26(2)
[20] Pigou L, Dieleman S, Kindermans PJ, Schrauwen B. September 2014. Sign language recognition using convolutional neural networks. In: European Conference on Computer Vision, pp 572-578. Springer, Cham