Sign Sentence Recognition with Smart Watches
Deniz Ekiz, Gamze Ege Kaya, Serkan Buğur, Sıla Güler, Buse Buz, Bilgin Kosucu, Bert Arnrich
Computer Engineering Department, Boğaziçi University
Istanbul, Turkey
Abstract—We present a smartwatch application that
recognizes important sign sentences. We make use of modern
smart watches like Samsung Gear that are equipped with inbuilt
sensors including accelerometer, gyroscope and magnetometer.
We show how well a smartwatch can recognize important sign
sentences. We have implemented a smartwatch app that collects
3d accelerometer and 3d gyroscope data from the watch. We have
recorded 8 questions and 13 sentences from 5 people who are
fluent in sign language. We use dynamic time warping to compute
the distances between the gestures and templates in all data
dimensions. The resulting distances serve as input for
discriminating the gestures. We evaluated the discriminative
ability of logistic regression. In semi-person-independent evaluations, in which templates come from the new user while the logistic regression models come from other users, we achieve discrimination accuracies between 85.7% and 97.6%.
Keywords—smart watch, sign language, dynamic time warping
I. INTRODUCTION
Deaf people express themselves in sign language when
communicating with other people. Sign language consists of
different modalities including arm gestures, hand shapes, upper
body posture and/or facial expressions. Sign language
recognition aims to develop methods to correctly identify a
sequence of signs [1]. Many researchers use computer vision for recognizing sign language since it can capture all modalities. A main drawback of vision-based systems is the need for a fixed
infrastructure. In many daily life situations, e.g. getting
directions or asking for medical support, the required
infrastructure is not available. In order to overcome this
drawback, in recent years researchers experimented with
wearable sensor technologies that deaf people could always have
with them. For example, inertial measurement units mounted on
the wrist were used to recognize arm gestures. However, many
of the investigated technologies were dedicated high-end sensor
systems that have no additional benefit for the user. In our work
we make use of common out-of-the-box smart watches for
recognizing sign sentences. Modern smart watches are equipped
with inbuilt sensors including accelerometer, gyroscope and
magnetometer. These sensors provide lower-resolution data than high-end systems (sampling frequency of 20 Hz vs. 200+ Hz), but their rich functionality makes them attractive to wear every day.
II. RELATED WORK
Recently, the authors of [2] proposed a system called
SCEPTRE for recognizing sign language in user-to-user
communication and user-to-computer interaction. They use two
forearm-worn devices to gather accelerometer, electromyogram
(EMG) and orientation data. The collected data is sent to a
smartphone and from there to a server for data processing. In a
training phase, gesture data is stored in a database for later
comparisons. Dynamic time warping is used for template
comparisons of accelerometer, gyroscope and
electromyography data. The system achieves an accuracy of
97.72% for discriminating 20 gestures.
In a similar approach presented in [3], the authors use a
wearable inertial measurement unit (IMU) and surface
electromyography (sEMG) for recognizing American Sign
Language (ASL). A support vector machine classifier achieves
on average 96.16% person-dependent accuracy for
discriminating 80 ASL signs.
In contrast to related work, we rely on low-resolution IMU
data collected with a common smartwatch.
III. METHODS
In the following we present our methods for data collection
and data processing.
A. Data collection
In cooperation with the Department of Linguistics at Boğaziçi
University, we defined 8 questions and 13 sentences that are
helpful for deaf people to handle crucial situations in daily life
(see Figure 1). Five participants (three female) who were fluent
in Turkish Sign Language (TSL) were recruited for data
collection. Four of the participants are native deaf signers. One of
the native deaf signers is a sign language teacher.
Are you ok? How can I get off the bus? Do you have
paper? Can you please help me? In which direction is
the metro? In which direction is the police? In which
direction is the hospital? Where is the exit?
I’m fine. See you. Take me to hospital. I get sick. I’m
deaf. I’m lost. I’m thirsty. I’m hungry. I’m tired. I
need to go to this address. My wallet is lost. I don’t
know. This is an emergency.
Figure 1 Important questions and sentences
Table 1 Data format: Timestamp, 3d acceleration both without and with the gravitational acceleration, 3d rotation

Timestamp  ax  ay  az  gx  gy  gz  rα  rβ  rγ
We used Samsung Gear S2 smartwatches for data recording.
Gear S2 has a rich set of built-in sensors: 3d accelerometer, 3d
gyroscope, light sensor, barometer, pedometer and A-GPS. Gear
S2 runs on Tizen Wearable operating system and is programmed
with JavaScript. The data collection from the sensors is event
driven, and consequently constant sampling cannot be
guaranteed. The average sampling rate of the Gear S2 IMU data is
21 Hz. Both Gear S2 and Tizen are relatively new and hence
fairly unexplored platforms. We have implemented our own data
collection app on Gear S2. To the best of our knowledge, we are
the first to introduce a fully automated data collection app on
Gear S2, exploiting all its sensors and interfaces.
For the recognition of sign language sentences, we are particularly interested in the data from the 3d accelerometer and
3d gyroscope sensors. Gear S2 provides 3d acceleration both
without and with the gravitational acceleration. Our data
collection app collects both acceleration variants and gyroscope
data together with a respective timestamp (see Table 1).
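As an illustration, the following is a minimal sketch of loading one recording in the Table 1 format and computing the linear-acceleration magnitude that is later used for segmentation. The file name and column labels are illustrative placeholders, not part of our app.

```python
import numpy as np
import pandas as pd

# Illustrative placeholders: one recording exported in the Table 1 format.
# Columns: timestamp, linear acceleration (ax..az), acceleration with
# gravity (gx..gz), 3d rotation (r_alpha, r_beta, r_gamma).
cols = ["timestamp", "ax", "ay", "az", "gx", "gy", "gz",
        "r_alpha", "r_beta", "r_gamma"]
data = pd.read_csv("gear_s2_recording.csv", header=None, names=cols)

# Magnitude of the linear acceleration (without gravity); this signal
# drives the segmentation described in Section IV.
data["magnitude"] = np.sqrt(data["ax"]**2 + data["ay"]**2 + data["az"]**2)

# Sampling is event driven (about 21 Hz on average), so timestamps are
# not equally spaced and no fixed sampling rate should be assumed.
print(data.head())
```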
We asked each participant to wear two Gear S2
smartwatches on their right and left wrist respectively. The
experiment leader asked each participant to repeat each of the
sign sentences 10 times (see Figure 2). Thus, from each of the 5
participants we collect a total of (8 questions + 13 sentences) x
10 repetitions = 210 instances. Between two repetitions,
participants were asked not to move for approx. one second in
order to indicate stop and start of a repetition. The recording
sessions were recorded with a GoPro camera for later inspection.
IV. DATA PROCESSING
In a first step, data was segmented into single repetitions.
Segmentation was performed on normalized linear acceleration
magnitude values. As a pre-processing step, a Savitzky-Golay filter [4] was applied to the magnitude values to increase the signal-to-noise ratio and to remove outliers. In Figure 3 the acceleration
magnitude of a sign sentence performed by two subjects is
shown before and after applying the filter. In a sliding window
approach, we compute the mean of the smoothed magnitude
values. If the window mean falls below 2 standard deviations of
the overall mean, we consider it as an end point of the repetition.
If the window mean is again within 2 standard deviations, we
consider it as a start point of a new repetition. An exemplary
segmentation is shown in Figure 4. We visually inspected the
resulting segmentations. In 10% of the cases we had to manually correct the segmentation. We use the time stamps of the start and end points to segment all data modalities.
Figure 2 Recording setup: each participant wears two Gear S2 smartwatches on their right and left wrist respectively. The experiment leader asked each participant to repeat each of the sign sentences 10 times.
Figure 3 Acceleration magnitude of a sign sentence performed by two subjects before (above) and after applying the filter
Figure 4 Automatic segmentation of 10 repetitions: the blue line marks the beginning of a repetition, the purple line marks the end
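A minimal sketch of the segmentation rule described in this section, operating on the already computed acceleration magnitude. The filter parameters, window size and the exact threshold handling are illustrative assumptions rather than the values used in our experiments.

```python
import numpy as np
from scipy.signal import savgol_filter

def segment_repetitions(magnitude, filter_window=11, polyorder=3, win_size=20):
    """Split a magnitude signal into repetitions: a window mean falling
    2 standard deviations below the overall mean marks an end point, a
    window mean back above that level marks the start of the next one."""
    smoothed = savgol_filter(magnitude, window_length=filter_window,
                             polyorder=polyorder)
    # Normalize so the threshold can be expressed in standard deviations.
    norm = (smoothed - smoothed.mean()) / smoothed.std()
    threshold = -2.0

    segments, start, active = [], 0, True
    for i in range(0, len(norm) - win_size, win_size):
        window_mean = norm[i:i + win_size].mean()
        if active and window_mean < threshold:         # end of a repetition
            segments.append((start, i + win_size))
            active = False
        elif not active and window_mean >= threshold:  # start of a new repetition
            start = i
            active = True
    if active:                                         # close a trailing repetition
        segments.append((start, len(norm)))
    return segments  # (start, end) index pairs used to cut all modalities
```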
We use the Dynamic Time Warping (DTW) approach to
compute the distance between a data segment and a data
template. DTW is a commonly used method to compute
similarities between time series that are not perfectly aligned.
DTW warps two time series and matches the indices of similar patterns as shown in Figure 5. The resulting similarity measure is the summed distance over all matched points. We employed the dtw package implementation from [5], which is available in R. We created a data template for each of the 21 sign sentences under study by using the averaging template method proposed in [6]. In other words, we obtained a template for each sentence in each signal dimension with the Cross-Words Reference Template (CWRT) method.
For each data segment, we compute the DTW similarities to
all 21 data templates in each of the 9 raw signal dimensions plus
the computed acceleration magnitude. In case the data segment
and data template belong to the same sentence, we set the class
label C=1, otherwise C=0. The resulting data set for one data template is shown in Table 2. In case the data segment is from
the same sentence as the template, the DTW similarities tend to
be smaller.
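The distances themselves were computed with the dtw package in R [5]. Purely for illustration, the following Python sketch shows the same idea with a plain dynamic-programming DTW and the construction of one Table 2 style row per template; the CWRT template averaging [6] is omitted and all names are illustrative.

```python
import numpy as np

def dtw_distance(x, y):
    """Classic dynamic-programming DTW: cumulative cost of the optimal
    warping path between two 1-d series (absolute difference as local cost)."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# The 9 raw signal dimensions plus the computed acceleration magnitude.
DIMENSIONS = ["ax", "ay", "az", "gx", "gy", "gz",
              "r_alpha", "r_beta", "r_gamma", "magnitude"]

def distance_row(segment, template):
    """One row of Table 2: DTW distances between a segment and one sentence
    template, computed independently in every dimension. Both arguments are
    dicts mapping a dimension name to a 1-d array."""
    return np.array([dtw_distance(segment[d], template[d]) for d in DIMENSIONS])

def rows_for_segment(segment, templates, true_sentence):
    """Distances of one segment to all 21 templates plus the binary label
    C (1 when segment and template belong to the same sentence)."""
    rows = {s: distance_row(segment, t) for s, t in templates.items()}
    labels = {s: int(s == true_sentence) for s in templates}
    return rows, labels
```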
We use logistic regression since it is a well-known and
powerful method to estimate multivariate models with binary
response variables. As input variables we use the similarity
measures shown in Table 2 and as binary response variable we
use the class label. For classifying a new data segment we first
compute the DTW similarities between the segment and all 21
data templates. Next, we apply the logistic regression model on
the computed similarity data, i.e. we apply it for all 21 similarity
vectors. Finally, we decide for the resulting class in two
alternative ways: (1) in the threshold approach, we check whether there is exactly one model with a response value larger than 0.5; if so, this winner model defines the predicted class; (2) in the max response approach, the winner
model is the one with the highest response value.
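A sketch of this classification scheme, assuming distance rows in the Table 2 layout. The use of scikit-learn's LogisticRegression and the function names are illustrative assumptions; the paper does not prescribe a particular implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_models(rows_per_template, labels_per_template):
    """Fit one binary logistic regression model per sentence template.
    rows_per_template[s]: matrix of DTW distance vectors computed against
    template s; labels_per_template[s]: the corresponding C labels."""
    return {s: LogisticRegression(max_iter=1000)
                 .fit(rows_per_template[s], labels_per_template[s])
            for s in rows_per_template}

def classify(segment_rows, models, method="max_response"):
    """segment_rows[s]: distance vector of a new segment to template s.
    Threshold rule: accept only when exactly one model responds above 0.5.
    Max-response rule: pick the model with the highest response."""
    responses = {s: models[s].predict_proba(row.reshape(1, -1))[0, 1]
                 for s, row in segment_rows.items()}
    if method == "threshold":
        winners = [s for s, p in responses.items() if p > 0.5]
        return winners[0] if len(winners) == 1 else None  # undecided otherwise
    return max(responses, key=responses.get)
```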
We evaluate the accuracy of the logistic regression models
by using data from person i to create a model and applying the model to data from person j to compute model accuracy. We perform this evaluation in two variants: (1) in the independent variant, we use data templates from person i to compute the DTW similarities; (2) in the semi-independent variant, we use data templates from person j to compute the DTW similarities.
V. RESULTS
In this paper we consider DTW similarities based on 3d acceleration with gravity as input variables for the logistic regression models. We evaluated both the threshold and the max response approaches.
Figure 5 Dynamic Time Warping of acceleration magnitudes of the sign sentence “I’m fine” performed by the same subject
Table 2 DTW similarities between a data segment S from the first sentence and all 21 data templates Ti in each of the 9 raw signal dimensions plus the computed acceleration magnitude m. In case the data segment and data template belong to the same sentence, we set the class label C=1, otherwise C=0. In case C=1, the DTW similarities tend to be smaller.

              ax   ay   az   gx   gy   gz   rα    rβ    rγ    m    C
DTW(S, T1)    1.1  0.9  1.0  1.4  1.1  1.5  36.7  36.2  39.4  0.9  1
DTW(S, T2)    1.2  0.9  0.9  1.5  1.1  1.3  47.3  43.9  37.4  0.9  0
...           ...  ...  ...  ...  ...  ...  ...   ...   ...   ...  0
DTW(S, T21)   1.1  1.3  1.4  1.6  1.5  1.6  52.2  41.2  46.7  1.0  0
Figure 6 Independent threshold accuracy from logistic regression
models that were created with data from person i (row) and applied
to data from person j (column)
Figure 7 Semi-independent threshold accuracy from logistic
regression models that were created with data from person i (row) and
applied to data from person j (column).
All results are derived from logistic regression models that were created with data from person i and applied to data from person j. With 21 classes, the a-priori
probability is approx. 5%. In the following we present confusion
matrices that show the discrimination accuracies when models
were created with data from person i (row) and applied to data
from person j (column).
In Figure 6 the confusion matrix of the independent threshold evaluation is shown. It can be observed that the model accuracies are above the a-priori probability only when train and test data come from the same person. The highest accuracy of 46.2% is achieved with data from person 1, who is a sign language teacher. In Figure 7 the confusion matrix of the semi-independent threshold evaluation is shown. It can be observed that now all evaluations achieve results above the a-priori probability. In Figure 8 the confusion matrix of the independent max response evaluation is shown. It can be observed that the classification is very accurate when train and test data come from the same person. In Figure 9 the confusion matrix of the semi-independent max response evaluation is shown. It can be observed that now very high accuracy values are achieved in all evaluations.
VI. CONCLUSION AND DISCUSSION
In this contribution we showed how well a common
smartwatch can recognize important sign sentences. With the
standard logistic regression threshold method, we achieved
46.2% discrimination accuracy with data from the sign-language
teacher (first participant). However, for the other participants the
accuracies are lower (32.4%, 30.0%, 27.6%, 15.7%). In the
person-independent evaluation, the standard threshold method
resulted in very low discrimination accuracies.
The max response approach provided very high accuracies of 96.7%, 96.2%, 95.7%, 91.9%, and 91.0% for the five participants, respectively. In a person-independent evaluation, this method was always better than the a-priori probability, but with accuracies between 8.1% and 33.8% the results are too low to work well in practice.
In a semi-person-independent evaluation, the standard threshold approach achieved accuracies between 7.1% and 45.7%, which are too low to work in practice. In a semi-person-independent evaluation, the max-response approach achieved accuracies between 85.7% and 97.6%, which would work well in practice.
Overall, we can conclude that the semi-person-independent approach achieves reasonable accuracy: a new user has to record a set of repetitions of each sentence once. From the recorded data, templates are created. The max response models that were created from data of other users can then be applied to achieve reasonable accuracies.
In order to further increase the system accuracy, the person-dependent max response method should be used: a new user has to record a set of repetitions of each sentence once. From the recorded data, templates are created and a logistic regression model is computed.
In future work we will extend the smartwatch app with real-
time classification ability.
ACKNOWLEDGMENT
We are very grateful for the support of Prof. A. Sumru Özsoy
from the Department of Linguistics at Boğaziçi University. We
thank all our participants for their time and help.
REFERENCES
[1] H. Cooper, B. Holt and R. Bowden, “Sign Language Recognition,” chapter in Visual Analysis of Humans: Looking at People, Springer, 2011, pp. 539-562.
[2] P. Paudyal, A. Banerjee, and S.K.S. Gupta, “SCEPTRE: A Pervasive, Non-Invasive, and Programmable Gesture Recognition Technology,” Proceedings of the 21st International Conference on Intelligent User Interfaces, 2016, pp. 282-293.
[3] J. Wu and L. Sun, “A Wearable System for Recognizing American Sign Language in Real-Time Using IMU and Surface EMG Sensors,” IEEE Journal of Biomedical and Health Informatics, vol. 20, no. 5, Sept. 2016, pp. 1281-1290.
[4] A. Savitzky and M. J. E. Golay, “Smoothing and Differentiation of Data by Simplified Least Squares Procedures,” Analytical Chemistry, vol. 36, no. 8, 1964, pp. 1627-1639.
[5] T. Giorgino, “Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package,” Journal of Statistical Software, vol. 31, no. 7, 2009, pp. 1-24.
[6] W.H. Abdulla, D. Chow, and G. Sin, “Cross-words Reference Template for DTW-based Speech Recognition Systems,” TENCON 2003, Conference on Convergent Technologies for Asia-Pacific Region, 2003, vol. 4, pp. 1576-1579.
Figure 8 Independent max response accuracy from logistic regression
models that were created with data from person i (row) and applied to
data from person j (column)
Figure 9 Semi-independent max response accuracy from logistic
regression models that were created with data from person i (row) and
applied to data from person j (column)