Conference Paper

How Fast Is Sign Language? A Reevaluation of the Kinematic Bandwidth Using Motion Capture

Affiliation:
  • LISN (formerly LIMSI/CNRS), Université Paris-Saclay

... Moreover, SL productions made in isolation hardly provide complete descriptions of how SL is used by signers in real-life conditions, even for signs more complex than letters or numbers. For instance, it has been demonstrated that spontaneous SL mocap recordings can reveal faster movements than isolated signs [34][35][36]. Furthermore, SL motion involves the coordination of far more body parts than the dominant hand alone: the other hand, but also the torso, head, shoulders and arms. ...
... Signers could thus express themselves freely, which elicited a wider variety of SL linguistic forms (e.g., lexical signs, but also depicting signs that describe the size and shape of entities), in contrast with the highly constrained movements (i.e., ASL letters) assessed in previous studies on hand gestures. Furthermore, the present SL movements were recorded in context, within continuous discourses, which is known to give rise to additional motion features and coordination properties compared with isolated productions like ASL letters [34][35][36]. The PMs extracted from our motion dataset are thus more likely to reflect the synergies used by signers in real-life conditions, which is crucial, notably, for designing efficient real-life communication tools. ...
Article
Full-text available
Sign Language (SL) is a continuous and complex stream of multiple body movement features. That raises the challenging issue of providing efficient computational models for the description and analysis of these movements. In the present paper, we used Principal Component Analysis (PCA) to decompose SL motion into elementary movements called principal movements (PMs). PCA was applied to the upper-body motion capture data of six different signers freely producing discourses in French Sign Language. Common PMs were extracted from the whole dataset containing all signers, while individual PMs were extracted separately from the data of individual signers. This study provides three main findings: (1) although the data were not synchronized in time across signers and discourses, the first eight common PMs contained 94.6% of the variance of the movements; (2) the number of PMs that represented 94.6% of the variance was nearly the same for individual as for common PMs; (3) the PM subspaces were highly similar across signers. These results suggest that upper-body motion in unconstrained continuous SL discourses can be described through the dynamic combination of a reduced number of elementary movements. This opens up promising perspectives toward providing efficient automatic SL processing tools based on heavy mocap datasets, in particular for automatic recognition and generation.
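The decomposition described in this abstract can be sketched with a standard PCA pipeline. The data below are synthetic placeholders, not the paper's mocap recordings, and the array shapes (1000 frames, 20 markers × 3 coordinates) are illustrative assumptions.

```python
# Sketch of principal-movement (PM) extraction: PCA applied to a
# (frames x coordinates) posture matrix built from upper-body mocap data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for upper-body mocap: 1000 frames, 20 markers x 3 coordinates.
postures = rng.normal(size=(1000, 60))

pca = PCA()
weights = pca.fit_transform(postures)   # PM weighting curves over time
pms = pca.components_                   # eigenpostures (one row per PM)

# How many PMs are needed to reach ~94.6% of the movement variance?
cum = np.cumsum(pca.explained_variance_ratio_)
n_pms = int(np.searchsorted(cum, 0.946) + 1)
print(n_pms, pms.shape)
```

On real, highly coordinated SL movements the paper reports that only the first eight common PMs reach this variance threshold; on unstructured random data like the placeholder above, far more components are needed.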
... The positions of all markers were defined in reference to the pelvis (used as the origin) and low-pass filtered using a fourth-order Butterworth filter with a cutoff frequency of 12 Hz, following recent estimations of SL kinematic bandwidth (Bigand et al., 2021). From each of the 24 original recordings, one mocap recording unit with a duration of 5 s was extracted from the beginning of the utterance, irrespective of the semantic content. ...
Article
Full-text available
Sign language (SL) motion contains information about the identity of a signer, as does voice for a speaker or gait for a walker. However, how such information is encoded in the movements of a person remains unclear. In the present study, a machine learning model was trained to extract the motion features allowing for the automatic identification of signers. A motion capture (mocap) system recorded six signers during the spontaneous production of French Sign Language (LSF) discourses. A principal component analysis (PCA) was applied to time-averaged statistics of the mocap data. A linear classifier then managed to identify the signers from a reduced set of principal components (PCs). The performance of the model was not affected when information about the size and shape of the signers were normalized. Posture normalization decreased the performance of the model, which nevertheless remained over five times superior to chance level. These findings demonstrate that the identity of a signer can be characterized by specific statistics of kinematic features, beyond information related to size, shape, and posture. This is a first step toward determining the motion descriptors necessary to account for the human ability to identify signers.
Thesis
Full-text available
Sign Languages (SLs) have developed naturally in Deaf communities. With no written form, they are oral languages, using the gestural channel for expression and the visual channel for reception. These low-resource languages have yet to reach a broad consensus at the linguistic level. They make use of lexical signs, i.e. conventionalized units of language whose form is supposed to be arbitrary, but also, unlike vocal languages (if we leave aside co-verbal gestures), iconic structures, which use space to organize discourse. Iconicity, defined as a similarity between the form of a sign and the meaning it carries, is indeed used at several levels of SL discourse. Most research in automatic Sign Language Recognition (SLR) has in fact focused on recognizing lexical signs, at first in isolation and then within continuous SL. The video corpora associated with such research are often relatively artificial, consisting of the repetition of utterances elicited in written form. Other corpora consist of interpreted SL, which may also differ significantly from natural SL, as it is strongly influenced by the surrounding vocal language. In this thesis, we wish to show the limits of this approach by broadening the perspective to consider the recognition of elements used for the construction of discourse or within illustrative structures. To do so, we show the interest and the limits of the corpora developed by linguists. In these corpora, the language is natural and the annotations are sometimes detailed, but not always usable as input data for machine learning systems, as they are not necessarily complete or coherent.
We then propose the redesign of a French Sign Language dialogue corpus, Dicta-Sign-LSF-v2, with rich and consistent annotations, following an annotation scheme shared by many linguists. We then propose a redefinition of the problem of automatic SLR, consisting of the recognition of various linguistic descriptors rather than focusing on lexical signs only. At the same time, we discuss metrics adapted for relevant performance assessment. In order to perform a first experiment on the recognition of linguistic descriptors that are not only lexical, we then develop a compact and generalizable representation of signers in videos. This is done by parallel processing of the hands, face and upper body, using existing tools and models that we have set up. In addition, we preprocess these parallel representations to obtain a relevant feature vector. We then present an adapted and modular architecture for automatic learning of linguistic descriptors, consisting of a recurrent and convolutional neural network. Finally, we show through a quantitative and qualitative analysis the effectiveness of the proposed model, tested on Dicta-Sign-LSF-v2. We first carry out an in-depth analysis of the parameterization, evaluating both the learning model and the signer representation. The study of the model predictions then demonstrates the merits of the proposed approach, with very interesting performance for the continuous recognition of four linguistic descriptors, especially in view of the uncertainty related to the annotations themselves. The segmentation of the latter is indeed subjective, and the very relevance of the categories used is not strongly demonstrated. Indirectly, the proposed model could therefore make it possible to measure the validity of these categories.
With several areas for improvement being considered, particularly in terms of signer representation and the use of larger corpora, the results are very encouraging and pave the way for a wider understanding of continuous Sign Language Recognition.
Article
Full-text available
Machine learning has been used to accurately classify musical genre using features derived from audio signals. Musical genre, as well as lower-level audio features of music, have also been shown to influence music-induced movement, however, the degree to which such movements are genre-specific has not been explored. The current paper addresses this using motion capture data from participants dancing freely to eight genres. Using a Support Vector Machine model, data were classified by genre and by individual dancer. Against expectations, individual classification was notably more accurate than genre classification. Results are discussed in terms of embodied cognition and culture.
Conference Paper
Full-text available
This paper presents a 3D corpus of motion capture data on French Sign Language (LSF), which is the first one available for the scientific community for pluridisciplinary studies. The paper also exhibits the usefulness of performing kinematic analysis on the corpus. The goal of the analysis is to acquire informative and quantitative knowledge for the purpose of better understanding and modelling LSF movements. Several LSF native signers are involved in the project. They were asked to describe 25 pictures in a spontaneous way while the 3D position of various body parts was recorded. Data processing includes identifying the markers, interpolating the information of missing frames, and importing the data to an annotation software to segment and classify the signs. Finally, we present the results of an analysis performed to characterize information-bearing parameters and use them in a data mining and modelling perspective.
Article
Full-text available
Karate is a martial art that partly depends on subjective scoring of complex movements. Principal component analysis (PCA)-based methods can identify the fundamental synergies (principal movements) of the motor system, providing a quantitative global analysis of technique. In this study, we aimed to describe the fundamental multi-joint synergies of a karate performance, under the hypothesis that these synergies are skill dependent, and to estimate each karateka's experience level, expressed as years of practice. A motion capture system recorded traditional karate techniques of 10 professional and amateur karateka. At each time point, the 3D coordinates of body markers produced a posture vector; these vectors were normalised, concatenated across all karateka and submitted to a first PCA. Five principal movements described both gross movement synergies and individual differences. A second PCA followed by linear regression estimated the years of practice using principal movements (eigenpostures and weighting curves) and centre-of-mass kinematics (error: 3.71 years; R² = 0.91, P ≪ 0.001). Principal movements and eigenpostures varied among karateka and as functions of experience. This approach provides a framework to develop visual tools for the analysis of motor synergies in karate, allowing detection of the multi-joint motor patterns that should be restored after an injury, or specifically trained to increase performance.
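The two-stage analysis described in this abstract (PCA on posture vectors, then linear regression from component scores to years of practice) can be sketched as below. All data, dimensions, and the per-subject averaging step are synthetic placeholders, not the study's protocol.

```python
# Illustrative two-stage pipeline: PCA over concatenated posture vectors,
# then regression of per-subject component scores onto years of practice.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n_karateka, frames, dims = 10, 200, 45  # e.g. 15 markers x 3 coordinates

years = rng.uniform(1, 25, size=n_karateka)  # years of practice
# Synthetic posture data whose first coordinate drifts with experience.
base = rng.normal(size=(n_karateka, frames, dims))
base[:, :, 0] += years[:, None] * 0.5

# Stage 1: PCA on all posture vectors -> principal movements.
pca = PCA(n_components=5)
scores = pca.fit_transform(base.reshape(-1, dims))

# Stage 2: per-karateka mean scores regressed onto years of practice.
per_subject = scores.reshape(n_karateka, frames, 5).mean(axis=1)
reg = LinearRegression().fit(per_subject, years)
print(f"R^2 = {reg.score(per_subject, years):.2f}")
```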
Article
Full-text available
Purpose Sign language users recruit physical properties of visual motion to convey linguistic information. Research on American Sign Language (ASL) indicates that signers systematically use kinematic features (e.g., velocity, deceleration) of dominant hand motion for distinguishing specific semantic properties of verb classes in production (Malaia & Wilbur, 2012a) and process these distinctions as part of the phonological structure of these verb classes in comprehension (Malaia, Ranaweera, Wilbur, & Talavage, 2012). These studies are driven by the event visibility hypothesis by Wilbur (2003), who proposed that such use of kinematic features should be universal to sign language (SL) by the grammaticalization of physics and geometry for linguistic purposes. In a prior motion capture study, Malaia and Wilbur (2012a) lent support for the event visibility hypothesis in ASL, but there has not been quantitative data from other SLs to test the generalization to other languages. Method The authors investigated the kinematic parameters of predicates in Croatian Sign Language (Hrvatskom Znakovnom Jeziku [HZJ]). Results Kinematic features of verb signs were affected both by event structure of the predicate (semantics) and phrase position within the sentence (prosody). Conclusion The data demonstrate that kinematic features of motion in HZJ verb signs are recruited to convey morphological and prosodic information. This is the first crosslinguistic motion capture confirmation that specific kinematic properties of articulator motion are grammaticalized in other SLs to express linguistic features.
Conference Paper
Full-text available
The MobileASL project aims to increase accessibility by enabling Deaf people to communicate over video cell phones in their native language, American Sign Language (ASL). Real-time video over cell phones can be a computationally intensive task that quickly drains the battery, rendering the cell phone useless. Properties of conversational sign language allow us to save power and bits: namely, lower frame rates are possible when one person is not signing due to turn-taking, and signing can potentially employ a lower frame rate than fingerspelling. We conduct a user study with native signers to examine the intelligibility of varying the frame rate based on activity in the video. We then describe several methods for automatically determining the activity of signing or not signing from the video stream in real-time. Our results show that varying the frame rate during turn-taking is a good way to save power without sacrificing intelligibility, and that automatic activity analysis is feasible.
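The activity-driven frame-rate idea described above can be sketched with simple frame differencing: capture at a high rate while motion suggests signing, and drop to a low rate during turn-taking pauses. The threshold and the two rates below are illustrative assumptions, not MobileASL's actual parameters.

```python
# Toy activity detector: mean absolute frame difference decides whether
# the next interval is captured at a signing or an idle frame rate.
import numpy as np

def choose_frame_rate(prev_frame, frame, threshold=8.0,
                      signing_fps=15, idle_fps=1):
    """Return the capture rate (fps) for the next interval."""
    motion = np.mean(np.abs(frame.astype(float) - prev_frame.astype(float)))
    return signing_fps if motion > threshold else idle_fps

rng = np.random.default_rng(3)
still = rng.integers(0, 255, size=(64, 96), dtype=np.uint8)
moving = np.roll(still, 5, axis=1)           # large apparent motion

print(choose_frame_rate(still, still))       # 1  (idle: no motion)
print(choose_frame_rate(still, moving))      # 15 (signing: high motion)
```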
Article
Realtime multi-person 2D pose estimation is a key component in enabling machines to have an understanding of people in images and videos. In this work, we present a realtime approach to detect the 2D pose of multiple people in an image. The proposed method uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. This bottom-up system achieves high accuracy and realtime performance, regardless of the number of people in the image. In previous work, PAFs and body part location estimation were refined simultaneously across training stages. We demonstrate that using a PAF-only refinement is able to achieve a substantial increase in both runtime performance and accuracy. We also present the first combined body and foot keypoint detector, based on an annotated foot dataset that we have publicly released. We show that the combined detector not only reduces the inference time compared to running them sequentially, but also maintains the accuracy of each component individually. This work has culminated in the release of OpenPose, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints.
Conference Paper
Animating sign language requires both a model of the structure of the target language and a computer animation system capable of producing the resulting avatar motion. On the language modelling side, AZee proposes a methodology and formal description mechanism to build grammars of Sign languages. It has mostly assumed the existence of an avatar capable of rendering its low-level articulation specifications. On the computer animation side, the Paula animator system proposes a multi-track SL generation platform designed for realism of movements, programmed from its birth to be driven by linguistic input.
Article
This work concerns Sign Language utterance generation and coarticulation. It first lists the important features to tackle in such systems and introduces the notion of coarticulation in SL. We then describe one of our SL generation platforms and the way coarticulation is integrated into it. The current version allows us to evaluate the settings for some coarticulation effects, such as transition and sign durations.
Article
Synopsis—The most obvious method for determining the distortion of telegraph signals is to calculate the transients of the telegraph system. This method has been treated by various writers, and solutions are available for telegraph lines with simple terminal conditions. It is well known that the extension of the same methods to more complicated terminal conditions, which represent the usual terminal apparatus, leads to great difficulties. The present paper attacks the same problem from the alternative standpoint of the steady-state characteristics of the system. This method has the advantage over the method of transients that the complication of the circuit which results from the use of terminal apparatus does not complicate the calculations materially. This method of treatment necessitates expressing the criteria of distortionless transmission in terms of the steady-state characteristics. Accordingly, a considerable portion of the paper describes and illustrates a method for making this translation. A discussion is given of the minimum frequency range required for transmission at a given speed of signaling. In the case of carrier telegraphy, this discussion includes a comparison of single-sideband and double-sideband transmission. A number of incidental topics is also discussed.
Conference Paper
In this paper we present some custom designed filters for real-time motion capture applications. Our target application is motion controllers, i.e. systems that interpret hand motion for musical interaction. In earlier research we found effective methods to design nearly optimal filters for realtime applications. However, to be able to design suitable filters for our target application, it is necessary to establish the typical frequency content of the motion capture data we want to filter. This will again allow us to determine a reasonable cutoff frequency for the filters. We have therefore conducted an experiment in which we recorded the hand motion of 20 subjects. The frequency spectra of these data together with a method similar to the residual analysis method were then used to determine reasonable cutoff frequencies. Based on this experiment, we propose three cutoff frequencies for different scenarios and filtering needs: 5, 10 and 15 Hz, which correspond to heavy, medium and light filtering, respectively. Finally, we propose a range of real-time filters applicable to motion controllers. In particular, low-pass filters and low-pass differentiators of degrees one and two, which in our experience are the most useful filters for our target application.
Article
Movement of the hands and arms through space is an essential element both in the lexical structure of American Sign Language (ASL), and, most strikingly, in the grammatical structure of ASL: it is in patterned changes of the movement of signs that many grammatical attributes are represented. These grammatical attributes occur as an isolable superimposed layer of structure, as demonstrated by the accurate identification by deaf signers of these attributes presented only as dynamic point-light displays. Three-dimensional computer graphic analyses were applied in two domains, to quantify the nature of the 'phonological' (formational) distinctions underlying the structure of grammatical processes in ASL. In the first, we show that for one 'phonological' opposition, evenness/unevenness of movement, a ratio of maximum velocities throughout the movement perfectly captures the linguistic classification of forms along this dimension. In the second, we map out a two-dimensional visual-articulatory space that captures in terms of signal properties, relevant relationships among movement forms that were independently posited as linguistically relevant. The fact that we are finding direct correspondences between properties of the signal and properties of the 'phonological' system in sign language, may arise in part because in sign languages, unlike in spoken languages, the movements of the articulators themselves are directly observable, and, also in part, because of the predominantly layered 'phonological' organization of sign language.
Article
A method is developed for representing any communication system geometrically. Messages and the corresponding signals are points in two `function spaces,' and the modulation process is a mapping of one space into the other. Using this representation, a number of results in communication theory are deduced concerning expansion and compression of bandwidth and the threshold effect. Formulas are found for the maximum rate of transmission of binary digits over a system when the signal is perturbed by various types of noise. Some of the properties of `ideal' systems which transmit at this maximum rate are discussed. The equivalent number of binary digits per second for certain information sources is calculated.
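The "maximum rate of transmission of binary digits" result summarised above is the Shannon capacity of a band-limited channel with additive Gaussian noise, C = B log2(1 + S/N). The numeric example below (a 3 kHz telephone-grade channel at 30 dB SNR) is illustrative.

```python
# Shannon capacity of a band-limited channel with Gaussian noise.
import math

def channel_capacity(bandwidth_hz, snr_linear):
    """C = B * log2(1 + S/N), in bits per second."""
    return bandwidth_hz * math.log2(1 + snr_linear)

# A 3 kHz channel at 30 dB SNR (S/N = 1000):
print(round(channel_capacity(3000, 1000)))  # ≈ 29902 bits per second
```

This figure puts the compressed-ASL bitrates discussed elsewhere on this page (3,900–9,600 bps) comfortably inside a telephone channel's theoretical capacity.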
Article
American Sign Language (ASL) is a gestural language used by the hearing impaired. This paper describes experimental tests with deaf subjects that compared the most effective known methods of creating extremely compressed ASL images. The minimum requirements for intelligibility were determined for three basically different kinds of transformations: (1) gray-scale transformations that subsample the images in space and time; (2) two-level intensity quantization that converts the gray-scale image into a black-and-white approximation; (3) transformations that convert the images into black-and-white outline drawings (cartoons). In Experiment 1, five subjects made quality ratings of 81 kinds of images that varied in spatial resolution, frame rate, and type of transformation. The most promising image size was 96 × 64 pixels (height × width). The 17 most promising image transformations were selected for formal intelligibility testing: 38 deaf subjects viewed 87 ASL sequences, 1–2 s long, of each transformation. The most effective code for gray-scale images is an analog raster code, which can produce images with 0.86 normalized intelligibility (I) at a bandwidth of 2,880 Hz and is therefore transmittable on ordinary 3 kHz telephone circuits. For the binary images, a number of coding schemes are described and compared, the most efficient being an extension of the quadtree method, here termed binquad coding, which yielded I = 0.68 at 7,500 bits per second (bps). For cartoons, an even more efficient polygonal transformation into connected straight-line segments is proposed, together with a vectorgraph code yielding, for example, I = 0.56 at 3,900 bps and I = 0.70 at 6,000 bps. Polygonally transformed cartoons offer the possibility of telephonic ASL communication at 4,800 bps. Several combinations of binary image transformations and encoding schemes offer I > 80% at 9,600 bps.
Article
The classic book on human movement in biomechanics, newly updated. Widely used and referenced, David Winter's Biomechanics and Motor Control of Human Movement is a classic examination of techniques used to measure and analyze all body movements as mechanical systems, including such everyday movements as walking. It fills a gap in the human movement sciences, where modern science and technology are integrated with anatomy, muscle physiology, and electromyography to assess and understand human movement. In light of the explosive growth of the field, this new edition updates and enhances the text with: expanded coverage of 3D kinematics and kinetics; new material on biomechanical movement synergies and signal processing, including auto- and cross-correlation, frequency analysis, analog and digital filtering, and ensemble averaging techniques; presentation of a wide spectrum of measurement and analysis techniques; updates to all existing chapters; and basic physical and physiological principles in capsule form for quick reference. An essential resource for researchers and students in kinesiology, bioengineering (rehabilitation engineering), physical education, ergonomics, and physical and occupational therapy, this text will also prove valuable to professionals in orthopedics, muscle physiology, and rehabilitation medicine. In response to many requests, the extensive numerical tables contained in Appendix A, "Kinematic, Kinetic, and Energy Data," can also be found at the following Web site: www.wiley.com/go/biomechanics.
Article
Perception of dynamic events of American Sign Language (ASL) was studied by isolating information about motion in the language from information about form. Four experiments utilized Johansson's technique for presenting biological motion as moving points of light. In the first, deaf signers were highly accurate in matching movements of lexical signs presented in point-light displays to those normally presented. Both discrimination accuracy and the pattern of errors were similar in this matching task to that obtained in a control condition in which the same signs were always represented normally. The second experiment showed that these results held for discrimination of morphological operations presented in point-light displays as well. In the third experiment, signers were able to accurately identify signs of a constant handshape and morphological operations acting on signs presented in point-light displays. Finally, in Experiment 4, we evaluated what aspects of the motion patterns carried most of the information for sign identifiability. We presented signs in point-light displays with certain lights removed and found that the movement of the fingertips, but not of any other pair of points, is necessary for sign identification and that, in general, the more distal the joint, the more information its movement carries.
Article
Access to telecommunication systems by deaf users of sign language can be greatly enhanced with the incorporation of video conferencing in addition to text-based adaptations. However, the communication channel bandwidth is often challenged by the spatial requirements to represent the image in each frame and temporal demands to preserve the movement trajectory with a sufficiently high frame rate. Effective systems must balance the portion of a limited channel bandwidth devoted to the quality of the individual frames and the frame rate in order to meet their intended needs. Conventional video conferencing technology generally addresses the limitations of channel capacity by drastically reducing the frame rate, while preserving image quality. This produces a jerky image that disturbs the trajectories of the hands and arms, which are essential in sign language. In contrast, a sign language communication system must provide a frame rate that is capable of representing the kinematic bandwidth of human movement. Prototype sign language communication systems often attempt to maintain a high frame rate by reducing the quality of the image with lossy spatial compression. Unfortunately, this still requires a combined spatial and temporal data rate, which exceeds the limited channel of residential and wireless telephony. While spatial compression techniques have been effective in reducing the data, there has been no comparable compression of sign language in the temporal domain. Even modest reductions in the frame rate introduce perceptually disturbing flicker that decreases intelligibility. This paper introduces a method through which temporal compression on the order of 5:1 can be achieved. This is accomplished by decoupling the biomechanical or kinematic bandwidth necessary to represent continuous movements in sign language from the perceptually determined critical flicker frequency.
Article
The use of the fast Fourier transform in power spectrum analysis is described. Principal advantages of this method are a reduction in the number of computations and in required core storage, and convenient application in nonstationarity tests. The method involves sectioning the record and averaging modified periodograms of the sections.
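The sectioning-and-averaging method summarised above (Welch's method) is available directly as `scipy.signal.welch`. The synthetic signal, sampling rate, and segment length below are illustrative choices.

```python
# Welch power spectral density estimate: section the record, window each
# section, and average the modified periodograms.
import numpy as np
from scipy.signal import welch

fs = 200.0
t = np.arange(0, 10, 1 / fs)
x = (np.sin(2 * np.pi * 12 * t)
     + 0.5 * np.random.default_rng(4).normal(size=t.size))

freqs, psd = welch(x, fs=fs, nperseg=256)
peak = freqs[np.argmax(psd)]
print(f"dominant frequency: {peak:.1f} Hz")  # close to the 12 Hz tone
```

This kind of averaged spectrum is exactly what is needed to estimate the kinematic bandwidth of mocap signals, as in the cutoff-frequency studies cited on this page.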
A Kinematic Analysis of Sign Language
  • C. C. Koech
LSF-ANIMAL: A Motion Capture Corpus in French Sign Language Designed for the Animation of Signing Avatars
  • L. Naert
  • C. Larboulette
  • S. Gibet
Building French Sign Language Motion Capture Corpora for Signing Avatars
  • S. Gibet