Optimised sign language video coding based on eye-tracking analysis

Abstract

The imminent arrival of mobile video telephony will enable deaf people to communicate - as hearing people have been able to do for some time now - anytime/anywhere in their own language, sign language. At low bit rates, coding of sign language sequences is very challenging due to the high level of motion and the need to maintain good image quality to aid understanding. This paper presents optimised coding of sign language video at low bit rates in a way that favours comprehension of the compressed material by deaf users. Our coding suggestions are based on an eye-tracking study that we have conducted, which allows us to analyse the visual attention of sign language viewers. The results of this study are included in this paper. Analysis and results for two coding methods, one using MPEG-4 video objects and the second using foveation filtering, are presented. Results with foveation filtering are very promising, offering a considerable decrease in bit rate in a way which is compatible with the visual attention patterns of deaf people, as these were recorded in the eye tracking study.
Perceptually optimised sign language video
coding based on eye tracking analysis
D. Agrafiotis, N. Canagarajah, D.R. Bull and M. Dye
A perceptually optimised approach to sign language video coding is
presented. The proposed approach is based on the results (included) of
an eye tracking study of the visual attention of sign language viewers.
Results show reductions in bit rate of over 30% with very good
subjective quality.
Introduction: Coding of image sequences will always result in some
information being lost, especially at low bit rates. With sign languages
being visual languages, good image quality is necessary for under-
standing. Information loss should be localised so that it does not
significantly impair sign language comprehension.
In this Letter we describe a foveated approach to coding sign
language sequences with H.264 at low bit rates. We base our proposal
on the results of an eye tracking study that we have conducted which
allows us to analyse the visual attention of sign language viewers.
Foveated processing is applied prior to coding in order to produce a
foveation map which drives the quantiser in the encoder. Our coding
approach offers significant bit rate reductions in a way that is compa-
tible with the visual attention patterns of deaf people, as these were
recorded in the eye tracking study.
Eye tracking study: Eleven subjects took part in the experiments,
including deaf people, hearing signers and hearing beginners in
British Sign Language (BSL). The experiments involved watching
four short narratives in BSL. The clips were displayed uncompressed
in the CIF format (352 × 288, 4:2:0) at 25 frames per second (fps).
The Eyelink eye tracking system was used to record the participants' eye-gaze while watching the four clips. Analysis of the results [1], and mainly of the fixation locations and durations (i.e. the loci at which eye-gaze is directed, and for how long), showed that sign language viewers, excluding the
hearing beginners, seem to concentrate on the facial area and
especially the mouth. Most of the participants never looked at the hands, while only a few showed just a small tendency to do so. In contrast, hearing beginners did look at the hands more
frequently (mainly due to a lack of understanding). Fig. 1 summarises
some of the results in terms of the vertical position y of the recorded
fixation points for a number of subjects and for one of the clips.
Fig. 1 Vertical position of fixation points for the first 250 frames of eye-tracking clip 2, for four participants (participants 2, 3, 5 and 8)
Foveated processing: Foveated video compression [2, 3] aims to
exploit the fall off in spatial resolution of the human visual system
away from the point of fixation in order to reduce the bandwidth
requirements of compressed video. A problem associated with foveated
processing is the fact that the viewer’s point of fixation has to be known,
something that in practice requires real-time tracking of the viewer’s eye-
gaze. In our case (sign language viewers) and based on our study, the
point of fixation lies almost always on the face, and specifically close to
the mouth, thus removing the need to track the viewer’s eye-gaze.
We have followed the local bandwidth approach as described in [3].
According to this method the video image is partitioned into eight
different regions based on their eccentricity (effectively distance from
the centre of fixation) with the regions being constrained to be the
union of disjoint macroblocks (MBs). The formula used to calculate the
foveation regions together with suggested fitting parameters are given in
[1] and [2]. Eccentricity e depends on viewing distance and is given by:

e = \tan^{-1}\left(\frac{d(x)}{N v}\right)    (1)

where d(x) is the Euclidean distance of pixel x from the fixation point, N is the width of the image and v is the viewing distance (in hundreds of pixels), with all distances and co-ordinate measurements being normalised to the physical dimensions of pixels on the viewing screen.
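For illustration, a minimal Python sketch of this region computation is given below. The macroblock partition and the use of equation (1) follow the text; the eccentricity thresholds that delimit the eight regions, and the fixation point near the mouth, are hypothetical placeholders, since the actual fitting parameters are those referenced in [1] and [2].

```python
import numpy as np

def foveation_region_map(width=352, height=288, fix_x=176, fix_y=96,
                         v=3.0, mb_size=16):
    """Partition a frame into foveation regions at macroblock granularity.

    Eccentricity follows equation (1): e = arctan(d(x) / (N * v)), with
    d(x) the distance of the MB centre from the fixation point (pixels),
    N the image width and v the viewing distance in hundreds of pixels.
    The thresholds delimiting the eight regions are assumed values, not
    the fitting parameters of [1, 2].
    """
    thresholds = [1.5, 3.0, 4.5, 6.0, 7.5, 9.0, 10.5]  # degrees (assumed)
    mbs_x, mbs_y = width // mb_size, height // mb_size
    region_map = np.zeros((mbs_y, mbs_x), dtype=int)
    for my in range(mbs_y):
        for mx in range(mbs_x):
            cx = mx * mb_size + mb_size / 2.0   # macroblock centre
            cy = my * mb_size + mb_size / 2.0
            d = np.hypot(cx - fix_x, cy - fix_y)
            ecc = np.degrees(np.arctan(d / (width * v)))
            region_map[my, mx] = int(np.searchsorted(thresholds, ecc))
    return region_map
```

Called with its defaults, foveation_region_map() returns an 18 × 22 map of region indices 0-7 for a CIF frame, analogous in spirit to the map shown in Fig. 2.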
Foveated processing produces a map showing the region each MB
belongs to for each frame. Fig. 2 shows one such map for v = 3
alongside the corresponding frame of sequence ‘FDD’. In [1] we
have used such maps to pre-filter the MBs of each region with a
lowpass filter, which had a lower cutoff frequency for MBs of higher
eccentricity. In this work we use the foveation map to assign a different
quantisation parameter (QP) to each MB with MBs lying in regions
away from the point of fixation being allocated a higher QP.
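The pre-filtering variant used in [1] can be sketched in the same framework. The code below is a rough approximation only: it substitutes a Gaussian blur whose strength grows with region index for the cutoff-frequency design described in [1], and it filters each macroblock independently (a real implementation would blend across block boundaries to avoid edge artefacts).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def foveated_prefilter(luma, region_map, mb_size=16, max_sigma=2.5):
    """Blur each macroblock according to its foveation region.

    `luma` is a 2D greyscale frame (8-bit samples assumed). Region 0
    (face area) is left untouched; outer regions are blurred more.
    The Gaussian filter and sigma schedule are illustrative assumptions,
    not the lowpass filters of [1].
    """
    out = luma.astype(np.float32).copy()
    top_region = max(int(region_map.max()), 1)
    for my in range(region_map.shape[0]):
        for mx in range(region_map.shape[1]):
            r = int(region_map[my, mx])
            if r == 0:
                continue                      # full resolution around the face
            sigma = max_sigma * r / top_region
            y0, x0 = my * mb_size, mx * mb_size
            block = luma[y0:y0 + mb_size, x0:x0 + mb_size].astype(np.float32)
            out[y0:y0 + mb_size, x0:x0 + mb_size] = gaussian_filter(block, sigma)
    return np.clip(out, 0, 255).astype(luma.dtype)
```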
Fig. 2 Foveation MB map for one frame of clip 'FDD' with v = 3, region numbers also shown
H.264 coding: The number of bits required for coding each MB in a
video frame depends on the level of activity in the MB, the effec-
tiveness of the prediction, and the quantisation step size. The latter is
controlled by the quantisation parameter (QP). The value of QP
controls the amount of compression and corresponding fidelity
reduction for each MB. The foveation map described in the previous
step is combined with a given range of QP values to produce MB
regions that will be quantised with different step sizes, with the step
size getting bigger for regions of higher eccentricity. More specifi-
cally, an algorithm was written which ensures that outer regions
always have their QP increased before inner regions, and that the
highest QP in the range is assigned to the lowest priority region. For
example, for QP_min (minimum) = 30 and QP_max (maximum) ranging from 30 to 40 (i.e. for a QP range of 0 to 10) we get the QP allocations shown in Table 1 for the eight different foveation regions (region 0 is the highest priority region around the face, the radius of which should be specified).
Table 1: Region QP assignments for different QP ranges

              Region
QP range    0   1   2   3   4   5   6   7
    0      30  30  30  30  30  30  30  30
    1      30  30  30  30  30  30  30  31
    2      30  30  30  30  30  30  31  32
    3      30  30  30  30  30  31  32  33
    4      30  30  30  30  31  32  33  34
    5      30  30  30  31  32  33  34  35
    6      30  30  31  32  33  34  35  36
    7      30  31  32  33  34  35  36  37
    8      30  31  32  33  34  35  36  38
    9      30  31  32  33  34  35  37  39
   10      30  31  32  33  34  36  38  40
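One allocation rule consistent with this description, and which reproduces Table 1 above for QP_min = 30 and QP_max = 30 to 40, is sketched below. It is a plausible reading of the rule, not a listing of the algorithm actually used in the modified encoder.

```python
def region_qp_assignments(qp_min, qp_max, num_regions=8):
    """Assign a QP to each foveation region (0 = face, highest priority).

    The QP range is spread over the boundaries between adjacent regions,
    always incrementing the outermost boundaries first, so outer regions
    have their QP raised before inner ones and the outermost region gets
    the highest QP. For qp_min = 30 this reproduces Table 1.
    """
    steps = [0] * (num_regions - 1)   # QP increment between regions k-1 and k
    remaining = qp_max - qp_min
    while remaining > 0:
        for j in range(num_regions - 2, -1, -1):   # outermost boundary first
            steps[j] += 1
            remaining -= 1
            if remaining == 0:
                break
    qps, qp = [qp_min], qp_min
    for s in steps:
        qp += s
        qps.append(qp)
    return qps

# e.g. region_qp_assignments(30, 40) -> [30, 31, 32, 33, 34, 36, 38, 40]
```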
When a variable QP (VQP) is used a small overhead is introduced in
the bit stream due to coding of the different QP values of MBs lying
on region borders. The overhead for the coded ‘FDD’ sequence
(268 frames) with v = 3 and QP range 30-40 was approximately 2.925 kbit/s.
Table 2: Results of the proposed VQP approach, with and without (foveated) pre-filtering, against results with constant QP (CQP), with the same QP assignment for MBs corresponding to the face region

'FDD' - foveated filtering (FV) + variable QP (VQP) (30-40); original (CQP 30): 112.84 kbit/s

  v    VQP rate (kbit/s)   VQP reduction (%)   FV+VQP rate (kbit/s)   FV+VQP reduction (%)
  4         78.51               30.43                74.76                  33.75
  3         74.94               33.58                71.84                  36.33

'Moving' - foveated filtering (FV) + variable QP (VQP) (30-40); original (CQP 30): 211.51 kbit/s

  v    VQP rate (kbit/s)   VQP reduction (%)   FV+VQP rate (kbit/s)   FV+VQP reduction (%)
  4        137.52               34.98               125.15                  40.83
  3        130.67               38.22               118.17                  44.13
Fig. 3 One decoded frame coded with (left) constant QP = 30 (CQP) and (right) variable QP = 30-40 with v = 4 (VQP)
a 'FDD' clip, VQP bit rate 30% less than CQP
b 'Moving' clip, VQP + pre-filtering bit rate 40% less than CQP
Results: The H.264 reference software was modified to enable vari-
able quantisation based on a given foveation map. The foveated
processing unit which supplies the foveation map can also apply
pre-filtering based on the same map. Two sequences were used,
namely ‘FDD,’ a blue (plain) background sequence, and ‘Moving’,
a moving background sequence. The output bitstreams conform to the
baseline profile. The input frame rate was 25 fps, and the output frame
rate 12.5 fps. Results for the two sequences are shown in Table 2.
One decoded frame from each clip is shown in Fig. 3. It can be seen
that a significant reduction of bit rate is achieved while keeping the
quality of the important regions high (the face and the surrounding
MBs). The subjective quality of the whole frame is also very good.
Pre-filtering can offer approximately an additional 6% improvement
in compression efficiency.
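The modification to the reference software itself is not shown in the Letter, but the per-macroblock QP map that such a variable-QP encoder would consume can be assembled from the sketches above; the following is hypothetical glue code combining foveation_region_map and region_qp_assignments, not the actual change made to the H.264 reference software.

```python
import numpy as np

def macroblock_qp_map(region_map, qp_min=30, qp_max=40):
    """Translate a foveation region map (one region index per MB) into the
    per-macroblock QP map applied by a variable-QP encoder. Illustrative
    only; not the actual encoder modification described in the text."""
    qps = np.array(region_qp_assignments(qp_min, qp_max))
    return qps[region_map]      # look up each MB's QP by its region index

# Example: qp_map = macroblock_qp_map(foveation_region_map(v=4.0), 30, 40)
```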
Conclusion: A foveated approach to sign language video coding is
presented for lowering the bit rate requirements of sign language
video without significantly affecting subjective quality (especially
from a deaf viewer’s point of view). Results show that a reduction
of over 30% can be achieved while keeping the quality of important
regions high. The proposed approach is based on the results of our eye
tracking study which showed that experienced sign language viewers
concentrate on the face and especially the mouth region.
© IEE 2003    3 October 2003
Electronics Letters Online No: 20031140
DOI: 10.1049/el:20031140
D. Agrafiotis, N. Canagarajah and D.R. Bull (Image Communications
Group, Centre for Communications Research, University of Bristol,
Woodland Road, BS8 1UB, United Kingdom)
M. Dye (Centre for Deaf Studies, University of Bristol, Bristol BS8
2TN, UK)
M. Dye: Now with Department of Brain and Cognitive Sciences,
University of Rochester, Rochester, NY, USA.
References
1  AGRAFIOTIS, D., et al.: 'Optimised sign language video coding based on eye-tracking analysis'. SPIE Int. Conf. on Visual Communications and Image Processing (VCIP), Lugano, Switzerland, July 2003
2  GEISLER, W.S., and PERRY, J.S.: 'A real-time foveated multiresolution system for low-bandwidth video communication', SPIE Proc., 1998, 3299
3  SHEIKH, H.R., et al.: 'Real time foveation techniques for H.263 video encoding in software'. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2001, Vol. 3, pp. 1781-1784
Video coding techniques employ characteristics of the human visual system (HVS) to achieve high coding efficiency. Lee (2000) and Bovik have exploited foveation, which is a non-uniform resolution representation of an image reflecting the sampling in the retina, for low bit-rate video coding. We develop a fast approximation of the foveation model and demonstrate real-time foveation techniques in the spatial domain and discrete cosine transform (DCT) domain. We incorporate fast DCT domain foveation into the baseline H.263 video encoding standard. We show that DCT-domain foveation requires much lower computational overhead but generates higher bit rates than spatial domain foveation. Our techniques do not require any modifications of the decoder