Joint Optimization of Word Alignment and Epenthesis Generation for Chinese to Taiwanese Sign Synthesis

Yu-Hsien Chiu, Chung-Hsien Wu, Senior Member, IEEE, Hung-Yu Su, and Chih-Jen Cheng

Y.-H. Chiu is with the Home Network Technology Center, Industrial Technology Research Institute, N200, ITRI Bldg. R1, No. 31, Gongye 2nd Rd., Annan District, Tainan City 709, Taiwan, ROC. E-mail: chiuyh@itri.org.tw.

C.-H. Wu, H.-Y. Su, and C.-J. Cheng are with the Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan City 701, Taiwan, ROC. E-mail: {chwu, elfsu, chengc}@csie.ncku.edu.tw.

Manuscript received 1 Dec. 2005; revised 27 Apr. 2006; accepted 24 May 2006; published online 13 Nov. 2006. Recommended for acceptance by T. Darrell. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 28-39, Jan. 2007. DOI: 10.1109/TPAMI.2007.15.
Abstract—This work proposes a novel approach to translate Chinese into Taiwanese Sign Language (TSL) and to synthesize sign videos. An aligned bilingual corpus of Chinese and TSL with linguistic and signing information is also presented for sign language translation. A two-pass alignment at the syntax level and the phrase level is developed to obtain the optimal alignment between Chinese sentences and Taiwanese sign sequences. For sign video synthesis, a scoring function is presented to develop motion transition-balanced sign videos with rich combinations of intersign transitions. Finally, the maximum a posteriori (MAP) algorithm is employed for sign video synthesis based on joint optimization of two-pass word alignment and intersign epenthesis generation. Several experiments are conducted in an educational environment to evaluate the performance on the comprehension of sign expression. The proposed approach outperforms IBM Model 2 in sign language translation. Moreover, deaf students perceived sign videos generated by the proposed method to be satisfactory.

Index Terms—Taiwanese sign language, language translation, sign language synthesis, video concatenation.
1 INTRODUCTION

Sign language is a visual/gestural language that serves as the primary means of communication for deaf individuals, just as spoken languages are used among the hearing [1], [2]. Deaf individuals encounter the difficulty that most hearing individuals communicate in spoken language, so a language barrier exists between these two populations. Media, such as books, newspapers, and TV news, are presented visually in written and spoken language, rather than in the sign language with which deaf people are most familiar. Conversely, hearing people who use sign language as their second language also experience difficulties with sign grammar and production [3]. The current state of the technologies that provide deaf and hearing individuals with access to information and communication with each other is inadequate [4], [5]. Most computerized grammar checkers and language translators do not meet the specific requirements for translating written or spoken language into sign language. Accordingly, this work designs a sign generation system that exploits the concurrent linguistic and gestural characteristics of Taiwanese Sign Language (TSL) to enable sign expression to be learned intuitively.
Sign language translation research has concentrated on generating written language from sign language by using image recognition and virtual-reality gloves [6], [7], [8], [9], [10], [11]. Gesture features involving hand shapes, movement, position, and palm orientation have been employed to recognize and segment sign image sequences. However, these methods focus on image processing more than on machine translation and sign language analysis, and have limited vocabularies.

Natural language processing is typically used to translate from written to sign language. Brown [3] developed a transfer-based model using lexical functional grammar to create a representation of American Sign Language (ASL) from English, in which correspondences between English and ASL are defined manually. He also reviewed other work and the limitations relating to computational linguistics and ASL. Many statistical approaches for machine translation and language modeling have recently been addressed [12], [13]. Alignment probability estimation is applied to resolve hard decisions at any level of the translation process, such as the levels of words, phrases, and sentences. Representative systems include the IBM Models [14] and Verbmobil [15]. The IBM models, introduced by Brown et al., are a series of five statistical models for machine translation, which quantify information such as the alignment of word positions and the translation of word pairs from bilingual corpora. Since the IBM model is a statistical approach, it requires a large corpus for model training, so it is not a good choice for machine translation between languages when only a small bilingual corpus is available.
Several researchers have proposed various approaches for the visual display of sign language using signing avatars and 3D animation [16], [17], [18], [19], [20] based on motion generation [21], [22], [23], [24], [25]. Kennaway [16] developed an avatar approach, based on motion capture and virtual reality, to synthesize sign animation from a high-level description of signs in terms of the HamNoSys transcription system. Synthesizing realistic 3D animation not only requires driving an avatar precisely but also smoothing the textures and movements and, accordingly, is time-consuming. Related work based on 3D animation and on talking heads was considered in [21], [22], [23], [24], [25] and [26], [27], [28],
respectively, but they are impractical due to real-time production issues and computational complexity. Since synthesizing realistic video frames only has to consider the smoothness of the concatenated videos, the dimension reduction (from 3D to 2D) greatly decreases the complexity of the synthesis system. Solina and Krapež [29] presented an approach for concatenating sign video clips into video clips of complete sign sequences of Slovene Sign Language (SSL). They eliminated redundant moves of the hands and produced smooth transitions between sign videos. Their concatenation criterion is based on the similarity of palm positions between video clips. However, the motion in the synthesized video sequence still suffers from discontinuity because only the difference in palm positions is considered.
This work proposes a sequence correspondence model between Chinese and TSL based on the hierarchical alignment of different levels of grammatical components. This hierarchical two-pass alignment scheme reduces the complexity of the standard IBM Model 2 and is suitable for translation with a small bilingual corpus. Grammatical components at the syntax and phrase levels are defined to tokenize sentences into sequences of primary units (PUs) and secondary units (SUs). Consecutive secondary units are further concatenated into a phrase fragment (PF), which is also regarded as a PU at the syntactic level. In the first-pass alignment, primary units, including phrase fragments, are aligned at the syntax level. In the second-pass alignment, secondary units within the phrase fragments are aligned at the phrase level. An alignment approach modified from IBM Model 2 is presented to model the correspondence between the transliterated word/sentence pairs. A sign video database with motion transition information was developed for sign synthesis. Finally, the maximum a posteriori (MAP) algorithm is employed for sign synthesis based on joint optimization of two-pass word alignment and intersign epenthesis generation. Fig. 1 displays the system diagram for sign translation and epenthesis generation.

The remainder of this paper is organized as follows: Section 2 describes the development of several significant corpora applied in this work. Section 3 then presents a model that translates Chinese into concatenated sign videos. Next, Section 4 summarizes the experimental results. Conclusions are finally drawn in Section 5, along with suggestions for future research.
2 CORPUS DEVELOPMENT

Sign language is a natural language based on vision and space. The meaning of a sign is richer than that of a word. The bilingual corpus used herein was annotated with information concerning grammar and alignment from Chinese to TSL. Spatial information, such as initial/final gestures, hand positions, and movement types, was manually tagged. To define the hand position, the signing space of the sign image was simplified in this study to 25 regions. Since the intersign region between the final region of the preceding sign video and the initial region of the succeeding sign video is crucial for achieving smooth video concatenation, all possible movement transitions between two signs have to be included in the corpus, called the transition-balanced corpus, for sign video filming. Fig. 2 illustrates a frame of the sign-word "sister," which comprises the simplified signing space and the possible movements of the right hand. The spatial information of a sign is annotated to build a sign transition-balanced corpus.
2.1 Bilingual Corpus with Annotated TSL Information

A total of 2,159 frequently used signs were gathered from teaching materials for deaf students in primary schools [30]. Professor Jane S. Tsay of the Department of Linguistics at National Chung Cheng University, Taiwan, annotated the TSL information. Table 1 lists several examples with different transliteration patterns in the aligned parallel corpus. The syntactic information was extracted automatically using the CKIP AutoTag system [31] during the annotation process. A Chinese lexicon and parser were employed to obtain the part-of-speech (POS) sequence and the grammatical structure of each sentence. The HowNet knowledge base developed by Dong [32] was adopted to analyze semantic roles, such as agents, times, and places.
Two grammatical components, primary units (PUs) and secondary units (SUs), as listed in Table 2, are defined by analyzing the bilingual sentences. Primary units are defined as the main components of a Chinese sentence, such as verbs, specific nouns (including proper nouns, places, and time nouns), adverbs (for modifying verbs), and conjunctions. Secondary units are defined as the phrases that contain common nouns, quantity nouns, adjectives, and adverbs (for modifying adjectives). The bilingual corpus can be tokenized and denoted as PUs and SUs based on the defined grammatical components. Furthermore, consecutive SUs are grouped as a phrase fragment (PF), which is treated as a PU.

Fig. 1. System diagram for sign translation and epenthesis generation.

Fig. 2. Simplified signing space. The dotted arrows indicate the possible transitions.
Fig. 3 shows the tokenization process for the sentence, “This
meeting needs five sign language (SL) translators to help.”
According to Table 2, the Chinese sentence can be tokenized
as {SU1: This (Nep), SU2: meeting (Na), PU1: need (VK), SU3: five (Neu), SU4: SL (Na), SU5: translator (Na), PU2: help (VC)}. The consecutive SUs can be further concatenated and represented as phrase fragments (PFs). Hence, the Chinese sentence is represented as:

[PF1: {SU1: This (Nep), SU2: meeting (Na)}, PU1: need (VK), PF2: {SU3: five (Neu), SU4: SL (Na), SU5: translator (Na)}, PU2: help (VC)].
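The tokenization into PUs, SUs, and PFs can be sketched as a single pass over a POS-tagged sentence. The sketch below is a minimal Python illustration; the POS-to-unit lookup is a hypothetical subset covering only the tags of the example above, whereas the actual mapping is the one defined in Table 2 and Appendix A.

```python
# Minimal sketch of PU/SU/PF tokenization (hypothetical tag subset; see Table 2).
PU_TAGS = {"VK", "VC"}          # assumed: the verbs in the example act as primary units
SU_TAGS = {"Nep", "Na", "Neu"}  # assumed: determiner/common-noun/number tags as secondary units

def tokenize(tagged_words):
    """tagged_words: list of (word, pos) pairs. Returns a list whose elements are
    either a single (word, pos) primary unit or a phrase fragment, i.e., a list
    of consecutive secondary units."""
    units, pf = [], []
    for word, pos in tagged_words:
        if pos in SU_TAGS:
            pf.append((word, pos))          # extend the current phrase fragment
        else:
            if pf:
                units.append(pf)            # close the phrase fragment as one PU
                pf = []
            units.append((word, pos))       # primary unit
    if pf:
        units.append(pf)
    return units

sentence = [("This", "Nep"), ("meeting", "Na"), ("need", "VK"),
            ("five", "Neu"), ("SL", "Na"), ("translator", "Na"), ("help", "VC")]
print(tokenize(sentence))
# [[('This', 'Nep'), ('meeting', 'Na')], ('need', 'VK'),
#  [('five', 'Neu'), ('SL', 'Na'), ('translator', 'Na')], ('help', 'VC')]
```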
In the developed parallel corpus, 1,983 Chinese sentences were aligned to TSL sequences, yielding 3,966 aligned parallel sentences. The average sentence length was 5.8 sign words. Table 3 shows the average number of occurrences of grammatical components in each sentence. Appendix A presents the grammatical components and POSs defined in this work. The model extracted 1,983 distinct aligned primary unit sequences (average length = 4.6) and 639 distinct aligned secondary unit sequences (average length = 2.7). The complexity of the language model was evaluated according to perplexity, defined as the average word branching factor of a language model:
$$\mathrm{Perplexity} = 2^{H} = \left[ \prod_{i=1}^{N} P\!\left(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_{i-N+1}\right) \right]^{-\frac{1}{N}}, \tag{1}$$

where $H$ is the estimated entropy, i.e., the average amount of information in a given word sequence $\{w_1, w_2, \ldots, w_N\}$. A low perplexity value usually indicates a good language model. This work used the bi-gram model as the baseline for comparison with the POS-based and two-pass alignment models. Table 4 presents the perplexity comparison of the three models.
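For reference, a bi-gram perplexity in the sense of (1) can be computed as the inverse geometric mean of the word probabilities over a test corpus. The Python sketch below assumes a callable bigram_prob(prev, w) that returns a smoothed probability P(w | prev); it is an illustration, not the evaluation code used in the paper.

```python
import math

def bigram_perplexity(test_sentences, bigram_prob):
    """Perplexity = 2^H, with H the average negative log2 probability per word,
    which is equivalent to the inverse geometric mean in (1)."""
    log_prob, count = 0.0, 0
    for sentence in test_sentences:
        for prev, w in zip(sentence[:-1], sentence[1:]):
            log_prob += math.log2(bigram_prob(prev, w))
            count += 1
    entropy = -log_prob / count   # estimated entropy H
    return 2 ** entropy
```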
TABLE 1. Examples of the Aligned Parallel Texts

TABLE 2. A Set of Grammatical Components

2.2 Transition-Balanced Corpus and Database Development
Since the intersign region between the final region of the
preceding sign video and initial region of the succeeding sign
video is crucial for obtaining smooth video concatenation, all
possible movement transitions between two signs need to be
included in the transition-balanced corpus for sign video
filming. The TSL linguist helped annotate the hand positions
defined in the simplified sign space. The sign database
consisted of 891 two-handed gestures, 417 right-handed
gestures, and eight left-handed gestures (without compound
sign words). Fig. 4 illustrates the number of occurrences of the
initial and final positions with the right-handed gestures
displayed as a 2D gray-scale image on the simplified signing
space. Most initial and final positions of the right-handed
gestures occur in front of the face. Fig. 5 displays the number
of occurrences of initial and final positions with two-handed
gestures. Most initial and final positions of the two-handed
gestures occur in front of the chest.
Fig. 3. An instance of Chinese to TSL translation using the two-pass alignment approach.

TABLE 3. Average Number of Occurrences of Grammatical Components in Each Sentence

TABLE 4. Perplexity Evaluation Results for the Baseline, the POS-Based, and the Grammatical Component-Based Approaches

To obtain all possible concatenation information between two signs, a POS-based sign word replacement method is presented to build a large set of sign sequences. Every sign word in the 1,983 aligned parallel sentences was replaced by a sign word with the same POS in the sign database (2,159 frequently used signs). For instance, given the sign sequence "Mother is very busy," {(Mother)/Na, (Busy)/VH, (Very)/Dfa}, the sign words with the same POS, {(Student)/Na} and {(Brother)/Na}, can be applied to replace {(Mother)/Na} and generate two expanded sign sequences. In this example, the motion transitions {(Student) → (Busy)} and {(Brother) → (Busy)} can be obtained by the replacement method. To select suitable sentences for video filming, a scoring method is adopted to rank and screen each expanded sentence $S_i = w^i_1, \ldots, w^i_j, \ldots, w^i_{N(S_i)}$:
$$\mathrm{STS}(S_i) = \left( \prod_{j=1}^{N(S_i)} P\!\left(w^i_j\right) \right)^{\frac{1}{N(S_i)}} \left( \prod_{j=1}^{N(S_i)-1} P\!\left(w^i_{j+1} \mid w^i_j\right) \right)^{\frac{1}{N(S_i)-1}}, \tag{2}$$
where $N(S_i)$ is the number of sign words in sentence $S_i$, $P(w^i_j)$ denotes the word frequency of $w^i_j$, and $P(w^i_{j+1} \mid w^i_j)$ represents the bi-gram probability. This equation reveals that a corpus with higher sentence scores contains more frequently used and grammatically correct sign words.
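A minimal Python sketch of the sentence score in (2) follows; unigram_prob and bigram_prob are assumed callables estimated (with smoothing) from the expanded corpus, and sentences are assumed to contain at least two sign words.

```python
import math

def sts(sentence, unigram_prob, bigram_prob):
    """STS(S_i) of (2): geometric mean of word frequencies times geometric
    mean of bi-gram probabilities over the sign words of the sentence."""
    n = len(sentence)
    uni = math.exp(sum(math.log(unigram_prob(w)) for w in sentence) / n)
    bi = math.exp(sum(math.log(bigram_prob(w1, w2))
                      for w1, w2 in zip(sentence[:-1], sentence[1:])) / (n - 1))
    return uni * bi
```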
Based on this corpus, a sentence selection algorithm was designed to extract the smallest possible number of sentences containing all possible transition information. First, the number of occurrences of the possible transition patterns for each sign word extracted from the expanded corpus was counted. A transition pattern is defined as the video clip between the final hand position of sign word $w_j$ and the initial hand position of the succeeding sign word $w_{j+1}$. For each sign word, a $25 \times 25$ transition number matrix was defined with respect to the $25 \times 25$ possible movements from each of the 25 regions in the simplified signing space to the others. An example of the transition occurrence matrix $T_j$ of sign word $w_j$ is represented as
$$T_j = \begin{array}{c|ccccc}
 & I_1 & I_2 & \cdots & I_{24} & I_{25} \\ \hline
F_1 & 0 & 0 & \cdots & 12 & 0 \\
F_2 & 0 & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
F_{24} & 0 & 0 & \cdots & 0 & 0 \\
F_{25} & 1 & 0 & \cdots & 0 & 5
\end{array}, \tag{3}$$
A large value in the transition matrix indicates a frequently occurring transition pattern in the expanded corpus. The candidate sentence is chosen by considering both the frequency and the values in the transition matrices of the sign words in the sentence. Therefore, an information scoring scheme for selecting the most appropriate sentences is defined as follows:

$$\mathrm{Select}(S_i) = \mathrm{STS}(S_i) \cdot \frac{NTP(S_i)}{TP(S_i)}, \tag{4}$$
where $NTP(S_i)$ denotes the number of transition patterns in the expanded sentence $S_i$ that are not included in the balanced corpus, and $TP(S_i)$ represents the total number of transition patterns in the expanded sentence $S_i$. A large value of $NTP(S_i)/TP(S_i)$ indicates that sentence $S_i$ contributes many transition patterns and, therefore, has high priority for selection. For each transition pattern of a given sign word, only the sentence with the highest $\mathrm{Select}(S_i)$ value is chosen and added to the balanced corpus.
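The selection scheme in (4) can be sketched as a greedy loop that repeatedly picks the highest-scoring sentence until no sentence adds a new transition pattern. This is only an illustration of the idea, assuming helper functions transition_patterns(s), returning the set of (sign word, final region, initial region) patterns of a sentence, and sts_score(s), returning the score from (2).

```python
def select_balanced_corpus(expanded_sentences, sts_score, transition_patterns):
    """Greedy sketch of the information scoring scheme in (4)."""
    covered, balanced = set(), []
    remaining = list(expanded_sentences)
    while remaining:
        def select_score(s):
            tp = transition_patterns(s)
            ntp = tp - covered                      # patterns not yet in the corpus
            return sts_score(s) * len(ntp) / len(tp) if tp else 0.0
        best = max(remaining, key=select_score)
        if not (transition_patterns(best) - covered):
            break                                   # nothing new would be added
        balanced.append(best)
        covered |= transition_patterns(best)
        remaining.remove(best)
    return balanced
```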
Many online resources involving sign images and videos can be found for sign video synthesis. However, most sign video databases lack the transition information from one sign word to the next. The initial and final gestures in these sign video databases are generally performed in front of the abdomen, so a discontinuity problem typically appears in the video synthesis output. This study applied an alternative way of acquiring sign videos from sentence-level videos. To achieve this purpose, a systematic method was designed to develop a sign video database with rich information about motion transitions between sign videos.
Fig. 4. Number of occurrences of (a) initial and (b) final positions associated with right-handed gestures displayed as a 2D gray-scale image on the simplified sign space.

Fig. 5. Number of occurrences of (a) initial and (b) final positions associated with two-handed gestures illustrated as a 2D gray-scale image on the simplified sign space.
A sentence-level sign video database was developed from the transition-balanced corpus. The imaging conditions included a fixed distance between the signer and the camera and consistent clothing and lighting during a two-month filming period. All videos were calibrated for lighting conditions and signing space positions. The same linguist who conducted the annotation of hand positions also helped segment the sign videos. Each sign video is annotated with the complete signing process from the initial gesture to the final gesture and with the transition regions between the preceding and succeeding sign words.
According to the sentence selection algorithm, 409 transition-balanced sentences were selected from the expanded sentence database for inclusion in the transition-balanced corpus. The corresponding video sequences were filmed,
segmented, and annotated based on this corpus. The linguist
helped segment the video sequences and annotate the
positions of the initial and final gestures and the transition
regions of each sign video. In total, 1,810 sign videos were
extracted. The average number of video frames in the initial
and final transition regions was 7.4 and 12.6, respectively.
3 TAIWANESE SIGN TRANSLATION

IBM models [14] have been proven to be helpful for machine translation. IBM Model 2, which is based on word-by-word alignment, models the translation of a source-language sentence $S$ with length $m$ into a target-language sentence $T$ with length $l$ as follows:

$$\tilde{T} = \arg\max_{T} P(T \mid S) = \arg\max_{T} P(S \mid T)\, P(T), \tag{5}$$
where $P(S \mid T)$ denotes the translation model and $P(T)$ is the language model of the target language $T$. The translation model $P(S \mid T)$ is estimated as

$$P(S \mid T) = P(S, a \mid T) = \varepsilon(m \mid l) \prod_{j=1}^{m} \sum_{i=0}^{l} t\!\left(S_j \mid T_i\right) a(i \mid j, l, m), \tag{6}$$

where $\varepsilon(m \mid l)$ denotes the string length probability, $t(S_j \mid T_i)$ represents the probability of translating $S_j$ to $T_i$, and $a(i \mid j, l, m)$ is the alignment probability of aligning source position $j$ to target position $i$, given source length $m$ and target length $l$. The most probable target-language sentence $\tilde{T}$ is then obtained by IBM Model 2 as
$$\tilde{T} = \arg\max_{T} P(T \mid S) = \arg\max_{T} P(S \mid T)\, P(T) = \arg\max_{T} \left[ \varepsilon(m \mid l) \prod_{j=1}^{m} \sum_{i=0}^{l} t\!\left(S_j \mid T_i\right) a(i \mid j, l, m) \right] P(T). \tag{7}$$
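As a reference for (6), the translation model of IBM Model 2 can be transcribed directly. The Python sketch below is an illustration under stated assumptions: t, a, and eps are assumed lookup functions for the translation, alignment, and string length probabilities, and position i = 0 plays the role of the empty (NULL) target word of the IBM formulation.

```python
def ibm2_translation_prob(S, T, t, a, eps):
    """P(S | T) of IBM Model 2, as in (6). S and T are token lists of
    lengths m and l, respectively."""
    m, l = len(S), len(T)
    prob = eps(m, l)
    for j in range(1, m + 1):                      # source positions 1..m
        prob *= sum(t(S[j - 1], T[i - 1] if i > 0 else None) * a(i, j, l, m)
                    for i in range(0, l + 1))      # i = 0: empty (NULL) word
    return prob
```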
Since IBM Model 2 is a statistical model, it requires a large corpus for model training, and it suffers from data sparseness for TSL translation when only a small bilingual corpus is available. This study presents an approach to modeling the sequence correspondence between Chinese and TSL using the hierarchical alignment of different levels of grammatical components. This hierarchical two-pass alignment lowers the complexity of the standard IBM Model 2 and is suitable for translation using a small bilingual corpus. For a Chinese sentence, function words and quantifiers are first removed to generate the Chinese grammatical component sequence, including PUs and SUs. Consecutive SUs are further concatenated as a complete unit and defined as a PF. For a given Chinese sentence, a Chinese PU sequence is generated as $CPU_1^N = CPU_1, CPU_2, \ldots, CPU_N$ with $N$ PUs, including PFs. An alignment-based translation model modified from IBM Model 2 was used for Chinese and TSL sequence alignment. The Chinese PU sequence $CPU_1^N$ is aligned to the sign PU sequence $SPU_1^N = SPU_1, SPU_2, \ldots, SPU_N$, with equal sequence length $N$, in the syntax-level (first-pass) alignment. In the phrase-level (second-pass) alignment, the Chinese SU sequence $CSU_1^{M_i} = CSU_1, CSU_2, \ldots, CSU_{M_i}$ within the $i$th PF, $PF_i$, with length $M_i$, is further aligned to a sign SU sequence $SSU_1^{M_i} = SSU_1, SSU_2, \ldots, SSU_{M_i}$. Fig. 3 shows an example of the two-pass alignment. Table 5 shows an example of the bilingual sentences with aligned grammatical components. Fig. 6 shows the sign sequence {"I," "Meeting," "Taipei"} translated from the Chinese sentence "I am having a meeting in Taipei." The sign-word "I" is a one-handed sign, and the sign-words "Meeting" and "Taipei" are two-handed signs. Each sign can be denoted as an initial gesture (hand shape), followed by a motion (palm orientation) and a final gesture (hand shape). Several criteria must be considered when modeling the intersign hand transition for sign video synthesis.
TABLE 5. Examples of the Aligned Parallel Texts Using Grammatical Components

The two-pass alignment translation model modified from IBM Model 2 is used to choose the sign sequence $S_1^N$ (of length $N$) with the highest probability among all possible target sequences according to Bayes' decision rule:
$$\begin{aligned}
S_1^N &= \arg\max_{S} P\!\left(S_1^N \mid CPU_1^N\right) = \arg\max_{S} P\!\left(CPU_1^N \mid S_1^N\right) P\!\left(S_1^N\right) \\
&= \arg\max_{S} \left( \sum_{SPU_1^N} P\!\left(CPU_1^N, SPU_1^N \mid S_1^N\right) \right) P\!\left(S_1^N\right).
\end{aligned} \tag{8}$$
Although all possible sign PU sequences $SPU_1^N$ should be included in the estimation of the conditional probability, only the maximally aligned sign PU sequence $SPU_1^N$ is considered herein. Therefore, the equation can be simplified as
$$\begin{aligned}
S_1^N &\approx \arg\max_{S} \left( \max_{SPU_1^N} P\!\left(CPU_1^N, SPU_1^N \mid S_1^N\right) \right) P\!\left(S_1^N\right) \\
&\approx \arg\max_{S} P\!\left(CPU_1^N \mid SPU_1^N\right) P\!\left(SPU_1^N \mid S_1^N\right) P\!\left(S_1^N\right) \\
&= \arg\max_{S} P\!\left(CPU_1^N \mid SPU_1^N\right) \left( \sum_{SSU_1^N} P\!\left(SPU_1^N, SSU_1^N \mid S_1^N\right) \right) P\!\left(S_1^N\right) \\
&\approx \arg\max_{S} P\!\left(CPU_1^N \mid SPU_1^N\right) P\!\left(SPU_1^N \mid SSU_1^N\right) P\!\left(SSU_1^N \mid S_1^N\right) P\!\left(S_1^N\right) \\
&\approx \arg\max_{S} P\!\left(CPU_1^N \mid SPU_1^N\right) P\!\left(SPU_1^N \mid SSU_1^N\right) P\!\left(SSU_1^N, S_1^N\right),
\end{aligned} \tag{9}$$
where $P(CPU_1^N \mid SPU_1^N)$ and $P(SPU_1^N \mid SSU_1^N)$ denote the alignment probabilities for the primary and secondary units, respectively, and $P(SSU_1^N, S_1^N)$ is defined as the TSL syntactic-visual language model probability, which considers both the syntactic and visual information of signs for intersign epenthesis. These three probabilities are described in the following sections.
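A compact sketch of the simplified decision rule in (9) is given below. It assumes that each candidate sign sequence already carries its best-aligned PU and SU sequences, and that the three assumed probability functions (closing over the input Chinese PU sequence) implement the terms described in Sections 3.1 and 3.2.

```python
def best_sign_sequence(candidates, p_cpu_given_spu, p_spu_given_ssu, p_ssu_and_s):
    """Pick the candidate maximizing the product of the two alignment
    probabilities and the syntactic-visual language model probability of (9)."""
    def score(cand):
        spu, ssu, signs = cand["SPU"], cand["SSU"], cand["S"]
        return (p_cpu_given_spu(spu) *
                p_spu_given_ssu(spu, ssu) *
                p_ssu_and_s(ssu, signs))
    return max(candidates, key=score)
```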
3.1 Alignment Probability Estimation
Consider the sequences of primary and secondary units, in which each unit is assumed to be independent of the others. The modified alignment models can then be estimated as follows:

$$P\!\left(CPU_1^N \mid SPU_1^N\right) = \sum_{a_1^N} P\!\left(CPU_1^N, a_1^N \mid SPU_1^N\right) = \sum_{a_1^N} \left( \prod_{j=1}^{N} P(a_j \mid j)\, P\!\left(CPU_j \mid SPU_{a_j}\right) \right), \tag{10}$$

$$P\!\left(SPU_1^N \mid SSU_1^N\right) = \sum_{b_1^N} P\!\left(SPU_1^N, b_1^N \mid SSU_1^N\right) = \sum_{b_1^N} \left( \prod_{j=1}^{N} P(b_j \mid j)\, P\!\left(SPU_j \mid SSU_{b_j}\right) \right), \tag{11}$$
where $a_j$ is the alignment mapping $j \rightarrow i = a_j$, which aligns $CPU_j$ in position $j$ to $SPU_i$ in position $i = a_j$, and $P(a_j \mid j)$ and $P(b_j \mid j)$ denote the PU and SU alignment probabilities, respectively. Fig. 7 depicts the alignment concept.
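Because each alignment variable $a_j$ in (10) is chosen independently of the others, the sum over all alignments factorizes into a product over positions $j$ of sums over candidate positions $i$, so the probability can be evaluated without enumerating alignments. The Python sketch below illustrates this, with p_align(i, j) and p_trans(cpu, spu) as assumed probability lookups; the same form applies to (11).

```python
def pu_alignment_prob(cpu_seq, spu_seq, p_align, p_trans):
    """P(CPU_1^N | SPU_1^N) of (10), computed position by position."""
    total = 1.0
    for j, cpu in enumerate(cpu_seq, start=1):
        total *= sum(p_align(i, j) * p_trans(cpu, spu_seq[i - 1])
                     for i in range(1, len(spu_seq) + 1))
    return total
```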
Fig. 6. Illustration of the sign sequence {I, Meeting, Taipei} translated from the Chinese sentence "I am having a meeting in Taipei," and the video concatenation.

Fig. 7. Illustration of the first-pass alignment.

3.2 TSL Language Model Probability Considering Intersign Epenthesis

In statistical machine translation, a language model of the target language can be adopted to validate the translation result. The probability of the syntactic-visual language model $P(SSU_1^N, S_1^N)$, based on both the syntactic and visual information of signs for intersign epenthesis, is defined as follows:
$$P\!\left(SSU_1^N, S_1^N\right) \approx \prod_{i=1}^{N} P\!\left(SSU_i \mid SSU_1^{i-1}\right) IE\!\left(S_{i-1}, S_i\right), \tag{12}$$
where $P(SSU_i \mid SSU_1^{i-1})$ denotes the N-gram language model probability and $IE(S_{i-1}, S_i)$, defined in (15), represents the score for intersign epenthesis based on the visual information. To cope with sparse data, the bi-gram language model and Long-Distance Information (LDI) are applied to smooth the language model. The syntactic-visual language model probability $P(SSU_1^N, S_1^N)$ is further defined as
$$P\!\left(SSU_1^N, S_1^N\right) = \prod_{i=1}^{N} \left( P\!\left(SSU_i \mid SSU_{i-1}\right) \cdot \max_{k=1,\ldots,i-1} LDI\!\left(SSU_k, SSU_i\right) \right)^{\frac{1}{2}} IE\!\left(S_{i-1}, S_i\right). \tag{13}$$
The LDI between $SSU_k$ and $SSU_i$ at a varying distance is estimated as follows:

$$LDI\!\left(SSU_k, SSU_i\right) = \frac{P\!\left(SSU_k, SSU_i\right)}{P\!\left(SSU_k\right)}, \quad 1 \le k \le i-1. \tag{14}$$
In this equation, if $LDI(SSU_k, SSU_i) > LDI(SSU_{i-1}, SSU_i)$ for some $1 \le k \le i-1$, then the language model probability is increased by considering the long-distance information; otherwise, the language model probability equals the bi-gram probability.
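A Python sketch of the smoothed syntactic-visual score of (13)-(14) follows. The probability lookups and the list ie_scores of precomputed intersign epenthesis scores are assumptions of the sketch, and the product starts from the second unit of the sequence.

```python
def smoothed_lm_score(ssu_seq, bigram_prob, joint_prob, unigram_prob, ie_scores):
    """Combine the bi-gram probability with the best long-distance term via a
    geometric mean, as in (13), weighted by the epenthesis score IE(S_{i-1}, S_i).
    LDI(SSU_k, SSU_i) = P(SSU_k, SSU_i) / P(SSU_k), as in (14)."""
    score = 1.0
    for i in range(1, len(ssu_seq)):
        bigram = bigram_prob(ssu_seq[i - 1], ssu_seq[i])
        ldi_max = max(joint_prob(ssu_seq[k], ssu_seq[i]) / unigram_prob(ssu_seq[k])
                      for k in range(i))
        # When the maximum is attained at k = i-1, this reduces to the bi-gram term.
        score *= (bigram * ldi_max) ** 0.5 * ie_scores[i]
    return score
```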
The intersign epenthesis score, which estimates the transition smoothness for intersign epenthesis based on the direction and distance between two consecutive signs, is defined as follows:

$$IE(S_{i-1}, S_i) = \frac{1}{1 + e^{D(S_{i-1}, S_i)}}, \tag{15}$$

$$D(S_{i-1}, S_i) = \omega\, D_l\!\left(CurveF(S_{i-1}), CurveI(S_i)\right) + (1 - \omega)\, D_\theta\!\left(CurveF(S_{i-1}), CurveI(S_i)\right), \tag{16}$$
where $\omega$ is the weighting factor, which is determined experimentally. $CurveF(S_{i-1})$ represents the final part of the signing curve of sign $S_{i-1}$ in the intersign transition. Similarly, $CurveI(S_i)$ denotes the initial part of the signing curve of sign $S_i$ in the intersign transition. $D_l(CurveF(S_{i-1}), CurveI(S_i))$ is defined as the normalized Euclidean distance between the hand position of a frame in the final part of the signing curve of sign $S_{i-1}$ and the hand position of a frame in the initial part of the signing curve of sign $S_i$:

$$D_l\!\left(CurveF(S_{i-1}), CurveI(S_i)\right) = D_l\!\left(Z\!\left(x^F_{i-1}, y^F_{i-1}\right), Z\!\left(x^I_i, y^I_i\right)\right) = \frac{1}{\eta}\sqrt{\left(x^I_i - x^F_{i-1}\right)^2 + \left(y^I_i - y^F_{i-1}\right)^2}, \tag{17}$$
where $\eta$ denotes the normalization factor, which equals the diagonal distance of the video frame, and $Z(x, y)$ is the curve function of a given hand motion trajectory, created by the curve-fitting algorithm. $D_\theta(CurveF(S_{i-1}), CurveI(S_i))$ is determined by the included angle between the tangent direction at the hand position of the frame in the final part of the signing curve of sign $S_{i-1}$ and the tangent direction at the hand position of the frame in the initial part of the signing curve of sign $S_i$:

$$D_\theta\!\left(CurveF(S_{i-1}), CurveI(S_i)\right) = \kappa \cos^{-1}\!\left( \frac{\mathbf{F}_{i-1} \cdot \mathbf{I}_i}{\lVert \mathbf{F}_{i-1} \rVert\, \lVert \mathbf{I}_i \rVert} \right), \tag{18}$$

where $\kappa$ denotes the normalization factor, defined as $1/\pi$. The tangent vector is calculated by differentiating the curve function $Z(x, y)$ at the specific point $(x, y)$. These two vectors, $\mathbf{F}_{i-1}$ and $\mathbf{I}_i$, are defined as follows:
$$\mathbf{F}_{i-1} = \left( \frac{\partial Z\!\left(x^F_{i-1}, y^F_{i-1}\right)}{\partial x^F_{i-1}},\; \frac{\partial Z\!\left(x^F_{i-1}, y^F_{i-1}\right)}{\partial y^F_{i-1}} \right), \tag{19}$$

$$\mathbf{I}_i = \left( \frac{\partial Z\!\left(x^I_i, y^I_i\right)}{\partial x^I_i},\; \frac{\partial Z\!\left(x^I_i, y^I_i\right)}{\partial y^I_i} \right). \tag{20}$$
Fig. 8 illustrates the calculation of the optimal points for concatenating two sign videos based on the distance and the tangent direction between the preceding sign $S_{i-1}$ and the succeeding sign $S_i$.
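The epenthesis score of (15)-(18) for one candidate pair of concatenation frames can be sketched as below. In the full system, this score would be evaluated over the frames of the final transition region of $S_{i-1}$ and the initial transition region of $S_i$, and the frame pair with the maximum score would be taken as the concatenation point; the positions, tangent vectors (assumed nonzero), frame diagonal, and weighting factor are the inputs of the sketch.

```python
import math

def epenthesis_score(end_pos, end_tangent, start_pos, start_tangent,
                     frame_diagonal, weight=0.5):
    """IE score of (15) from the normalized distance (17) and angle (18)
    between a frame at the end of the preceding sign's curve and a frame
    at the start of the succeeding sign's curve."""
    (xf, yf), (xi, yi) = end_pos, start_pos
    d_l = math.hypot(xi - xf, yi - yf) / frame_diagonal             # normalized distance
    dot = end_tangent[0] * start_tangent[0] + end_tangent[1] * start_tangent[1]
    norm = math.hypot(*end_tangent) * math.hypot(*start_tangent)
    d_theta = math.acos(max(-1.0, min(1.0, dot / norm))) / math.pi  # normalized angle
    d = weight * d_l + (1 - weight) * d_theta
    return 1.0 / (1.0 + math.exp(d))
```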
Fig. 8. Determination of the optimal points for concatenating two sign videos based on the distance and the tangent direction between the preceding and succeeding signs.

4 RESULTS AND DISCUSSION

Several experiments were conducted to evaluate the quality of the translation and synthesis of Taiwanese sign videos. The Top-N measure was applied to determine the translation quality. The hand trajectory was analyzed to show the smoothness of the synthesized videos. In a subjective evaluation of comprehension, deaf primary school students were invited to watch synthesized videos of test questions and give the corresponding answers. These evaluations are described in the following sections.
4.1 Evaluation of Translation Performance

In the evaluation of translation performance, the V-fold cross-validation technique [34] was used to ensure that the training data were utilized effectively. The corpus was split into 10 disjoint subsets, i.e., V = 10. Each subset, containing about 198 aligned parallel sentences, was used in turn as an independent test set, while the remaining V − 1 subsets, containing 1,785 aligned parallel sentences, were used for alignment model training. If the target sign sequence of interest was included in the Top-N candidate list of the sentences translated from a given test Chinese sentence, then the translation was considered correct. Fig. 9 displays the comparison between the proposed two-pass alignment approach and IBM Model 2. The vertical axis indicates the correct translation rate, and the horizontal axis indicates the number of candidates, represented as Top-N. The proposed approach outperformed IBM Model 2 and had a lower alignment complexity. The performance of IBM Model 2 was limited by the small corpus size in this task. The correct translation rates for Top-1 and Top-5 were 75.1 and 84.7 percent, respectively. Some sign sentences ranked among the Top-5 were also found to be acceptable in terms of TSL; these sentences with different word orders express different senses. When this factor is considered, the Top-5 correct translation rate in this experiment reaches 91.1 percent.
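The Top-N measure used here can be sketched in a few lines; translate is an assumed function returning a ranked list of candidate sign sequences for a Chinese input sentence.

```python
def top_n_accuracy(test_pairs, translate, n=5):
    """Fraction of test items whose reference sign sequence appears among
    the first n translation candidates."""
    correct = sum(1 for chinese, reference in test_pairs
                  if reference in translate(chinese)[:n])
    return correct / len(test_pairs)
```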
4.2 Evaluation of Sign Video Synthesis Performance

A TSL learning system was developed to evaluate the system performance in practice. Fig. 10 shows the interface of the TSL learning system. The system was implemented in Visual C++ 6.0 under the Windows 2000 platform with a Pentium 4 2.0 GHz CPU and 512 MB of RAM. Two input modes, text and speech, were provided. The system translates an input Chinese sentence into a TSL sign sequence and outputs the synthesized video sequence. Five profoundly deaf students (three male and two female native signers) in the sixth grade at a primary school in Tainan, Taiwan, evaluated the utility of the proposed approach as a practical learning aid.
4.2.1 Evaluation of Sign Video Concatenation

The distance and the tangent direction between the hand positions were considered in sign video concatenation, as described in Section 3.2. The effects of these factors were investigated by subjective and objective evaluations of four cases:

1. considering both distance and direction,
2. considering only direction,
3. considering only distance, and
4. direct concatenation.

The weighting factor $\omega$ in (16) was set to 0.5.
In the objective evaluation, the hand motion trajectory of a sign video sequence chosen from the sentence-level sign videos of the developed database was compared with the trajectories produced by the proposed approach under Cases 1, 2, 3, and 4. Fig. 11 shows the hand position curves of the synthesized frames (frames 13-23 and 37-43) for Cases 1, 2, and 3 and for the ground-truth sign video. The hand position is represented by X and Y coordinates and is tracked automatically by a hand position tracking algorithm. In this figure, Case 3, which considers only the distance of hand transitions, achieved the smallest root mean square (RMS) error. However, in this study, smoothness is defined as a metric considering both distance and direction. Accordingly, a subjective evaluation was further conducted to evaluate the naturalness of the concatenated sign video outputs. In the subjective evaluation, the viewers/subjects rated the synthesized sign videos using three ordinal opinions: good, fair, and poor. Thirty-five sign sequences, randomly selected from the aligned parallel corpus, were used as the test corpus for each subject. Fig. 12 shows the subjective evaluation results. A $\chi^2$-test yielded significance levels of p < 0.001, revealing significant differences among Cases 1, 2, 3, and 4. The horizontal axis indicates the subjective opinions, and the vertical axis indicates the histogram of the three opinions. The experimental results indicate that the synthesized sign videos for Case 1 outperformed those of the other cases.

Fig. 9. Comparison of the proposed modified alignment approach and IBM Model 2.

Fig. 10. Interface of the Taiwanese sign language learning system.

Fig. 11. An example of moving trajectories between two concatenated sign videos for four concatenation cases.
4.2.2 Case Study

A subjective evaluation, based on the subjects' reading comprehension of the synthesized sign videos, was undertaken. The performance evaluation used three ordinal opinions: good, fair, and poor. The text sentences for testing were chosen from questions in the teaching materials. Two test sets, containing 20 long sentences (average length = 8.2) and 20 short sentences (average length = 4.4), were also considered when investigating the performance of TSL translation. Fig. 13 displays the evaluation results. The horizontal axis indicates the subjective opinions, and the vertical axis indicates the histogram of the three opinions. The experimental results show that reading comprehension reached 75 percent, with 43 percent fair and 32 percent good opinions. The performance for short sentences was better than that for long sentences. A $\chi^2$-test yielded significance levels of p < 0.001, revealing significant differences between the results for the two sentence sets. The translation results indicate that a large number of phrase fragments in a long sentence leads to low translation accuracy. For time complexity evaluation, Table 6 lists the time requirements for both translation and video synthesis.
5 CONCLUSIONS

This work has presented an innovative approach to the joint optimization of TSL translation and sign video synthesis for teaching and learning TSL. In the translation part, an alignment-based translation model, modified from IBM Model 2, is presented to transliterate Chinese into TSL by estimating the alignment probabilities of the grammatical structure. Grammatical components are useful for modeling the transliterated correspondences and for reducing the perplexity of the alignment model. The long-distance information approach and the intersign epenthesis score can be used to acquire local relationships within TSL grammar at either the syntactic or the visual level and to reinforce the probability estimation of the bi-gram model. Experimental results on the aligned parallel corpus reveal that the proposed translation framework outperformed IBM Model 2.

The proposed sign video database provides rich information about motion transitions between sign videos for displaying the translation result. The proposed sign video concatenation approach searches for the optimal sequence of sign video clips among the possible translated TSL sequences by computing the maximum epenthesis score based on the distance and direction of the hand positions. Experimental results from the subjective and objective evaluations reveal that the proposed sign language synthesis approach outperformed the approach described in [14]. The case study also indicates that the proposed TSL learning system performed well for reading comprehension.

Current work on sign language processing focuses on building gesture recognition systems with large vocabularies. Grammatical inflection [11] is a difficult issue for the recognition of continuous signing sequences in these systems and also affects the recognizability of the synthesized results in sign synthesis. Different inflections of a sign were included in the sentences for sign video filming in this study. However, collecting enough data for all inflections is often difficult. Fortunately, signers can correct and, therefore, understand the synthesized sign output with inflection based on their knowledge and experience. Problems with inflections in translation from written language to sign language need to be studied in the future. To achieve this, the phonological constraints of TSL and the effects of sign inflection could be applied to improve the naturalness and fluency of sign expression.
Fig. 12. Subjective evaluation results of sign video synthesis.

Fig. 13. Case study results.

TABLE 6. Evaluation of Time Requirement
APPENDIX A

Table 7 shows the grammatical components and POSs defined in this study.

TABLE 7. Grammatical Components and POSs Defined in This Study
ACKNOWLEDGMENTS
The authors would like to thank the National Science Council
of the Republic of China, Taiwan, for financially supporting
this research under Contract No. NSC 94-2614-E-006-073.
REFERENCES
[1] S. Wilcox and P.P. Wilcox, Learning to See. Gallaudet Univ. Press,
1997.
[2] F. Alonso, A. Antonio, J.L. Fuertes, and C. Montes, “Teaching
Communication Skills to Hearing-Impaired Children," IEEE
Multimedia, pp. 55-67, 1995.
[3] C. Brown, “Assistive Technology Computers and Persons with
Disabilities,” Comm. ACM, vol. 5, pp. 36-46, 1992.
[4] D.L. Speers, “Representation of American Sign Language for
Machine Translation,” PhD dissertation, Graduate School of Arts
and Sciences, Georgetown Univ., 2001.
[5] L.L. Lloyd, D.R. Fuller, and H.H. Arvidson, Augmentative and
Alternative Communication: A Handbook of Principles and Practices.
Allyn and Bacon, Inc., 1997.
[6] C. Vogler and D. Metaxas, “Toward Scalability in ASL Recogni-
tion: Breaking Down Signs into Phonemes,” Lecture Notes in
Artificial Intelligence, vol. 1739, pp. 211-224, 1999.
[7] C. Vogler and D. Metaxas, “A Framework for Recognizing the
Simultaneous Aspects of American Sign Language,” Computer
Vision and Image Understanding, no. 81, pp. 358-384, 2001.
[8] T. Starner, J. Weaver, and A. Pentland, “Real-Time American Sign
Language Recognition Using Desk and Wearable Computer-
Based Video,” IEEE Trans. Pattern Analysis and Machine Intelligence,
vol. 20, no. 12, pp. 1371-1375, Dec. 1998.
[9] M.C. Su, Y.X. Zhao, H. Huang, and H.F. Chen, “A Fuzzy Rule-
Based Approach to Recognizing 3-D Arm Movements,” IEEE
Trans. Neural Systems and Rehabilitation Eng., vol. 9, no. 2, 2001.
[10] R. Liang, “Continuous Gesture Recognition System for Taiwanese
Sign Language,” PhD dissertation, Nat’l Taiwan Univ., 1997.
[11] S.C.W. Ong and S. Ranganath, "Automatic Sign Language
Analysis: A Survey and the Future beyond Lexical Meaning,”
IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 6,
pp. 873-891, June 2005.
[12] C.D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[13] W. Chou and B.H. Juang, Pattern Recognition in Speech and
Language Processing. CRC Press, 2003.
[14] P.F. Brown, S.A. Della Pietra, V.J. Della Pietra, and R.L. Mercer, “The
Mathematics of Statistical Machine Translation: Parameter Estima-
tion,” Computational Linguistics, vol. 19, no. 2, pp. 263-311, 1993.
[15] H. Ney, S. Niessen, F. Och, H. Sawaf, C. Tillmann, and S. Vogel,
“Algorithms for Statistical Translation of Spoken Language,” IEEE
Trans. Speech and Audio Processing, vol. 8, no. 1, pp. 24-36, 2000.
[16] R. Kennaway, “Synthetic Animation of Deaf Signing Gestures,”
Proc. Fourth Int’l Workshop Gesture and Sign Language Based Human-
Computer Interaction, 2001.
[17] A. Irving and R. Foulds, "A Parametric Approach to Sign
Language Synthesis,” Proc. SIGACCESS, pp. 212-213, 2005.
[18] Y. Chen, W. Gao, G. Fang, C. Yang, and Z. Wang, “CSLDS:
Chinese Sign Language Dialog System,” Proc. IEEE Int’l Workshop
Analysis and Modeling of Faces and Gestures, pp. 236-237, 2003.
[19] A.B. Grieve-Smith, “SignSynth: A Sign Language Synthesis
Application Using Web3D and Perl,” Proc. Gesture Workshop,
pp. 134-145, 2001.
[20] E.J. Holden, J.C. Wong, and R. Owens, “An Effective Sign
Language Display System,” Proc. Eighth Int’l Symp. Signal
Processing and Its Applications, vol. 1, pp. 54-57, 2005.
[21] O. Arikan and D.A. Forsyth, “Interactive Motion Generation from
Examples,” Proc. 29th Ann. Conf. Computer Graphics and Interactive
Techniques, pp. 483-490, 2002.
[22] Y. Li, T. Wang, and H.Y. Shum, “Motion Texture: A Two-Level
Statistical Model for Character Motion Synthesis,” ACM Trans.
Graphics, vol. 21, no. 3, pp. 465-472, 2002.
[23] L. Kovar, M. Gleicher, and F. Pighin, “Motion Graphs,” Proc. ACM
SIGGRAPH, pp. 473-482, 2002.
[24] J. Lee, J. Chai, P.S.A. Reitsma, J.K. Hodgins, and N.S. Pollard,
“Interactive Control of Avatars Animated with Human Motion
Data,” Proc. ACM SIGGRAPH, pp. 491-500, 2002.
[25] S.W. Kim, Z.X. Li, and Y. Aoki, “On Intelligent Avatar Commu-
nication Using Korean, Chinese and Japanese Sign-Languages: An
Overview,” Proc. Eighth Control, Automation, Robotics and Vision
Conf., vol. 1, pp. 747-752, 2004.
[26] Y. Cao, P. Faloutsos, E. Kohler, and F. Pighin, “Real-Time Speech
Motion Synthesis from Recorded Motions,” Proc. ACM SIG-
GRAPH Eurographics Symp. Computer Animation, pp. 347-355, 2004.
[27] T. Ezzat, G. Geiger, and T. Poggio, “Trainable Video-Realistic
Speech Animation,” Proc. ACM SIGGRAPH, vol. 21, pp. 388-397,
2002.
[28] C. Bregler, M. Covell, and M. Slaney, “Video Rewrite: Driving
Visual Speech with Audio,” Proc. ACM SIGGRAPH, pp. 353-360,
1997.
[29] F. Solina and S. Krapež, "Synthesis of the Sign Language of the
Deaf from the Sign Video Clips,” Electrotechnical Rev., vol. 66,
pp. 260-265, 1999.
[30] Ministry of Education, Division of Special Education, Changyong
Cihui Shouyu Huace (Sign Album of Common Words), vol. 1. Taipei:
Ministry of Education, 2000.
[31] “The Chinese Knowledge Information Processing Group, Analysis
of Chinese Part of Speech,” CKIP Technical Report, no. 93-05, Inst. of
Information Science, Academia Sinica, Taipei, 1993 (in Chinese).
[32] Z. Dong, The HowNet Web Site, http://www.keenage.com, 1999.
[33] Inst. of Linguistics, Nat’l Chung Cheng Univ., Chiayi, Taiwan,
Proc. Int’l Symp. Taiwan Sign Language Linguistics, http://
www.ccunix.ccu.edu.tw/~lngsign/tsl-links-e.htm, 2003.
[34] P.A. Lachenbruch and M.R. Mickey, “Estimation of Error Rate in
Discriminant Analysis,” Technometrics, pp. 1-11, 1968.
[35] S. Shott, Statistics for Health Professionals. W.B. Saunders, 1990.
[36] C.H. Wu, Y.H. Chiu, and C.S. Guo, “Text Generation from
Taiwanese Sign Language Using a PST-Based Language Model for
Augmentative Communication,” IEEE Trans. Neural Systems and
Rehabilitation Eng., vol. 12, no. 4, pp. 441-454, 2004.
[37] C.H. Wu, Y.H. Chiu, and K.W. Cheng, “Error-Tolerant Sign
Retrieval Using Visual Features and Maximum A Posteriori
Estimation,” IEEE Trans. Pattern Analysis and Machine Intelligence,
vol. 26, no. 4, pp. 495-508, Apr. 2004.
Yu-Hsien Chiu received the BS degree in
electrical engineering from I-Shou University,
Kaohsiung, Taiwan, ROC, in 1997, and the MS
degree in biomedical engineering from National
Cheng Kung University, Tainan, Taiwan, in
1999. He received the PhD degree from the
Department of Computer Science and Informa-
tion Engineering, National Cheng Kung Univer-
sity, Tainan, Taiwan, in 2004. His research
interests include speech and biomedical signal
processing, embedded system design, spoken language processing,
and sign language processing for the hearing-impaired.
Chung-Hsien Wu received the PhD degree in
electrical engineering from National Cheng Kung
University, Tainan, Taiwan, ROC, in 1991. Since
August 1991, he has been with the Department
of Computer Science and Information Engineer-
ing, National Cheng Kung University, Tainan,
Taiwan. He became a professor in August 1997.
He is currently the editor-in-chief for the Inter-
national Journal of Computational Linguistics
and Chinese Language Processing. His re-
search interests include speech recognition, text-to-speech, and multi-
media information retrieval. Dr. Wu is a senior member of IEEE. He is also
a member of International Speech Communication Association (ISCA)
and ROCLING.
Hung-Yu Su received the BS and MS degrees
from the Department of Computer Science and
Information Engineering, National Cheng Kung
University, Tainan, Taiwan, in 2001 and 2003,
respectively. Currently, he is a PhD student in the
Department of Computer Science and Informa-
tion Engineering, National Cheng Kung Univer-
sity, Tainan, Taiwan. His research interests
include natural language processing, machine
translation, and sign language processing for the hearing-impaired.
Chih-Jen Cheng received the BS degree in
information engineering from Yuan Ze Univer-
sity, Chung-Li, Taiwan, in 2001, and the MS
degree in information engineering from National
Cheng Kung University, Tainan, Taiwan, in
2003. His research interests include digital
signal processing, natural language processing,
machine translation, and image processing.