English-ASL Gloss Parallel Corpus 2012: ASLG-PC12
Achraf Othman, Mohamed Jemni
Research Laboratory LaTICE, University of Tunis
5, Av. Taha Hussein, B.P. 56, Bab Mnara, 1008 Tunis, Tunisia
E-mail: achraf.othman@ieee.org, mohamed.jemni@fst.rnu.tn
Abstract
A serious problem facing the community of researchers in the field of sign language is the absence of a large parallel corpus for
sign languages. The ASLG-PC12 project proposes a rule-based approach for building a large parallel corpus of English written
texts and American Sign Language glosses. We present a novel algorithm that transforms an English part-of-speech sentence into
an ASL gloss. This project was started at the beginning of 2011 as a part of the WebSign project, and it today offers a corpus
containing more than one hundred million pairs of sentences between English and ASL glosses. It is available online for free to
promote the development and design of new algorithms and theories for American Sign Language processing, for example
statistical machine translation and related fields. In this paper, we present the tasks for generating ASL sentences from the
Gutenberg Project corpus, which contains only English written texts.
Keywords: American Sign Language, Parallel Corpora, Sign Language
1. Introduction
To develop an automatic translator, or any other tool that requires a learning task, for Sign Languages, the major problem is the collection of parallel data between text and Sign Language. A parallel corpus contains large, structured texts aligned between a source and a target language. Such corpora are used for statistical analysis and hypothesis testing, for checking occurrences, and for validating linguistic rules within a specific domain. Since there is no standard and sufficiently large corpus for Sign Language (Morrissey & Way, 2007; Morrissey, 2008), developing statistical machine translation, which requires pre-processing before the learning process and therefore a large volume of data, remains difficult.
For these reasons, we started to collect pairs of sentences between English and American Sign Language gloss. Because ASL data is scarce while English written text is abundant, we developed a corpus based on a collaborative approach in which experts can contribute to collecting and correcting the bilingual corpus and to validating the automatic translation. Experts are people who are authorized to validate translations and correct suggested translations. The ASLG-PC12 project (Othman & Jemni, 2011) was started in 2010 as a part of the WebSign project (Jemni & El Ghoul, 2007), which develops tools that make information on the web accessible to deaf people. The main goal of WebSign is to develop a web-based interpreter of Sign Language (SL). This tool would enable people who do not know Sign Language to communicate with deaf individuals, thereby contributing to reducing the language barrier between deaf and hearing people. Our secondary objective is to distribute this tool on a non-profit basis to educators, students, users, and researchers, and to disseminate a call for contributions to support this project, mainly in its exploitation step, and to encourage its wide use by different communities.
In this paper, we review our experiences with constructing one such large annotated parallel corpus between English written text and American Sign Language gloss: the ASLG-PC12 (Othman & Jemni, 2011), a corpus consisting of over one hundred million pairs of sentences.
The paper is organized as follows. Section 2 reviews existing Sign Language corpora. Section 3 gives a brief description of American Sign Language gloss. Section 4 presents our methods and the pre-processing tasks for collecting data from the Gutenberg Project (Lebert, 2008): we present two pre-processing stages, in which each sentence is extracted and tokenized, and then our method and algorithms for constructing the second part of the corpus in American Sign Language gloss. The constructed texts were generated automatically by transformation rules and then corrected by human experts in ASL. We also describe the composition and the size of the corpus. Discussions and conclusion are drawn in Section 5.
2. Background
Several projects concerned with Sign Language have recorded or annotated their own corpora, but only a few of them are suitable for automatic Sign Language translation, because of the limited amount of data available for learning and processing. The European Cultural Heritage
Online organization (ECHO) published corpora for
British Sign Language (Woll, Sutton-Spence, & Waters,
2004), Swedish Sign Language (Bergman & Mesch,
2004) and the Sign Language of the Netherlands
(Crasborn, Kooij, Nonhebel, & Emmerik, 2004). All of
the corpora include several stories signed by a single
signer. The American Sign Language Linguistic
Research group at Boston University published a corpus
in American Sign Language (Athitsos et al., 2010). TV broadcast news for the hearing impaired is another source of sign language recordings: Aachen University published a German Sign Language corpus of the domain of weather reports (Bungeroth, Stein, Dreuw, Zahedi, & Ney, 2006). In 2010, Morrissey et al. (Morrissey, Somers, Smith, Gilchrist, & Dandapat, 2010) published a multimedia Sign Language corpus for machine translation. In the literature, we found many related projects aiming to build corpora for Sign Language. Most of them are based on video recordings, and no textual data is available for building a translation memory. Textual data for Sign Language is not a simple written form, because signs can carry other information such as eye gaze or facial expressions. So, for our corpus, we use glosses to represent Sign Language. In the next section, we present a brief description of glosses.
3. Glossing signs
Stokoe (Stokoe, 1960) proposed the first annotation system for describing Sign Language. Before that, signs were thought of as unanalyzed wholes, with no internal structure. The Stokoe notation system is used for writing American Sign Language with graphical symbols. Later, other notation systems appeared, such as HamNoSys (Prillwitz & Zienert, 1990) and SignWriting (Sutton & Gleaves, 1995). Glosses, in turn, are used to write signs in textual form. Glossing means choosing an appropriate English word for each sign in order to write it down; it is not translation, but it is similar to it. A gloss of a signed story can be a series of English words, written in small capital letters, that correspond to the signs of the ASL story. Some basic conventions used for glossing are as follows:
• Signs are represented with small capital letters in English.
• Lexicalized finger-spelled words are written in small capital letters and preceded by the '#' symbol.
• Full finger-spelling is represented by dashes between small capital letters (for example, A-C-H-R-A-F).
• Non-manual signals and eye gaze are represented on a line above the sign glosses.
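To make these conventions concrete, here is a minimal Python sketch (ours, not part of the paper). Plain text cannot render small capitals, so full capitals stand in for them, and the LEXICALIZED set is a hypothetical example:

```python
# Minimal sketch of the glossing conventions above (not from the paper).
# Uppercase stands in for small capitals; LEXICALIZED is hypothetical.

LEXICALIZED = {"bank", "ok"}  # hypothetical lexicalized finger-spelled words

def gloss_word(word: str, fingerspell: bool = False) -> str:
    """Render one English word according to the glossing conventions."""
    if fingerspell:
        # full finger-spelling: dashes between capital letters
        return "-".join(word.upper())
    if word.lower() in LEXICALIZED:
        # lexicalized finger-spelled words are preceded by '#'
        return "#" + word.upper()
    return word.upper()  # ordinary signs: capital letters

print(gloss_word("learn"))         # LEARN
print(gloss_word("bank"))          # #BANK
print(gloss_word("Achraf", True))  # A-C-H-R-A-F
```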
In this work, we use glosses to represent Sign Language.
In the next section, we will describe steps for building
our corpus.
4. English-ASL Parallel Corpus
4.1 Problematic issues
As we said at the beginning, the main problem in processing American Sign Language with statistical methods, such as statistical machine translation, is the absence of data (corpora), especially in gloss format. By convention, the meaning of a sign is written with a corresponding word of the spoken language, to avoid the complexity of understanding. For example, the phrase “Do you like learning sign language?” is glossed as “LEARN SIGN YOU LIKE?”. Here, the word “you” is replaced by the gloss “YOU”, and the word “learning” is glossed as “LEARN”. After the learning step, our machine translation system must generate the gloss sentence for a given English input.
4.2 Ascertainment and approach
Generally, in research on the statistical analysis of sign language, the corpus consists of annotated video sequences. In our case, we only need a bilingual corpus whose source language is English and whose target language is American Sign Language transcribed in glosses. In this study, we started from 880 words (English and ASL glosses) coupled with transformation rules. From these rules, we generated a bilingual corpus containing 800 million words. This corpus is not concerned with semantics or with the verb types used in sign language, such as “agreement” and “non-agreement” verbs. Figure 1 shows an example of transformation between an English written sentence and its generated ASL sentence. The input is “What did Bobby buy yesterday?” and the target sentence is “BOBBY BUY WHAT YESTERDAY?”. In this example, we keep the word “YESTERDAY”; in some references one finds “PAST” instead, which indicates that the action took place in the past. Likewise, the symbol “?” can be replaced by a facial animation accompanying “WHAT”. Our approach is based on the lemmatization of words, and we keep as much information as possible in the sentence so that further approaches can be developed on these corpora. Statistics of the corpora are shown in Table 1. The number of sentences and tokens is huge, and building the ASL corpus takes more than one week.
Figure 1: An example of transformation: English input
‘What did Bobby buy yesterday?’
Figure 2: Steps for building ASL corpora
The input of the system is an English sentence and the output is its ASL transcription in glosses. Table 2 shows only simple rules; complex rules can be defined starting from these simple ones, and a part-of-speech sentence can be defined for both languages. According to Figure 3, we check whether the rule for the sentence S exists in the database; if the algorithm returns true, we apply the transformation directly. Of course, all complex rules must be created by experts in ASL. Table 2 shows some transformations from English sentences to American Sign Language, with the transformation rules written by an expert in linguistics.
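As an illustration of this lookup, the following Python sketch (an assumption about how the rule database might be organized, not the authors' implementation) keys the full-sentence rules of Table 2 by their part-of-speech sentence:

```python
# Hypothetical layout of the transformation-rule database (an assumption):
# full-sentence rules keyed by their part-of-speech sentence.

RULES = {
    "VBP PRP DT NN .": "1_VBP 2_PRP 3_DT 4_NN 5_. → 4_NN 2_PRP 5_.",
    "VB PRP VB PRP": "1_VB 2_PRP 3_VB 4_PRP → 2_PRP 3_VB 4_PRP",
}

def find_rule(tagged):
    """tagged: (word, POS) pairs; returns the stored rule, or None."""
    key = " ".join(pos for _, pos in tagged)
    return RULES.get(key)  # None -> fall back to lemma-by-lemma transformation

tagged = [("are", "VBP"), ("you", "PRP"), ("a", "DT"), ("student", "NN"), ("?", ".")]
print(find_rule(tagged) is not None)  # True: apply the transformation directly
```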
          Corpus size English        Corpus size ASL gloss
          tokens      sentences      tokens      sentences
PART 1    280 M       13 M           280 M       13 M
PART 2    323 M       16 M           323 M       16 M
PART 3    549 M       27 M           549 M       27 M
PART 4    292 M       14 M           292 M       14 M
PART 5    150 M       7 M            150 M       7 M
Table 1. Size of the American Sign Language Gloss
Parallel Corpus 2012 (ASLG-PC12)
English sentence: what is your name?
ASL sentence: IX-PRO2 NAME, WHAT?
Transformation rule:
1_VBP 2_PRP 3_JJ 4_. → 2_PRP 0_DESC- 3_JJ 4_.
English sentence: Are you deaf?
ASL sentence: IX-PRO2 DESC-DEAF?
Transformation rule:
1_VBP 2_PRP 3_DT 4_NN 5_. → 4_NN 2_PRP 5_.
English sentence: are you a student?
ASL sentence: STUDENT IX-PRO2?
Transformation rule:
1_VBP 2_PRP 3_DT 4_NN 5_. → 4_NN 2_PRP 5_.
English sentence: do you understand him?
ASL sentence: IX-PRO2 UNDERSTAND IX-PRO3?
Transformation rule:
1_VB 2_PRP 3_VB 4_PRP → 2_PRP 3_VB 4_PRP
Table 2. Examples of full-sentence transformation rules
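The next sketch is ours, based on our reading of the rule notation in Table 2 (not the authors' code): each left-hand-side item is a position_POS-tag pair to match, and each right-hand-side item either copies the word at the given position or, with index 0, inserts a literal marker such as DESC-. The pronoun-to-gloss map is a hypothetical simplification:

```python
# Our reading of the Table 2 rule notation (an interpretation, not the
# authors' code). PRONOUN_GLOSS is a hypothetical simplification.

PRONOUN_GLOSS = {"you": "IX-PRO2", "him": "IX-PRO3"}

def parse_rule(rule):
    lhs, rhs = rule.split("→")
    pattern = [item.split("_", 1)[1] for item in lhs.split()]
    output = [item.split("_", 1) for item in rhs.split()]
    return pattern, output

def apply_rule(rule, tagged):
    """tagged: list of (word, POS) pairs; returns a gloss string or None."""
    pattern, output = parse_rule(rule)
    if [pos for _, pos in tagged] != pattern:
        return None  # this rule does not match the part-of-speech sentence
    pieces = []
    for idx, payload in output:
        if idx == "0":
            pieces.append(payload)  # literal marker inserted by the rule
        else:
            word = tagged[int(idx) - 1][0].lower()
            pieces.append(PRONOUN_GLOSS.get(word, word.upper()))
    # attach a trailing-dash literal (e.g. DESC-) to the gloss that follows it
    gloss, i = [], 0
    while i < len(pieces):
        if pieces[i].endswith("-") and i + 1 < len(pieces):
            gloss.append(pieces[i] + pieces[i + 1])
            i += 2
        else:
            gloss.append(pieces[i])
            i += 1
    return " ".join(gloss)

rule = "1_VBP 2_PRP 3_DT 4_NN 5_. → 4_NN 2_PRP 5_."
tagged = [("are", "VBP"), ("you", "PRP"), ("a", "DT"), ("student", "NN"), ("?", ".")]
print(apply_rule(rule, tagged))  # STUDENT IX-PRO2 ?
```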
Figure 2 describes the steps for transforming an English sentence into American Sign Language gloss. The input of the system is the English sentence. Using the CoreNLP tools, we generate an XML file containing morphological information about the sentence after tokenization. We then build the part-of-speech sentence and, using the transformation rules database, try to transform the input. In some cases, the part-of-speech sentence does not exist in the database; in that case, we transform each lemma individually. The transformation rule for lemmas is presented in Table 3. In the last step, an uppercasing script transforms the output. A transformation rule is not a direct transformation of each lemma: it can realign words and can ignore some English words (the, in, a, an, etc.).
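A minimal sketch of this lemma-level fallback follows (our assumption; the stop-word list is illustrative, not the project's). Note that the fallback keeps the English word order: reordering only happens when a full-sentence rule matches.

```python
# Lemma-level fallback (our assumption; the stop-word list is illustrative).
# Used when the part-of-speech sentence has no full-sentence rule.

SKIP = {"the", "a", "an", "in", "of", "do", "did", "be", "is", "are"}

def gloss_by_lemma(lemmas):
    """lemmas: (lemma, POS) pairs, e.g. taken from the CoreNLP XML output."""
    return " ".join(lemma.upper() for lemma, pos in lemmas
                    if lemma.lower() not in SKIP)

# "What did Bobby buy yesterday?" lemmatized by the tagger:
lemmas = [("what", "WP"), ("do", "VBD"), ("Bobby", "NNP"),
          ("buy", "VB"), ("yesterday", "NN"), ("?", ".")]
print(gloss_by_lemma(lemmas))  # WHAT BOBBY BUY YESTERDAY ?
```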
4.3 Transformation rules
Not all the transformation rules used to transform the English data were verified by experts in linguistics: we validated only 800 rules, plus the transformation rules for lemmas. We cannot validate all rules because there exists an infinite number of them. For this reason, we developed an application that allows experts to enter their rules, starting from an English sentence, without any coding. The application is a simple user interface built around the lemma transformation rule: the expert composes the lemmas, saves the result, and rebuilds the corpora. The corpus is thus built through a collaborative approach and validated by experts.
4.4 Collecting data from Gutenberg
Acquiring a parallel corpus for use in statistical analysis typically takes several pre-processing steps. In our case, there is not enough parallel data between English texts and American Sign Language, so we start by collecting only English data from the Gutenberg Project in order to transform it into ASL gloss. The Gutenberg Project (Lebert, 2008) offers over 38K free ebooks, and more than 100K ebooks through its partners. The collection task is carried out in the following steps (a sketch is given after the list):
• Obtain the raw data (by crawling all files in the FTP directory).
• Extract only the English texts, because there are ebooks in languages other than English, such as German and Spanish; we also found files containing DNA sequences.
• Break the text into sentences (sentence splitting).
• Prepare the corpora (normalization, tokenization).
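The following Python sketch (ours; the language heuristic and the file paths are assumptions, not the project's crawler) outlines these steps for raw text files already downloaded from a Gutenberg mirror:

```python
# Minimal sketch of the collection steps (an assumption, not the project's
# crawler): filter English ebooks, split into sentences, write one per line.

import re
from pathlib import Path

ENGLISH_HINTS = {"the", "and", "of", "to", "is"}  # crude heuristic, an assumption

def looks_english(text, threshold=0.05):
    """Very naive language filter: share of common English function words."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return False
    return sum(w in ENGLISH_HINTS for w in words) / len(words) >= threshold

def split_sentences(text):
    """Placeholder splitter; the project uses Splitta plus an abbreviation list."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def collect(raw_dir, out_path):
    with open(out_path, "w", encoding="utf-8") as out:
        for path in Path(raw_dir).glob("*.txt"):
            text = path.read_text(encoding="utf-8", errors="ignore")
            if not looks_english(text):
                continue  # skip non-English ebooks (German, Spanish, ...)
            for sentence in split_sentences(text):
                out.write(sentence + "\n")

# collect("gutenberg_raw/", "english_sentences.txt")  # hypothetical paths
```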
In the following, we describe in detail the pre-processing steps used to clean the collected data.
4.5 Sentence splitting, tokenization, chunking and parsing
Sentence splitting and tokenization require specialized tools for English texts. One problem in sentence splitting is the ambiguity of the period “.”, which can mark either the end of a sentence or an abbreviation. For English, we semi-automatically created a list of known abbreviations that are typically followed by a period. Issues with tokenization include English contractions such as “can’t” (which we transform to “can not”) and the separation of possessive markers (“the man’s” becomes “the man ’s”). We also use an available splitting tool called Splitta (Gillick, 2009), whose models are trained on Wall Street Journal news combined with the Brown Corpus, which is intended to be widely representative of written English; error rates on test news data are near 0.25%. In addition, we use the CoreNLP tools (Toutanova & Manning, 2000; Klein & Manning, 2003), a set of natural language analysis tools that take raw English text as input and give the base forms of words and their parts of speech.
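As a rough approximation of these two steps (the paper itself relies on Splitta and CoreNLP), the sketch below shows an abbreviation-aware splitter and the two tokenization fixes described above; the abbreviation list is a tiny illustrative sample:

```python
# Rough approximation of the splitting and tokenization fixes described
# above (ours; the paper uses Splitta and CoreNLP instead).

import re

ABBREVIATIONS = {"mr.", "mrs.", "dr.", "u.s."}  # tiny illustrative sample

def split_sentences(text):
    """Split on sentence-final punctuation, but not after known abbreviations."""
    sentences, buf = [], []
    for tok in text.split():
        buf.append(tok)
        if tok.endswith((".", "?", "!")) and tok.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(buf))
            buf = []
    if buf:
        sentences.append(" ".join(buf))
    return sentences

def tokenize(sentence):
    """Expand "can't" to "can not" and split the possessive marker "'s"."""
    out = []
    for tok in sentence.split():
        if tok.lower() == "can't":
            out += ["can", "not"]
        elif re.match(r".+'s$", tok):
            out += [tok[:-2], "'s"]  # "the man's" -> "the man 's"
        else:
            out.append(tok)
    return out

print(split_sentences("Mr. Smith arrived. He can't stay."))
# ['Mr. Smith arrived.', "He can't stay."]
print(tokenize("The man's dog can't swim"))
# ['The', 'man', "'s", 'dog', 'can', 'not', 'swim']
```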
4.6 Releases of the English-ASL Corpus
The initial release of this corpus consisted of data up to
September 2011. The second release added data up to January 2012, increasing the size from just over 800 sentences to more than 800 million words in English. A
forthcoming third release will include data up to early
2013 and will have better tokenization and more words
in American Sign Language. For more details, please
check the website (Othman & Jemni, 2011).
5. Discussions and conclusion
We have described the construction of the English-American Sign Language corpus and illustrated a novel method for transforming English written text into American Sign Language gloss. This corpus will be useful for statistical analyses of ASL. We present the first corpus for ASL gloss that exceeds one hundred million sentences, available to all researchers and linguists. During the next phase of the ASLG-PC12 project, we expect to provide both a richer analysis of the existing corpus and other parallel corpora (e.g., French Sign Language, Arabic Sign Language). This will be done by first enriching the rules through experts: enrichment will be achieved by automatically transforming the current transformation rules database, and then validating the results by hand.
6. References
Athitsos, V., Neidle, C., Sclaroff, S., Nash, J., Stefan, A.,
Thangali, A., et al. (2010, May 22-23). Large Lexicon
Project: American Sign Language Video Corpus and
Sign Language Indexing/Retrieval Algorithms. Proceedings of the 4th Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies, LREC.
Bergman, B., & Mesch, J. (2004). ECHO data set for
Swedish Sign Language (SSL). Department of
Linguistics, University of Stockholm.
Bungeroth, J., Stein, D., Dreuw, P., Zahedi, M., & Ney,
H. (2006). A German Sign Language Corpus of the Domain Weather Report. Fifth International Conference on Language
Resources and Evaluation (pp. 2000-2003). Genoa,
Italy.
Crasborn, O., Kooij, E. v., Nonhebel, A., & Emmerik, W.
(2004). ECHO data set for Sign Language of the
Netherlands (NGT). Department of Linguistics,
Radboud University Nijmegen.
Gillick, D. (2009). Sentence Boundary Detection and the
Problem with the U.S. Annual Conference of the North
American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, pp. 241-244.
Jemni, M., & El Ghoul, O. (2007). An avatar based
approach for automatic interpretation of text to Sign
language. 9th European Conference for the
Advancement of the Assistive Technologies in Europe.
San Sebastian.
Klein, D., & Manning, C. D. (2003). Accurate
Unlexicalized Parsing. Proceedings of the 41st
Meeting of the Association for Computational
Linguistics, pp. 423-430.
Lebert, M. (2008). Project Gutenberg (1971-2008).
University of Toronto & Project Gutenberg.
Morrissey, S. (2008). Assistive translation technology
for deaf people: translating into and animating Irish
sign language. 12th International Conference on
Computers Helping People with Special Needs. Linz.
Morrissey, S., & Way, A. (2007). Joining hands:
developing a sign language machine translation
system with and for the deaf community. Conference
and Workshop on Assistive Technologies for People
with Vision and Hearing Impairments: Assistive
Technology for All Ages. Granada.
Morrissey, S., Somers, H., Smith, R., Gilchrist, S., &
Dandapat, S. (2010, May). Building a Sign Language
corpus for use in Machine Translation. Proceedings of
the 4th Workshop on Representation and Processing of
Sign Languages: Corpora for Sign Language
Technologies, pp. 172-177.
Othman, A., & Jemni, M. (2011). American Sign
Language Gloss Parallel Corpus 2012 (ASLG-PC12).
Retrieved from http://www.achrafothman.net/aslsmt
Prillwitz, S., & Zienert, H. (1990). Hamburg notation
system for sign language: Development of a sign
writing with computer application. International
Studies on Sign Language and Communication of the
Deaf (pp. 355–379). Hamburg, Germany: Signum
Press.
Stokoe, W. (1960). Sign Language Structure: An Outline
of the Visual Communication Systems of the American Deaf. Linstok Press, Silver Spring.
Sutton, V., & Gleaves, R. (1995). SignWriter: the world's first sign language processor. Deaf Action Committee for SignWriting.
Toutanova, K., & Manning, C. D. (2000). Enriching the
Knowledge Sources Used in a Maximum Entropy
Part-of-Speech Tagger. Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 63-70.
Woll, B., Sutton-Spence, R., & Waters, D. (2004).
ECHO data set for British Sign Language (BSL).
Department of Language and Communication Science,
City University (London).