Lancaster University
Department of Linguistics and English Language
MA in Language Testing (by Distance)
Development and Evaluation of Two Vocabulary Tests for Adult
Learners of Swiss German Sign Language
Tobias Haug
6 September 2017
[The appendices (90 pages) have been removed by the author. When you have
questions about specific appendices, please send me an email at
tobias.haug@slas.ch]
Word count: 19,998
Abstract
With the alignment of Swiss German Sign Language (Deutschschweizerische
Gebärdensprache, DSGS) curricula in tertiary education to the Common European
Framework of Reference for Languages (CEFR), the need for reliable and valid
assessment instruments for adult learners arose.
With the focus on beginning adult learners of DSGS and on the notion of
vocabulary knowledge contributing to overall language proficiency, two vocabulary
size tests for DSGS have been developed and evaluated. The dissertation at hand reports
on this work. The first test is a web-delivered self-report; the second is a translation test from German to DSGS, in which responses are video-recorded for later analysis. For both tests, the same set of items is used. The items have been sampled from
existing DSGS teaching materials.
For the main study, 20 adult learners of DSGS aged 24 to 55 (M = 39.3) were
recruited as test takers. An item analysis of the test results yielded candidates for
removal from the item set. An analysis of internal consistency (Cronbach’s Alpha)
showed good results for both tests (>.90). Similarly, inter-rater reliability (Cohen’s
Kappa) of the translation test indicated promising results. Evidence contributing to
content and face validity was collected based on the sampling method of the test items
and feedback from the test takers, respectively. Due to the lack of a second DSGS
vocabulary test that could be used to establish concurrent validity, external variables
(self-assessment of DSGS skills, number of DSGS courses attended, DSGS learning
contexts) were identified and investigated as possible external criteria contributing to
the performance of the test takers. Only one variable, number of DSGS courses attended,
showed a statistically significant correlation with the test results.
The results of this study confirm findings from studies of spoken language
assessment and will contribute to future research in sign language testing and
assessment.
Acknowledgement
First of all, I would like to thank my supervisor Aaron Batty for his continuous
support and feedback via Skype and e-mail over the last months.
Furthermore, I would like to express my gratitude to the participants of the pilot
and main study; the study would not have been possible without them. My thanks also
go to the sign language instructors who participated in the focus group to provide
feedback on the two tests. I would also like to thank partner organizations like the Swiss
Deaf Association, the professional organization of sign language instructors (BGA), the
schools for the Deaf in German Switzerland, and individuals who were very supportive
in the process of recruiting the study participants.
Additionally, I would like to thank my two Deaf colleagues Katja Tissi and
Sandra Sidler-Miserez for their invaluable contribution towards developing the two
tests. I am also indebted to my colleagues in the SMILE project for allowing me to use their list of vocabulary items for my research project. Thanks also to Penny Boyes Braem for
providing very useful feedback at different stages of the project. I also thank all the other
individuals who provided feedback at some point of this project.
My biggest thanks go to my wife Sarah. Sarah was very supportive during the whole process of completing my dissertation, and she was also an important person for discussing aspects of my project along the way, as we share an interest in sign languages. Thank you so much!
Table of Contents
1. Introduction ............................................................................................................ 1
2. Literature Review .................................................................................................... 4
2.1 Vocabulary Knowledge and Assessment ........................................................... 4
2.1.1 Dimensions of vocabulary assessment ........................................................ 4
2.1.2 Basic unit of vocabulary assessment ........................................................... 5
2.1.3 Aspects of vocabulary knowledge: Size and depth ...................................... 6
2.1.4 Receptive and productive skills ................................................................... 6
2.2 Different Forms of Vocabulary Assessment Instruments.................................... 8
2.2.1 Yes/No tests ............................................................................................... 8
2.2.2 Translation tests and scoring issues ........................................................... 11
2.3 Structure, Acquisition, Learning, and Assessment of Sign Languages, with
Particular Emphasis on DSGS ............................................................................... 11
2.3.1 Basic structure of sign languages .............................................................. 12
2.3.2 Sign language acquisition and learning ..................................................... 17
2.3.3 Sign language tests ................................................................................... 17
2.4 Test Takers’ Feedback in Language Testing Research ..................................... 18
2.5 Reliability and Validity ................................................................................... 19
2.6 Chapter Summary and Research Questions ...................................................... 20
3. Methodology ......................................................................................................... 21
3.1 Development of Instruments ............................................................................ 22
3.1.1 Item selection for the two vocabulary tests ................................................ 23
3.1.2 Yes/No Test for DSGS ............................................................................. 25
3.1.3 L1/L2 Translation Test and scoring instrument for DSGS ......................... 27
3.1.4 Background questionnaires ....................................................................... 29
3.1.5 Feedback questionnaire for test takers ....................................................... 30
3.1.6. Consent forms and information about the project ..................................... 30
3.2 Recruiting Study Participants: Test Takers and Raters ..................................... 31
3.3 Pilot Studies .................................................................................................... 31
3.3.1 Pilot study 1.............................................................................................. 31
3.3.2 Pilot study 2.............................................................................................. 33
3.4 Main Study ...................................................................................................... 34
3.4.1 Participants ............................................................................................... 34
3.4.2 Procedure ................................................................................................. 35
3.5 Data Analysis .................................................................................................. 35
3.5.1 Statistical assumptions .............................................................................. 35
3.5.2 Descriptive statistics of the test scores ...................................................... 37
3.5.3 Item analysis of the test scores .................................................................. 37
3.5.4 Evidence of reliability in the two tests ...................................................... 38
3.5.5 Evidence of validity in the two tests .......................................................... 38
3.5.6 Variables contributing to the performance of the test takers in the Yes/No
Test and Translation Test .................................................................................. 39
3.5.7 Comparison of the performances of the test takers on both tests ................ 40
3.5.8 Nonsense signs of the Yes/No test ............................................................ 40
4. Results .................................................................................................................. 41
4.1 Descriptive Statistics of the Test Scores .......................................................... 41
4.1.1 Descriptive statistics of the YN Test scores............................................... 41
4.1.2 Descriptive statistics of the Translation Test scores................................... 41
4.2 Item Analysis of the Test Scores ...................................................................... 42
4.2.1 Item analysis of the YN Test scores .......................................................... 42
4.2.2 Item analysis of the Translation Test scores .............................................. 42
4.3 Evidence of Reliability in the Two Tests ......................................................... 43
4.3.1 Internal consistency of the two tests .......................................................... 43
4.3.2 Inter-rater reliability of the Translation Test .............................................. 44
4.4 Feedback from Test Takers.............................................................................. 45
4.5 Evidence Contributing to the Vocabulary Tests’ Validity: Performance of the
Test Takers on the Yes/No Test in Relation to External Variables ......................... 47
4.5.1 Test takers’ self-assessment in relation to YN Test scores ......................... 47
4.5.2 Test takers’ number of courses attended in relation to YN Test scores ...... 47
4.5.3 Test takers’ contexts of learning in relation to YN Test scores .................. 48
4.6 Evidence Contributing to the Vocabulary Tests’ Validity: Performance of the
Test Takers on the Translation Test in Relation to External Variables ................... 48
4.6.1 Test takers’ self-assessment in relation to Translation Test scores ............. 48
4.6.2 Test takers’ number of courses attended in relation to Translation Test
scores ................................................................................................................ 49
4.6.3 Test takers’ contexts of learning in relation to Translation Test scores ...... 49
4.7 Comparison of the Performances of the Test Takers on both Vocabulary Tests 50
4.8 Nonsense Signs of the YN Test: False Alarm Rate .......................................... 51
5. Discussion ............................................................................................................. 52
5.1 Item Analysis of the YN Test and Translation Test .......................................... 52
5.2 Evidence of Reliability in the Two Tests ......................................................... 53
5.2.1 Measure of internal consistency: Cronbach’s Alpha .................................. 53
5.2.2 Inter-rater reliability: Cohen’s Kappa ........................................................ 54
5.3 Evidence of Validity in the Two Tests ............................................................. 55
5.3.1 Content validity ........................................................................................ 55
5.3.2 Face validity ............................................................................................. 55
5.4 Test Performances on the Two Tests ............................................................... 56
5.4.1 External variables contributing to the test takers’ performance .................. 56
5.4.2 Comparing test scores of both tests ........................................................... 56
5.5 Effectiveness of the Nonsense Signs ................................................................ 57
6. Conclusion and Implications for Future Research .................................................. 58
7. References ............................................................................................................ 60
Table of Figures
Figure 1: The lexical sign AUTO (‘car’) (Boyes Braem, 1995; reprinted with permission of the author) .......... 13
Figure 2: The six unmarked handshapes (Boyes Braem, 1995; reprinted with permission of the author) .......... 14
Figure 3: Model of sign language lexicon (according to Johnston & Schembri, 2007) .......... 15
Figure 4: Yes/No Test Practice Item 1 WIDERSPRUCH (‘contradiction’) (post-pilot version with slightly changed wording of the task) .......... 27
Figure 5: Sample slide from the L1/L2 Translation Test showing Item 10, the German word “Strasse” (‘street’) and the sentence “Diese Strasse ist neu.” (‘This street is new.’) .......... 28
List of Tables
Table 1: Overview of Studies and Instruments, with Section Numbers .......... 21
Table 2: Number of Courses Attended by Test Takers (n = 17, 3 Missing Values) .......... 34
Table 3: Landis and Koch Kappa Benchmark Scale (from Gwet, 2014, p. 166) .......... 38
Table 4: Cronbach’s Alpha in the YN Test and the Translation Test Scores .......... 43
Table 5: Agreement and Disagreement of the Two Raters on the Translation Test .......... 44
Table 6: Correlation of Statement 3 “The test matches my DSGS skills well” and Test Scores, by Test .......... 46
Table 7: Correlation Translation Test Scores and Self-Assessment, by Rater (N = 20) .......... 48
Table 8: Correlation Translation Test Scores and Number of Courses Attended, by Rater (n = 17) .......... 49
Table 9: Correlation Translation Test Scores and Contexts of Learning, by Rater (N = 20) .......... 49
Table 10: Mean Scores and Percentages of YN Test and Translation Test (max. Score on Both Tests = 98) .......... 50
Table 11: Correlation YN Test Scores and Translation Test Scores, by Rater (N = 19) .......... 50
Table 12: Hits and False Alarms of the YN Test (N = 19) .......... 51
Table 13: Items to be Removed from the Translation Test and YN Test .......... 52
1. Introduction
The implementation and use of the Common European Framework of Reference for Languages (CEFR; Council of Europe, 2009) are a rather new development
in the field of sign language learning as a second or foreign language in tertiary
education. It has only been with recent attempts to align sign language curricula to the
CEFR that the development of assessment instruments to evaluate adult learners of a
sign language has become more visible. Evidence for this includes European projects such as
D-Signs (Leeson & Grehan, 2009) or ProSign: Sign Language for Professional
Purposes (Leeson, Haug, Rathmann, Van den Bogaerde, & Sheneman, submitted), a
project taking place at the European Centre for Modern Languages in Graz.
We find evidence also in the German-speaking part of Switzerland, where
aligning existing Swiss German Sign Language (Deutschschweizerische
Gebärdensprache, DSGS) curricula to the CEFR has become an important topic and,
subsequently, the assessment of adult learners has gained more attention. In
Switzerland, three different sign languages are used (Boyes Braem, Haug, & Shores,
2012). In the German-speaking area, DSGS is the primary language of approximately
5,500 Deaf¹ sign language users and a second language to approximately 13,000 hearing persons (Boyes Braem, 2012a). The group of hearing learners includes hearing children of Deaf adults, sign language interpreters, and teachers of the Deaf, as well as people who use DSGS for a wide variety of other professional or personal reasons.
There are different DSGS course providers in the German-speaking part of
Switzerland. By far the largest provider is the Swiss Deaf Association, the national
umbrella organization for the Deaf, which offers standard courses at seven different levels. Each level corresponds to 30 contact hours spread out over one semester.
In addition, the Swiss Deaf Association also provides special smaller courses that vary
in their number of contact hours. DSGS learning also takes place in other contexts, for
example, in the sign language interpreting training program in Zurich or at schools for
Deaf children, where in-service teachers receive further training in DSGS.
Implementing the CEFR in these different contexts is an ongoing process involving collaboration between the different stakeholders. Currently, the most pressing issue regarding this effort is the lack of reliable and valid DSGS tests that could be used to assess this varied group of learners.

¹ It is a widely recognized convention to use upper-case Deaf to describe members of the linguistic community of sign language users and, in contrast, lower-case deaf to describe individuals with the audiological condition of a hearing impairment, not all of whom might be sign language users (Morgan & Woll, 2002).
One aspect that renders the development and research of sign language tests
difficult for many sign languages is that most sign languages are under-documented and
under-resourced and therefore have, e.g., no corpora or reference grammars at their
disposal. This is also true for DSGS: No balanced and representative DSGS corpus
exists, and no reference grammar has been developed. Examples of sign languages for
which corpora of various sizes have been compiled include British Sign Language
(BSL; Fenlon, Schembri, Rentelis, Vinson, & Cormier, 2014), Australian Sign
Language (Auslan; Johnston & Schembri, 2006), and German Sign Language (Deutsche
Gebärdensprache, DGS; Hanke, 2016, 2017). An example of a well-researched sign
language having a reference grammar is New Zealand Sign Language (NZSL). For
NZSL, test developers can draw on this reference grammar (McKee, 2015) for
operationalization of the underlying construct of a test. For all other sign languages,
however, no such reference grammar is available (Palfreyman, Sagara, & Zeshan,
2015). An example of a sign language that has been widely researched overall is
American Sign Language (ASL), for which there are studies available addressing
aspects such as basic word order (e.g., Valli & Lucas, 1995). A comparable amount of
research for DSGS is not available.
Possessing little scientific knowledge about a sign language poses a great
challenge for anyone constructing tests for the respective sign language (for a discussion
on methodological issues involved here, see Haug, 2011; Haug & Mann, 2008). One
approach that has been reported in the literature is for test developers to use research
from other sign languages to a certain degree as the basis for their test development
(e.g., Fehrmann, Huber, Jäger, Sieprath, & Werth, 1995a, 1995b).
Research on sign language test development and use is itself a rather young field
within the domains of sign linguistics and applied linguistics (Haug, 2015).
Consequently, there is as yet no tradition of framing sign language proficiency within a model of communicative language ability (CLA) comparable to the longer tradition that exists for spoken languages.²

² Spoken languages are languages that are not signed, whether represented in written form or as speech.
The view of linguistic competence in the Chomskian sense as the knowledge of a formal linguistic system at the level of grammar has been extended to also include the application of this knowledge in language performance, or actual language use (e.g.,
Brown & Hudson, 2002). In the course of this development, different theoretical models
of communicative competence have been proposed and developed in the last decades
within the frameworks of foreign language learning and assessment (e.g., Bachman,
1990; Canale & Swain, 1980). Even though these models differ to a certain degree, there
seems to be an agreement that CLA is multi-componential and that different components
might develop differently in learners of a language (Purpura, 2008). Within a CLA
model, vocabulary knowledge is considered part of a larger construct and not seen as a
separate ability of overall language proficiency (e.g., Read & Chapelle, 2001). The
importance of vocabulary knowledge has thus been recognized in the literature on second language acquisition and assessment (Beglar & Nation, 2014; Read, 2000) as a crucial variable contributing to learners’ overall language proficiency (e.g., Koizumi & In’nami, 2013; Meara & Fitzpatrick, 2000; Read & Chapelle, 2001).
The focus in the present study is on developing vocabulary tests for DSGS. This was motivated by the researcher’s general interest in vocabulary assessment, but also by the opportunity to use an existing list of vocabulary items, which were sampled from existing DSGS teaching materials and target learners at CEFR level A1. Therefore, the goals of this MA dissertation project are: (1) to develop two DSGS vocabulary tests for CEFR level A1 and (2) to evaluate these tests.
2. Literature Review
Prior to the development of any language test, its purpose must be defined. The
purpose of a test can be, for example, that of placement into a specific language course,
or a test can represent an achievement test at the end of a semester, a progress test during
the semester at a university, or a diagnostic test to identify areas in the language skills
of students that need more attention (Brown & Abeywickrama, 2010). The purpose of
the DSGS vocabulary tests that are the subject of this MA dissertation is to function as
placement and diagnostic tests for DSGS learners not only at the university level but
also in other learning contexts in German Switzerland.
A fundamental goal of language tests is to “make inferences about language
ability” of learners (Bachman & Palmer, 1996, p. 95). These inferences might lead to
decisions that have different types of impact on people, some being high-stakes
decisions, such as hiring someone or admitting someone into a training program, and
some being low-stakes decisions, such as passing a student in an exam at the end of a
language course (Bachman & Palmer, 1996).
The specific aspects from the literature that are taken into consideration for the
development of the DSGS vocabulary tests here will be discussed in the remainder of
this chapter, which is organized in the following way: In Section 2.1, key concepts and
issues of vocabulary assessment are presented and, in Section 2.2, different vocabulary
test formats. In Section 2.3, the basic structure of DSGS insofar as is relevant for the
vocabulary test is discussed, followed by Section 2.4 with a discussion of test takers’
involvement in language testing research, Section 2.5 with issues of reliability and
validity, and Section 2.6 with a summary and the specific research questions.
2.1 Vocabulary Knowledge and Assessment
2.1.1 Dimensions of vocabulary assessment
Read (2000) discusses two contrasting perspectives of vocabulary assessment,
the first of which is that vocabulary should be assessed in an integrated manner, i.e.,
embedded in language use tasks, framed within a model of CLA (e.g., Bachman, 1990).
The second and contrasting perspective is that of treating words as isolated units, where
single words are assessed by means of discrete-point testing. Today, discrete-point testing (or rather, its underlying construct) is criticized for contradicting the broader view of language proficiency within a model of CLA described above.
Read (2000) outlines three dimensions as being relevant for vocabulary
assessment. The first dimension deals with the underlying construct of vocabulary
assessment, which can be discrete or embedded. The construct of discrete tests frames
vocabulary knowledge as a distinct entity, separate from other language abilities. In
contrast, embedded vocabulary assessments are part of a larger construct. For example,
Bachman and Palmer (1996) address the construct of academic writing, where test
takers write a letter and vocabulary is one aspect among others on an analytic scale that
is used for scoring. The second of Read’s (2000) three dimensions distinguishes
between selective and comprehensive tests. Selective vocabulary tests focus on a
specific set of items from a word list, while comprehensive tests constitute, for example,
a writing task in which the test takers’ entire vocabulary range is evaluated, not just a
predetermined set of vocabulary items. The third dimension categorizes tests as context-
independent or context-dependent. In a context-independent measure, test takers
respond to an item without any context; examples are simple translation tasks or picture-
naming tasks. In contrast, context-dependent tests require the test taker to “take account
of contextual information in order to produce the expected response” (Read, 2000, p.
9).
2.1.2 Basic unit of vocabulary assessment
Prior to developing a vocabulary test, it is important to define what the basic unit
is that will be evaluated in the test. As Read (2000) points out, English words like the,
a, to, and (i.e., articles, prepositions, pronouns, and conjunctions) are function words
that belong to the grammar rather than to the vocabulary of a language. Content words,
like nouns, verbs, adjectives, and adverbs, are examples of vocabulary. Another
important distinction concerns the terms type and token. While types are the different
(unique) word forms (or signs) of a corpus,³ tokens are all occurrences of word forms in the corpus (Nation, 2016; Read, 2000). A further distinction is that between word forms and word families: Word families consist of the base form of, for example, an English word like leak, as well as its inflected forms (leaks, leaking) and its derived forms (e.g., leaky, leakage), which can belong to different word classes (Read, 2000). This distinction is important for measuring vocabulary size in that test developers need to decide whether they test word families or word forms when estimating the vocabulary size of a learner. Read (2000) proposes word families as the basic unit for vocabulary size tests. The basic units of sign languages will be discussed in Section 2.3.1.

³ A corpus is, broadly speaking, a collection of texts (Sinclair, 2005), where text can refer to original written language, transcriptions of speech, or transcriptions of signing.
2.1.3 Aspects of vocabulary knowledge: Size and depth
A distinction often applied to vocabulary knowledge is that between vocabulary
size and vocabulary depth (e.g., Read, 2000; Schmitt, 2014). While size seems to be
straightforward to define, namely as the number of words (or word families) a learner
knows (e.g., Meara, 1996; Read, 2000; Schmitt, 2014), and is often tested through
vocabulary size tests, vocabulary depth is defined as how well someone knows a word.
The concept of vocabulary depth is commonly divided into different parts (e.g., Read,
2000). A well-known approach, following Nation (2001), is to distinguish between
form, meaning, and use, with features of orthography, phonology, morphology, syntax,
collocations, and pragmatics, both with respect to receptive and productive skills.
Nation (2001) also addresses the issue of high- and low-frequency words: the more often a learner is exposed to a word or a sign, the earlier he or she will learn it. This is relevant for test construction.
The distinction between vocabulary size and depth can also be applied to DSGS,
but with some modifications due to the absence of a widely accepted conventionalized
writing system for sign languages (Boyes Braem, 2012b). For the purpose of this
project, the focus will be on the development of vocabulary size measures.
2.1.4 Receptive and productive skills
A further distinction that has often been applied is that between testing
vocabulary receptively and productively, which has traditionally meant the testing of
comprehension and production of a spoken or written form (Schmitt, 2014; Stewart,
2012). The terms active and passive are often used as synonyms for productive and
receptive, respectively (Nation, 2001).
Read (2000) proposes different kinds of distinctions, that between recognition
and recall on the one hand and that between comprehension and use on the other hand.
Recognition according to Read means “that the test-takers are presented with the target
word and are asked to show that they understand its meaning, whereas in the case of
recall they are provided with some stimulus designed to elicit the target word from their
memory” (Read, 2000, p. 155). An example of recall is where test takers see a word in
their first language (L1) and are asked to provide a corresponding word in their second
language (L2). The aspect of recognition and recall as part of word knowledge can be
tested through discrete, selective, and context-independent items (Read, 2000). Read
further distinguishes between comprehension and use. Comprehension is achieved as
soon as learners understand “a word when they encounter it in context while listening
or reading, whereas use means that the word occurs in their own speech or writing”
(Read, 2000, p. 156). The difference between recognition and recall on the one hand
and comprehension and use on the other is given by the context, which is a word-level
context in the former case and a broader context in the latter.
Another approach is to not look at reception and production but at the strength
of vocabulary knowledge (Laufer & Goldstein, 2004; Laufer, Elder, Hill, & Congdon,
2004). For example, Laufer et al. (2004) describe how a translation task from a test
taker’s L2 into his or her L1 can be considered as testing active knowledge, since
“supplying the meaning requires the test-taker to produce something” (p. 205).
However, in some circumstances, this also corresponds to testing passive knowledge,
“since the test-takers are required only to demonstrate comprehension of the meaning
of the word, rather than availability of the word form within their repertoire” (p. 205).
In order to overcome this dichotomy, Laufer and Goldstein (2004) suggest four different
categories of strength of knowledge, which represent a hierarchy from the weakest to
the strongest form:
(1) Passive or meaning recognition (prompt L2 word, select its meaning from
four options in the learner’s L1),
(2) Active or form recognition (prompt L1 word, select among four L2 words,
one target and three distractors)
(3) Passive or meaning recall (prompt L2 word, L1 translation)
(4) Active or form recall (prompt L1 word, translation into L2).
This theoretical hierarchy of strength of vocabulary knowledge was confirmed
in an empirical study by Laufer and Goldstein (2004).
For the purpose of the present study, the distinction between receptive/passive
and productive/active vocabulary testing will be used as introduced at the beginning of
this section.
2.2 Different Forms of Vocabulary Assessment Instruments
Depending on whether receptive or productive vocabulary knowledge should be
tested, different forms of instruments are typically used. Frequently used forms for
vocabulary assessments are, for example, checklists (e.g., Yes/No tests), matching tests,
e.g., receptive items in which a test taker is asked to match a target word with other
related words or short definitions, or translation from the L1 into the L2 or vice versa
(Kremmel & Schmitt, 2016; Read, 2007).
An example of a receptive vocabulary size test is the Vocabulary Levels Test
(Nation, 2001), where the test taker sees six words and three short definitions. The test
taker has to match three out of the six words to the definitions (Schmitt, Schmitt, &
Clapham, 2001). The words are sampled from existing word frequency lists. An
example of a productive vocabulary size test is the Production Level Test (Laufer &
Nation, 1999), where the test taker has to complete underlined words in a sentence
(Nation, 2001). An example of a checklist or Yes/No test (YN test) is the Eurocentres
Vocabulary Test (Meara & Jones, 1988), in which the test taker confirms knowing or
not knowing a word. YN tests have the shortcoming that the test takers’ knowledge
cannot be verified (Stubbe, Stewart, & Pritchard, 2010; Stubbe, 2015). Therefore, for
the purpose of the project at hand, a productive vocabulary test will be included in
addition to a YN test.
In the following sections, only the two types of vocabulary size tests that are
considered for the dissertation project are discussed.
2.2.1 Yes/No tests
YN tests have been reported as a widely used measure for receptive vocabulary
size (e.g., Beeckmans, Eyckmans, Janssens, Dufranne, & Van der Velde, 2001; Beglar
& Nation, 2014; Pellicer-Sánchez & Schmitt, 2012; Read, 1993; Stubbe & Stewart,
2012). YN tests have often been used as placement tests (e.g., Harrington & Carey,
2009; Laufer & Goldstein, 2004; Laufer & Nation, 1999; Read, 2000) or diagnostic tests
(e.g., Read, 2007; Sevigny & Ramonda, 2013). These kinds of tests are time-effective,
easy to administer, and allow for testing a large number of words in a short amount of
time (Harrington & Carey, 2009; Meara, 1996; Read, 1993, 2007; Stubbe & Stewart,
2012). Completing a YN test places only limited demands on the test taker (Pellicer-
Sánchez & Schmitt, 2012). The basic design of YN tests consists of a test taker seeing
a word and indicating whether he/she knows the word (e.g., Beeckmans et al., 2001).
Items of YN tests can be sampled from word frequency lists, like the Academic Word
List (e.g., Gardner & Davies, 2013), or different 1,000-levels word lists from the British
National Corpus (e.g., Nation, 2004). The instructions of a YN test may also indicate
that there are pseudowords included (e.g., Stubbe et al., 2010).
YN tests have been criticized in the past. One criticism is the possibility of
overestimation (i.e., “falsely claiming knowledge of real words”, Stubbe et al., 2010, p.
4) and guessing on the part of the test takers, i.e., a test taker can rate more words with
“yes” than he/she actually knows (e.g., Read, 1993, 2000). In order to compensate for
this effect, Anderson and Freebody (1983) added pseudowords to their YN test.
Pseudowords are phonologically possible forms of a language (Mochida & Harrington,
2006). The terms pseudowords and nonwords are sometimes used interchangeably (for
a discussion see Sevigny and Ramonda, 2013). For the DSGS YN test in the present
study, the term nonsense signs will be used (Mann, Marshall, Mason, & Morgan, 2010).
Nonsense signs are phonologically plausible forms of a sign language that bear no
meaning in that language.
Scores of YN tests have often been used in conjunction with other measures of
vocabulary size knowledge (e.g., Vocabulary Levels Test; Nation, 2001) to control for
the responses of the YN test (e.g., Pellicer-Sánchez & Schmitt, 2012, Stubbe, 2012,
2015; Stubbe et al., 2010). The picture that emerges is very diverse. For example, Meara
and Buxton (1987) report a statistically significant correlation (r = .703) between the
scores of a YN test and a vocabulary multiple-choice test. A similar picture can be found
in the study from Anderson and Freebody (1983) (r = .84) or Mochida and Harrington
(2006), who applied different scoring methods for the YN test with a range of significant
correlations with the Vocabulary Levels Test (r = .85 to .88). However, other studies
show a different picture. For example, Cameron (2002) reports no significant correlation
between YN test scores and scores from the Vocabulary Levels Test across different
1,000-word levels, with Spearman rho ranging from .15 to .45. Similar results can be
found in the study of Eyckmans (2004), where no significant correlations between YN test scores and translation test scores emerged (r = .03 to .05). The aspect of
correlation will also be assessed in the present study.
Correlations should ideally also be assessed for their shared variance (Green,
2013; Pallant, 2016). Only one of the studies reviewed here addresses this issue
explicitly, i.e., Mochida and Harrington (2006), who state that “[r]egardless of scoring
method, Yes/No test performance accounted for over 75% of variance in the Overall
VLT [Vocabulary Levels Test] scores” (p. 86). Shared variance will also be investigated
in the study at hand.
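Shared variance is simply the squared correlation coefficient. As a minimal illustration, using the correlation coefficients cited earlier in this section (a generic calculation, not data from the present study):

```python
# Shared variance between two measures is the squared correlation coefficient (r^2).
# The r values below are the coefficients reported in the studies cited above.
correlations = {
    "Meara & Buxton (1987)": 0.703,
    "Anderson & Freebody (1983)": 0.84,
    "Mochida & Harrington (2006), upper end": 0.88,
}

for study, r in correlations.items():
    shared_variance = r ** 2  # proportion of variance in one score explained by the other
    print(f"{study}: r = {r:.2f}, shared variance = {shared_variance:.0%}")
```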
In a YN test, the real words a test taker marks as “known” are called hits, while real words rated as “unknown” are termed misses. Pseudowords rated as known are
considered false alarms, and pseudowords checked as unknown are correct rejections
(Mochida & Harrington, 2006; Stubbe, 2015). The distinction between these four
categories is relevant for the scoring of YN tests in the present study.
As Meara (2005) states, “[t]he real difficulty, however, is not the production of
the tests, but how we interpret the scores that the tests generate” (p. 278). Mochida and
Harrington (2006) also address the issue of how to score YN tests. The simplest solution
is to combine the correct responses (hits) and correct rejections to a total score.
However, according to Huibregtse, Admiraal and Meara (2002), this approach is
problematic, as the hit rate is an indicator of vocabulary knowledge, while the correct
rejections (or false alarms) are an indicator for the amount of guessing. Currently, there
are three different approaches for dealing with the false alarm rate (Schmitt, 2010;
Stubbe, 2015):
(1) Setting a maximum amount of false alarm responses as a threshold for
exclusion of a test taker: For example, Schmitt, Jiang, and Grabe (2011)
suggest a 10% acceptance rate (3 out of 30 items), and Stubbe (2012) applied
a 12.5% rate.
(2) Adjusting the YN scores by using one of several different proposed formulas
(see Huibregtse et al., 2002 or Stubbe, 2015). The simplest formula is to
subtract the false alarm rate from the hit rate to arrive at a score that better
reflects vocabulary knowledge (“true score”).
(3) Applying a regression model to use the YN test scores to predict scores of
translation tests (Stubbe & Stewart, 2012).
There does not seem to be a consensus as to what is the best approach or which formula in Approach 2 works best (Schmitt, 2010). For the purpose of this study, the second approach will be applied (hit rate minus false alarm rate, h − f).
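A minimal sketch of this correction on hypothetical response data (the counts below are invented; only the roughly 100:25 real-to-nonsense ratio mirrors the ratio aimed for in this project):

```python
# Approach 2: adjusted ("true") score = hit rate - false alarm rate.
# Hypothetical responses: True = "I know this sign", False = "I do not know this sign".
real_sign_responses = [True] * 70 + [False] * 30   # responses to 100 real signs
nonsense_responses = [True] * 5 + [False] * 20     # responses to 25 nonsense signs

hit_rate = sum(real_sign_responses) / len(real_sign_responses)        # hits / real items
false_alarm_rate = sum(nonsense_responses) / len(nonsense_responses)  # false alarms / nonsense items

adjusted_score = hit_rate - false_alarm_rate
print(f"hit rate = {hit_rate:.2f}, false alarm rate = {false_alarm_rate:.2f}, "
      f"adjusted score = {adjusted_score:.2f}")
```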
Of interest is also the false alarm rate across different studies. The reported range
is considerable. For example, Mochida and Harrington (2006) report less than 5% false
alarms in a study with English L2 university students; Stubbe (2012) reports a false
alarm rate of a little over 4% in Japanese learners of English at university level;
Harrington and Carey (2009) report a false alarm rate of 17% in a study of English L2
learners in Australia, and Eyckmans (2004) reports 13-25% false alarms in French
learners of Dutch at different levels. The false alarm rate will also be investigated in the present study.
Beeckmans et al. (2001) report that there are no clear guidelines regarding the
ratio of real words to pseudowords. For example, among the ratios reported in different
studies are 30:3 (Schmitt & Zimmermann, 2002), 90:60 (Mochida & Harrington, 2006),
96:32 (Stubbe, 2015), 60:40 (Eyckmans, 2004; Meara & Buxton, 1987), 72:28
(Harrington & Carey, 2009), and 40:20 (Meara, 1992). For the purpose of the present
project, a ratio of approximately 100:25 will be the goal.
2.2.2 Translation tests and scoring issues
Productive tests have been used as a means for verifying the test takers’ self-
reported vocabulary in a YN test (e.g., Stubbe et al., 2010; Stubbe, 2015). For the
purpose of this study, a productive test has been developed and used. The simple form
of a translation test is that an L1 word is provided to the test taker and he or she produces
the L2 translation (Laufer & Goldstein, 2004). In the study of Laufer and Goldstein
(2004), test takers produce the L2 translation in written form. In the DSGS vocabulary
test used in the present study, the L2 translation consists of a DSGS sign.
The responses are scored manually by two raters. Stewart (2012) cautions that hand-scored production tests can result in inconsistency between raters. An
important issue concerning the development of scoring instruments is (1) to define a
criterion of correctness and (2) to decide whether two (i.e., right/wrong) or more degrees
of correctness (with partial credit) should be used (Bachman & Palmer, 1996). For
assessing a single area of language knowledge, such as vocabulary, the right/wrong
distinction can be useful, but when different areas of language knowledge are assessed,
partial credit on more levels might be needed (Bachman & Palmer, 1996). The basis for
the development of the scoring instrument for the study at hand will be presented in
Section 2.3.1.
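Inter-rater reliability of the translation test is later reported with Cohen’s Kappa (see Section 4.3.2). A minimal sketch of that computation on two hypothetical vectors of dichotomous (right/wrong) ratings:

```python
# Percentage agreement and Cohen's kappa for two raters' dichotomous scores.
# 1 = sign judged correct, 0 = sign judged incorrect; the vectors are hypothetical.
rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_b = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]

n = len(rater_a)
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # observed agreement p_o

# Expected chance agreement p_e from each rater's marginal proportions of 1s and 0s
p_a1, p_b1 = sum(rater_a) / n, sum(rater_b) / n
expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)

kappa = (observed - expected) / (1 - expected)
print(f"observed agreement = {observed:.2f}, Cohen's kappa = {kappa:.2f}")
```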
2.3 Structure, Acquisition, Learning, and Assessment of Sign Languages, with
Particular Emphasis on DSGS
Sign language research, which started in the early 1960s in the United States
(Stokoe, 1960), is a rather young field of linguistics. Research on DSGS began in the
1980s in the German-speaking part of Switzerland (Boyes Braem & Rathmann, 2010;
Boyes Braem et al., 2012). Research has shown that sign languages have their own
grammar and lexicon (e.g., Pfau, Steinbach, & Woll, 2012), that children acquiring a
sign language as L1 follow the same developmental stages as children acquiring a
spoken language as L1 (e.g., Lillo-Martin, 1999; Morgan & Woll, 2002; Mayberry &
Squires, 2006), and that sign languages are processed in similar brain regions as are
spoken languages (e.g., Emmorey, 2011).
The major focus in this section will be on the phonological structure and the
lexicon of DSGS, as these aspects are most relevant for the development of the
vocabulary tests that are at the core of the dissertation.
2.3.1 Basic structure of sign languages
A widely held misconception of sign languages is that they are universal. Sign
languages have some features in common, stemming from their shared visual-spatial
modality, and they also seem to share some properties of gestural communication (e.g.,
Emmorey & Herzig, 2003). However, sign languages are also distinct from each other,
as cross-linguistic studies of different sign languages have shown (Zeshan, 2004a, 2004b; Zeshan & Perniss,
2008). There is even evidence of regional variation within some sign languages, for
example, in DSGS, which has no standardized form but is composed of five regional
dialects (Boyes Braem, 1984).
Sign languages typically feature many signs with a high degree of iconicity, i.e.,
signs whose phonological forms bear some resemblance to some aspect of the referents.
For example, the sign AUTO⁴ (‘car’) in DSGS depicts a person using the steering wheel of a car (see Figure 1). (For a discussion of iconicity in sign languages, see Langer, 2005.)

⁴ AUTO (‘car’) is an example of a sign language gloss, a label reflecting a principal aspect of the meaning of a sign (e.g., Ebling, 2016). Glosses are typically written in all caps. In this paper, the German glosses for DSGS signs are complemented with the English meaning in parentheses.
Whereas in spoken languages, the larynx, throat, palate, tongue, teeth, and lips
serve as articulators, the articulators in sign languages have been divided into two
distinct categories: manual and non-manual components (e.g., Boyes Braem, 1995). The
manual components are the hands and the arms. Non-manual components include
several features of the face (mouth, cheeks, eyes, nose, eyebrows, eye gaze) as well as
positions and movements of the head and the upper torso (Boyes Braem, 1995; Sutton-
Spence & Woll, 1999). Manual and non-manual components are usually produced
simultaneously.
Non-manual components that are produced with the mouth can refer to the lip
movements of a spoken word, termed mouthings, and have been found to be common
in most European sign languages. One of their uses is to distinguish between signs that
have the same manual form (manual homonyms), as with the DSGS signs BRUDER
(‘brother’), SCHWESTER (‘sister’), and GLEICH (‘same’). Another non-manual
component, eye gaze, can be used, e.g., to re-establish reference in signing space, and
raised eyebrows can be added to manual signs to distinguish an interrogative sentence
from a declarative sentence (e.g., Pfau & Quer, 2010).
Figure 1: The lexical sign AUTO (‘car’) (Boyes Braem, 1995; reprinted with
permission of the author)
The manual components are traditionally considered the smallest building
blocks of sign language phonology. These sub-lexical units have been termed the basic
parameters of signs. They were first investigated in ASL by Stokoe (1960) and later
extended by Battison (1978) and Klima and Bellugi (1979). The sub-lexical manual
parameters, which together compose a sign, are the following:
(1) Handshape: The configuration of the hand, e.g., a fist or an open flat hand.
Different sign languages use slightly different sets of handshapes, even
though there are six unmarked handshapes (Figure 2) that occur in most sign
languages known to date. These unmarked handshapes are also the easiest to
produce and to perceive and are among the first to be acquired by children
(e.g., Johnston & Schembri, 2007).
(2) Location: Where a specific sign is articulated, e.g., in front of the signer in
the signing space, on the upper torso, on the arm, or the face.
(3) Movement: The form and path of the movement of the hand are described
along three different planes in 3D space and with respect to speed (fast or
slow).
(4) Hand orientation: The orientation of the palm of the hand, e.g., towards the
signer’s body or away from the signer’s body. For open handshapes, the
orientation of the extended fingers is additionally described.
Figure 2: The six unmarked handshapes (Boyes Braem, 1995; reprinted with permission
of the author)
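Purely as an illustration of how the four manual parameters listed above might be recorded for a conventional sign in a test item bank, a sketch follows; the class and all field values are invented placeholders, not an actual DSGS transcription system:

```python
from dataclasses import dataclass

# Illustrative only: one possible record structure for a conventional sign in an item
# bank, using the four manual parameters. All field values are invented placeholders.
@dataclass
class SignEntry:
    gloss: str        # German gloss of the DSGS sign (plus English meaning)
    handshape: str    # configuration of the hand, e.g., fist or open flat hand
    location: str     # where the sign is articulated
    movement: str     # form, path, and speed of the hand's movement
    orientation: str  # orientation of the palm (and extended fingers)

item = SignEntry(
    gloss="AUTO ('car')",
    handshape="fist (both hands)",
    location="neutral signing space in front of the signer",
    movement="small alternating movements",
    orientation="palms facing each other",
)
print(item)
```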
Many sign language users also make use of a manual alphabet, in which a
specific handshape represents a letter of the alphabet of the surrounding spoken
language so that entire spoken language words can be “spelled out” in space. This
communication system, also known as fingerspelling, is used together with DSGS to
introduce new or unknown concepts for which there are no established signs, such as
proper names of persons or locations (Boyes Braem, 1995).
Concerning the sign language lexicon, Johnston and Schembri (2007) propose a
model for the organization of the mental lexicon in sign languages, based on their
research on Auslan. They divide the mental lexicon into a native and a non-native sign
language lexicon. The native lexicon is further divided into a conventional and a
productive lexicon (Figure 3). The conventional (or, established) lexicon consists of
signs (lexical types) that have a stable form-meaning relationship, e.g., the DGS sign
AUTO (‘car’), which can be used in different contexts without a change in meaning
(König, Konrad, & Langer, 2012).
Figure 3: Model of sign language lexicon (according to Johnston & Schembri, 2007)
The productive lexicon is quite different and does not consist of an easy-to-
determine number of signs. Sign forms that can be labeled as “productive” are realized
and understood in a given context to convey a specific meaning. The signs themselves
are not conventionalized, although their sub-lexical units (especially the handshapes)
are. The sub-lexical units of productive signs are combined in a context-specific way to
convey, e.g., the meaning of “a person is approaching me”. To represent the concept of
person, the signer needs to select a specific handshape (often a single upright index
finger) and the location, movement, and orientation of the hand, then transmit the
meaning of how and from where the person is approaching and with what kind of path
(straight, wavy, etc.). Accordingly, when the sign is produced in a different location
with a different direction and manner, the meaning can change from “a person came
straight at me” to “meandered slowly away”. Because of the multiple possibilities for
varying the parameters of the form, no citation entry in the mental lexicon is possible,
i.e., no base form exists for productive signs. This is why productive forms, while used
extensively in actual signing, often do not appear in (printed or electronic) sign language
lexicons.
The number of conventional sign types of a sign language is difficult to
determine: Estimates range from 2,500 to 5,000 signs for Auslan and DGS, respectively
(Ebbinghaus & Heßmann, 2000; Johnston & Schembri, 1999). Since there are a
potentially large number of context-specific meanings, the size of the productive lexicon
cannot be determined.
The non-native lexicon comprises, for example, loan signs from other sign languages, which (through the process of lexicalization) may eventually become part of the native conventional lexicon.
For the purpose of the present research, only signs of the native, conventional
lexicon will be considered for the vocabulary test. In order to arrive at a concept
comparable to that of word families (Read, 2000), one could include signs that involve
morphological changes to the lexical base form (Johnston & Schembri, 2007). However,
the problem remains that this group of signs is less clearly defined for sign languages
than for spoken languages, which would have an impact on the definition of what is a
correctly produced sign in a DSGS vocabulary test. Just considering sign types that are
known to have a stable form-meaning relationship (König et al., 2012) is further
complicated by the fact that there exists little research on acceptable phonetic variations
of signs (for an exception, see Arendsen, 2009). A study investigating acceptable
variants of lexical signs in L1 and L2 users of DSGS is currently underway (Ebling et
al., submitted). In this study, 40 L1 and L2 learners were prompted with a DSGS gloss
and an associated German example sentence and asked to produce the sign. The data
were later analyzed by two Deaf and one hearing sign language researcher, who grouped
the productions according to the following six categories:
(1) Identical to target sign.
(2) Acceptable variant of the target sign, i.e., has the same meaning but a slightly
different form (these are the acceptable phonetic variants mentioned above).
(3) Acceptable variant involving a grammatical or semantic modification of the
target sign, which results in slightly different but closely related meaning,
e.g., repetition of the sign SPRACHE (‘language’) to indicate the plural
form.
(4) Acceptable variant of the target sign in that it has the same meaning but a
different form, e.g., a dialect variant (recall that DSGS features five different
dialects).
(5) Not an acceptable variant of the target, e.g., although the produced form has
the same meaning, it involves, for example, an unacceptable handshape.
(6) Not an acceptable variant of the target in that it not only has an entirely
different form but also a different meaning (Ebling et al., submitted).
These six categories served as the basis for the development of the scoring
instrument of the translation test. Categories (1) to (4) were used as criteria to judge a
produced sign as correct, while Categories (5) and (6) were used as criteria to classify
signs as incorrect.
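A minimal sketch of this dichotomization (the category labels are paraphrased from the list above; the function and example ratings are illustrative, not part of the actual scoring instrument):

```python
# Collapse the six rating categories into dichotomous scores:
# Categories 1-4 count as correct (1 point), Categories 5-6 as incorrect (0 points).
CATEGORY_LABELS = {
    1: "identical to target sign",
    2: "acceptable phonetic variant",
    3: "acceptable grammatical/semantic modification",
    4: "acceptable variant with a different form (e.g., dialect variant)",
    5: "unacceptable variant (same meaning, unacceptable form)",
    6: "different form and different meaning",
}

def score_production(category: int) -> int:
    """Return 1 (correct) for Categories 1-4 and 0 (incorrect) for Categories 5-6."""
    if category not in CATEGORY_LABELS:
        raise ValueError(f"unknown category: {category}")
    return 1 if category <= 4 else 0

# Hypothetical ratings by one rater for five item productions
ratings = [1, 4, 5, 2, 6]
total = sum(score_production(c) for c in ratings)
print(f"raw translation score: {total} / {len(ratings)}")
```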
2.3.2 Sign language acquisition and learning
There have been quite a few studies on the acquisition of a sign language as a
first language of children, for example, on the acquisition of first signs (for ASL:
Anderson & Reilly, 2002; for BSL: Woolfe, Herman, Roy, & Woll, 2010). However,
very little research exists on the learning of a sign language by adults (e.g., Woll, 2013).
Some researchers distinguish between bimodal and unimodal language learning.
Bimodal language learning concerns hearing adults who have acquired a spoken language from birth and are now learning, as an L2, a sign language that is produced and perceived in a different modality. Unimodal language learning studies investigate Deaf
adults who have acquired one sign language as L1 and learn a second sign language as
an L2 in adulthood (Chen Pichler, 2013). This distinction is of importance when
studying sign language learning by adults but will not be considered for the purpose of
this study. Previous studies have focused on (phonological) production errors in
beginning adult learners of a sign language (e.g., Chen Pichler, 2010; Rosen, 2004), on
the potential effect of transfer of prior gestural experience of hearing learners to sign
language learning (e.g., Chen Pichler, 2010; Taub, Galvan, Pinar, & Mather, 2008), the
different uses of eye gaze to establish referents in the ASL verb system in L1 and L2
learners (Thompson, Emmorey, & Kluender, 2009), or the difference in the lexical
processing of iconic signs in L1 and L2 users (Thompson, Vinson, & Vigliocco, 2009).
However, to date, there are no studies investigating the learning of the lexicon of a sign
language by adult learners that could inform the development of vocabulary size tests
in sign languages.
2.3.3 Sign language tests
Most publications dealing with sign language tests target Deaf children
acquiring a sign language as a first language (for an overview, see, e.g., Enns et al.,
2016; Haug, 2005). Among them are a number of publications dealing with the
development of vocabulary tests (picture-naming tasks), either as an integrated part of
a larger test battery (e.g., for Sign Language of the Netherlands: Hermans, Knoors, &
Verhoeven, 2009) or as an independent vocabulary test to evaluate the strength of
vocabulary knowledge in Deaf children (Mann & Marshall, 2012; Mann, Roy, &
Morgan, 2015).
Only a few publications introduce test instruments for adult learners of a sign
language. One example of such an instrument is the Sign Language Proficiency
Interview for ASL (SLPI; e.g., Caccamise & Samar, 2009), which is an adaptation of
the Oral Proficiency Interview for English. The scoring instrument of the SLPI includes
the criterion “vocabulary knowledge” but only as a very broadly defined construct
across different levels. Other examples of sign language tests for adult learners are the
Sentence Reproduction Test for ASL (Hauser, Supalla, & Bavellier, 2008) as a “global,
objective assessment ASL proficiency test” (p. 171) or the ASL Discrimination Test
(Bochner et al., 2016), which tests the “learners’ ability to discriminate phonological
and morphophonological contrasts in ASL, [and] provides an objective overall measure
of ASL proficiency” (p. 473). In short, because there are (1) no acquisition studies on the learning of vocabulary in adult learners and (2) no vocabulary tests available in the published literature, test development cannot be informed by current research.
2.4 Test Takers’ Feedback in Language Testing Research
The literature for spoken language testing emphasizes the importance of
involving test takers in the validation process (e.g., Weir, 2005). The focus of involving
test takers, however, can be very diverse and includes, for example, investigating the
test takers’ attitudes and acceptance of computer-delivered TOEFL tests (Stricker,
Wilder, & Rock, 2004), investigating the consequences of taking a test from the
perspective of a test taker (Ryan, 2014), and test takers’ perceived difficulty of language
tasks (e.g., Elder, Iwashita, & McNamara, 2002). Other studies have focused on test
takers’ feedback on a language test as input for quality management (Yan,
Thirakunkovit, Kauper, & Ginther, 2016), test takers’ feedback on what a test measures
(Sato & Ikeda, 2015), and test takers’ feedback during the development of a language
tests (e.g., Brown, 1993). While a complete review of the existing studies on test takers’
involvement in language testing research is beyond the scope of this study, suffice it to
say that feedback from test takers is important and has been included in the present
study.
Whereas the involvement of test takers in studies on language testing seems to
be quite common, for sign language test research, there is, to the knowledge of this
researcher, no published study that addresses the involvement of, or even feedback from, test takers.
2.5 Reliability and Validity
Reliability provides information on the consistency of a measure (Mousavi,
2012; Rust & Golombok, 2000). Three measures of reliability are commonly reported
in the literature. The first is stability over time (test-retest reliability), the second is the
internal consistency of a test, and the third is inter-rater agreement (Kline, 2000). For
the purpose of this study, only the latter two types of reliability will be investigated. The
internal consistency of a test refers to “the degree to which scores on individual items
or group of items on a test correlate with one another” (Davies et al., 1999, p. 86). A
measure of internal consistency is Cronbach’s Alpha, which will be applied for the YN
test and the translation test, separately for Rater 1 and Rater 2. A Cronbach’s Alpha
value of above .70 is acceptable, but >.80 is preferable (Pallant, 2005).
Inter-rater reliability refers to the level of agreement between two or more raters
on a participant’s performance (Davies et al., 1999), for example, by video-recording a
test taker’s sign language production and then comparing the scoring of two different
raters. For nominal level data and the case of two raters, Cohen’s Kappa can be applied
(Cohen, 1960).
The core property of a valid test is that it measures what it claims to measure
(Kline, 2000). There are different kinds of validity (see below). For the current study,
only content, face, and concurrent validity will be targeted. Content validity refers to
“the attempt to show that the content of the test is a representative sample from the
domain that is to be tested” (Fulcher & Davidson, 2007, p. 6). Davies and colleagues
refer to content validity as a “systematic analysis of the test content to determine
whether it includes an adequate sample of the target domain to be measured” (Davies
et al., 1999, p. 34). According to Alderson, Clapham, and Wall (1995), face validity
“involves an intuitive judgement about the test’s content by people whose judgement is
not necessarily ‘expert’” (p. 172). The judgement can come from test takers or
administrators and can be more holistic (Alderson et al., 1995). Concurrent validity can
be demonstrated through a high correlation between the targeted test and another test
that measures the same criterion or construct (Davies et al., 1999).
2.6 Chapter Summary and Research Questions
Vocabulary knowledge can be seen as contributing to a larger picture of
language ability, and thus can inform the development of vocabulary tests (e.g., Read
& Chapelle, 2001; Schmitt, 2010). Framing DSGS vocabulary knowledge as part of a
model of CLA (Bachman, 1990) in order to design an embedded, comprehensive, and
context-dependent vocabulary test would be desirable. However, considering the state
of research of sign languages in general and of DSGS in particular, this is not possible.
As the preceding sections have shown, the state of sign language research is immature
compared to that of spoken language research. This is even more true for the specialized
field of sign language testing.
Based on the reviewed literature on vocabulary assessment for spoken language
(e.g., Read, 2000) and sign language structures and assessment, it seems reasonable to
claim that basic concepts such as vocabulary size and reception/production are equally
applicable to sign languages. There is a difference in how the mental lexicon of sign
languages is organized (Johnston & Schembri, 2007) compared to that of spoken
languages. This has an influence on how the basic unit of vocabulary tests (for English:
word families; for DSGS in this study: sign types) is defined. The basic unit here is
also influenced by the state of sign language research, in particular, research on
acceptable phonetic variants in DSGS. Studies on different forms of vocabulary
assessment, i.e., YN tests (e.g., Beeckmans et al., 2001) and translation tests (e.g.,
Laufer & Goldstein, 2004), will lend methodological support for developing and
evaluating reliability and validity of the two DSGS vocabulary tests. Test takers’
feedback will also be included to approach face validity.
The construct of the two DSGS vocabulary tests will be defined as the size of
vocabulary knowledge of beginning adult learners of DSGS at the level of A1, which
results in a discrete, selective, and context-independent vocabulary size test (Read,
2000). The next step is the operationalization, which “enables us to relate the constructs
we have defined theoretically to our observations of behavior. This step involves, in
essence, determining how to isolate the construct and make it observable” (Bachman,
1990, p. 42-43). The final step will be the development and evaluation of the two DSGS
vocabulary tests. This leads to the following research questions:
(1) Do the two vocabulary tests show evidence of psychometric properties?
a. Item analysis: Do the tests show evidence of item facility and item
discrimination?
b. Is there evidence of reliability in the two vocabulary tests?
i. Is there evidence of internal consistency?
ii. Is there evidence of inter-rater reliability in the scorings of the
translation test?
c. Is there evidence of validity in the vocabulary tests?
i. Is there evidence of content validity?
ii. Is there evidence of face validity (as included in test takers’
feedback)?
(2) Performance of the DSGS learners on the two tests:
a. Can the variables (1) self-assessment of DSGS skills, (2) number of
DSGS courses attended, and (3) contexts in which DSGS has been
learned contribute to the (concurrent) validity of the two vocabulary
tests?
b. Is there a relationship between the scores of the YN test and the
translation test?
(3) How effective are the nonsense signs in the YN test?
3. Methodology
Section 3.1 presents the development of the instruments, Section 3.2 the recruiting of study participants, Section 3.3 the pilot studies and the resulting revisions, Section 3.4 the main study, and Section 3.5 the methodological approaches chosen for analyzing the data. Table 1 provides an overview of the different steps in the methodology section.
Table 1
Overview of Studies and Instruments, with Section Numbers

Study | Date | Number of participants | Instruments used
Pilot study, test takers (Section 3.3.1) | June 2017 | 3 | Yes/No Test (Section 3.1.2); L1/L2 Translation Test (Section 3.1.3); Background questionnaire for test takers (Section 3.1.4); Feedback questionnaire (Section 3.1.5); Consent form (Section 3.1.6); Background information of project (Section 3.1.6)
Pilot study, raters (Section 3.3.2) | June 2017 | 2 | Video-recordings of L1/L2 Translation Test (Section 3.1.3); Scoring instrument (Section 3.1.3); Background questionnaire for raters (Section 3.1.4); Consent form (Section 3.1.6)
Main study, test takers (Section 3.4) | June/July 2017 | 20 | Yes/No Test (Section 3.1.2); L1/L2 Translation Test (Section 3.1.3); Background questionnaire for test takers (Section 3.1.4); Feedback questionnaire (Section 3.1.5); Consent form (Section 3.1.6); Background information of project (Section 3.1.6)
3.1 Development of Instruments
For the purpose of this research project, the following instruments were
developed:
(1) Yes/No Vocabulary Test for DSGS (from here on termed simply “YN test”);
(2) L1/L2 Vocabulary Translation Test for DSGS (from here on referred to as
“translation test”);
(3) Scoring instrument for the translation test;
(4) Background questionnaires for the test takers and raters;
(5) Feedback questionnaire for the two DSGS vocabulary tests, to be completed
by the test takers.
In the following subsections, the development of the individual instruments is
described in more detail.
3.1.1 Item selection for the two vocabulary tests
Due to the absence of a large corpus for DSGS, it was not possible to create a corpus-based frequency list of DSGS signs, such as exists for English (e.g., Laufer et al., 2004), that could be used as the basis of a vocabulary test.
The researcher made use of a list of 110 DSGS vocabulary items (including 10
practice items) developed as part of a project entitled “Scalable Multimodal Sign
Language Technology for Sign Language Learning and Assessment (SMILE)”
(http://www.idiap.ch/project/smile/) funded by the Swiss National Science Foundation.
The project aims at developing an automatic sign language recognition system to
employ within a DSGS vocabulary test for the CEFR level A1. The items used in the
test were selected from existing DSGS teaching materials (Boyes Braem, 2004a, 2004b,
2005a, 2005b) known to correspond to the CEFR level A1. The DSGS teaching
materials are used as part of four levels of DSGS courses offered by the Swiss Deaf
Association. The number of sign types available in the DSGS teaching materials is
approximately 3,800 (Boyes Braem, 2001). In order to reduce this number to 100+10
(test and practice items), the following steps were applied (Boyes Braem, Ebling, Sidler-
Miserez, Tissi, & Haug, 2016):
(1) Remove name signs, i.e., signs for persons (e.g., CHARLY CHAPLIN),
organizations (e.g., name of a university), and places (e.g., country names),
as many of these are borrowed from other sign languages.
(2) Remove body-part signs like NASE (‘nose’), as these are often produced
by merely pointing at the respective body part, i.e., using an “indexing
technique”.
(3) Remove pronouns like DU (‘you’), as they also correspond to indexical
signs.
(4) Remove number signs, as they tend to have several regional variants, e.g.,
the number sign ELF (‘eleven’).
(5) Remove signs making use of fingerspelling, like the sign JANUAR
(‘January’), which involves the letter J from the DSGS manual alphabet.
(6) Remove signs composed of multiple successive elements, as most of these
signs also occurred in the DSGS teaching materials as separate lexemes.
For example, the sign ABENDESSEN (‘dinner’) is composed of the two
signs ABEND (‘evening’) and ESSEN (‘meal’), both of which are also
contained in the list of sign types of the DSGS teaching materials.
(7) Remove old signs, as current DSGS learners cannot be expected to know
them.
(8) Remove productive forms. The reason for this step was that the
phonological parameters of productive signs tend to be variable, which
poses a great challenge to the sign recognition system that is part of the
assessment framework in the SMILE project.
(9) Remove signs appearing in fewer than four of the five DSGS dialects.
(10) Reduce manual homonymy: Since the goal was to have as many different
sign forms in the vocabulary test as possible, form-identical signs were
identified (e.g., BRUDER (‘brother’), SCHWESTER (‘sister’), and
GLEICH (‘same’)) and only one chosen for the test. Preference was given
to that sign which was contained in a list of 1,000 common sign concepts
(Efthimiou et al., 2009).
(11) Remove signs that are very similar to well-known co-speech gestures, such
as the sign SUPER, which corresponds to a thumb-up gesture.
(12) Remove signs with German glosses that are lexically ambiguous. For
example, the German word AUFNEHMEN can have the meaning of record
or accept/include, concepts which in DSGS are expressed with two separate
signs. In cases like these, test takers confronted with the German gloss
AUFNEHMEN would not know which sign to produce.
(13) From the resulting pool of signs, concepts that also occurred in studies
investigating familiarity or subjective frequency ratings for BSL (Vinson,
Cormier, Denmark, Schembri, & Vigliocco, 2008) and ASL (Caselli,
Sehyr, Cohen-Goldberg, & Emmorey, 2017; Mayberry, Hall, & Zvaigzne,
2013) and in the list of 1,000 sign concepts (Efthimiou et al., 2009) were
prioritized. In this way, the 3,800 sign types from the DSGS teaching
materials were reduced to a set of 100+10 test items.
The item set was not balanced with respect to parts of speech, as is often done
when sampling items for a spoken language vocabulary test. This was because the
question of whether the concept of parts of speech can be applied to sign languages is a
highly debated one within sign linguistics (see, e.g., Erlenkamp, 2001).
The 110 items were evaluated as part of a pilot study in the SMILE project (see Footnote 5). As
a result, their number was reduced to 98 (5 practice and 93 test items). Typically, the
idea of practice items is to make the test taker familiar with the test format;
consequently, the results are not used for scoring. However, the study at hand is of
exploratory nature; therefore, data elicited from the practice items was nevertheless
included in the analysis.
3.1.2 Yes/No Test for DSGS
The same items were used for the YN test and the translation test. As verification
of the test takers’ knowledge is not possible (Stubbe, 2015), 30 nonsense signs were
created to control for the self-report aspect of the YN test. An approach similar to the creation of pseudowords for spoken language YN tests was applied for developing the nonsense signs for the DSGS YN test. The difference was that the nonsense signs were not constructed from scratch but selected by a Deaf researcher from the online resource “Spread the Sign”, a platform that includes signs from different sign languages
(http://www.spreadthesign.com). A similar approach was previously pursued for a
familiarity study of DGS signs in Germany (Hanke, 2014). The Deaf researcher chose
signs from eleven different sign languages that were plausible DSGS signs with respect
to their manual phonological form. The meanings of the nonsense signs were directly
translated from the meanings of the respective sign languages, i.e., the nonsense signs
consisted of a borrowed phonological form together with a new German mouthing
corresponding to its meaning in the source language. For example, the form of the BSL
sign KNOW was used as a nonsense sign for DSGS along with the German mouthing
/kennen/ (‘to know’). A Deaf trained sign language instructor evaluated the resulting
signs to verify that their phonological form indeed qualified them as possible DSGS
signs. The instructor was asked (1) whether the sign she was presented was a DSGS
sign (yes/no; if the answer was yes, the sign was excluded from the set of nonsense
signs) and, if the answer was no, (2) whether the sign was a plausible DSGS sign judging by its manual phonological form only (i.e., handshape, movement, hand position, and location). This resulted in 25 nonsense signs (Appendix A1 and A2) that were included in the YN test, amounting to a total of 123 items (= 5 practice items + 93 test items + 25 nonsense signs).

Footnote 5: For the pilot study, recordings of 40 L1 and L2 users of DSGS performing the 100+10 items were collected. Twelve items were then removed from this list because their glosses still proved to be ambiguous, i.e., test takers produced too many different sign variants for them.
All signs (i.e., the nonsense signs and the DSGS vocabulary items) were video-
recorded for the YN test, signed by an experienced DSGS model and using a blue screen
background. Following the five practice items placed at the beginning, the order of the
remaining signs was randomized as in the original list from the SMILE project. The
items received new item numbering (Appendix B1 and B2).
For the YN test, a Limesurvey installation was used. Limesurvey
(http://www.limesurvey.org) is an open-source software to conduct surveys. It allows
for embedding videos (.mp4 and .webm format) directly through the graphical user
interface. The YN test consisted of a welcome message/introduction in written German
(Appendix C1 and C2), followed by the videotaped signs. The test takers were not
informed that nonsense signs were included because the researcher hypothesized that
doing so might lead them to focus more on the mouthings to infer meanings. (The pilot
study later confirmed that test takers focused on the mouthing when they could not
understand a sign “as a whole”, i.e., the manual components together with the mouthing,
as further explained in Section 3.3.1.3.) Each sign was presented separately with the
question “Do you know the form and the meaning of the sign?”, with a “Ja”/“Nein”
(‘yes’/‘no’) button underneath, a “Weiter” (‘continue’) button to proceed to the next
item (Figure 4), and a “Zurück” (‘back’) button to return to the preceding item. By
default, the test taker was forced to choose one of the two response options (yes or no),
i.e., omitting responses was not possible. (There was only one occurrence of omission
due to a problem with the Internet connection.) All data were automatically sent to a
secure server for storage. The test was self-administered, but the researcher was present
during test taking in case there were any technical issues. The results were later imported
into SPSS for analysis.
Figure 4: Yes/No Test Practice Item 1 WIDERSPRUCH (‘contradiction’) (post-pilot
version with slightly changed wording of the task)
3.1.3 L1/L2 Translation Test and scoring instrument for DSGS
For the translation test, the same set of concepts was used, delivered in the same
order as in the YN test but omitting the nonsense signs, i.e., 98 signs (5 practice and 93
test items). The test was delivered as a PowerPoint presentation on a laptop. Following
an instruction in written German on the first slide, on each slide, the German word
appeared and a German sentence designed to disambiguate the meaning was provided.
For example, to avoid confusion of German homonyms like “Schloss”, which can have
the meanings of castle and lock, which are expressed by two different signs in DSGS,
the following example sentence was provided (Figure 5): “In Lenzburg steht ein
Schloss.” (‘There is a castle in Lenzburg [name of a Swiss city].’). The test taker was
seated at a table facing a video camera and with the laptop placed on the table to the
right- or left-hand side of the test taker, depending on his or her preference. The test
administrator/researcher clicked through the slides so that the test taker could look
directly at the video camera while producing the sign.
Figure 5: Sample slide from the L1/L2 Translation Test showing Item 10, the German
word “Strasse” (‘street’) and the sentence “Diese Strasse ist neu.” (‘This street is new.’)
The criterion of correctness (Bachman & Palmer, 1996) was defined as the
accuracy of the translation of the German word into DSGS. Building on the work of
Ebling and colleagues (submitted) on acceptable variants of lexical DSGS signs, the
resulting categories (as presented in Section 2.3.1) informed the criterion of correctness.
For a correctly produced DSGS sign, a score of “1” was assigned, for an incorrect form,
a score of “0”. After deciding if a sign was correct or not, the rater could, as data for
future research, also specify one of the six categories that cover acceptable or
unacceptable variants in this study. The scoring instrument was created using Excel
(Appendix D). Apart from the decision of whether the sign was correct or not, an
additional category “no sign produced” was added. Test takers also received a score of
“0” for these not produced signs, but the data were collected separately from the
incorrect signs to allow for a more fine-grained analysis of inter-rater reliability. The
scoring instrument also included the possibility to comment on the test takers’
performance. The video-recorded data were scored independently by two raters and
compared afterwards.
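For illustration, the following minimal Python sketch shows how a rater's judgment could be translated into the binary score plus the separate “no sign produced” record described above. The function and variable names are hypothetical; the actual instrument was an Excel sheet filled in by the raters, not code.

```python
# Minimal sketch of the binary scoring logic described above (illustrative only).

# Categories (1)-(4) from Section 2.3.1 count as correct, (5)-(6) as incorrect.
ACCEPTABLE_CATEGORIES = {1, 2, 3, 4}

def score_production(category=None, no_sign_produced=False):
    """Return (score, label) for one item rated by one rater.

    category         -- one of the six categories from Section 2.3.1, or None
    no_sign_produced -- True if the test taker did not produce any sign
    """
    if no_sign_produced:
        # Scored 0, but recorded separately from incorrect productions
        return 0, "no sign produced"
    if category in ACCEPTABLE_CATEGORIES:
        return 1, "correct"
    return 0, "incorrect"

# Example: a dialect variant (Category 4) is scored as correct.
print(score_production(category=4))             # (1, 'correct')
print(score_production(no_sign_produced=True))  # (0, 'no sign produced')
```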
3.1.4 Background questionnaires
Two sets of background questionnaires were developed, (1) one for the hearing
test takers (Appendix E1 and E2, post-pilot version) and (2) one for both Deaf sign
language instructors (see Footnote 6) as part of a focus group and the Deaf raters (Appendix F1 and F2).
Both questionnaires were also delivered using Limesurvey. They represent adapted
versions of questionnaires used in the SMILE project, which in turn are based on
questionnaires developed as part of the German Sign Language Corpus project at the
University of Hamburg (Hanke, 2016). A self-assessment scale for test takers’ DSGS
skills was developed, based on a scale of the Language Experience and Proficiency
Questionnaire (LEAP-Q; Marian, Blumenfeld, & Kaushanskaya, 2007) as a result of the
pilot study (Section 3.3.1.3).
The background questionnaire for the test takers consisted of four sections,
covering (1) general background information (e.g., name, gender, age, place of
residence), (2) language and language learning background, including variables such as
first and preferred language(s), DSGS courses attended, and self-judgment of the
receptive and productive DSGS skills. Section (3) asked about background information
on professional/vocational training, work context (i.e., current position, use of DSGS at
work), and test takers’ use of DSGS in their free time. In Section (4), test takers were
asked about Deaf family members and, if present, whether they communicate with them
in DSGS. The name of the test taker was required in order to match the different
Limesurvey questionnaires, i.e., background questionnaire, YN test, and feedback
questionnaire, to the same person. The background questionnaire for the test takers
consisted of 32 questions. It took an average of ten minutes to complete the
questionnaire.
The background questionnaire for Deaf raters consisted of 42 questions. The
structure of the questionnaire was as follows: (1) general background information (e.g.,
name, gender, age, place of residence), (2) information on languages used and age of acquisition, (3) information on professional/vocational training (including whether the raters were trained sign language teachers) and current job situation (including questions on teaching DSGS to adult learners), and (4) family environment and language(s) used therein. It took about ten minutes to fill out the background questionnaire.

Footnote 6: Sign language instructors (N = 5) were included in a focus group session in order to receive some general feedback on the two tests and the scoring criteria for the translation test. Due to time constraints, these data were not evaluated for this dissertation.
3.1.5 Feedback questionnaire for test takers
Test takers were asked to fill out a short online feedback questionnaire after the
testing, also using Limesurvey (Appendix G1 and G2). The goal of this questionnaire
was to gather general feedback on the two different vocabulary test tasks. Studies on
test takers’ feedback for spoken language tests informed the development of this
questionnaire (e.g., Brown, 1993; Ryan, 2014; Sato & Ikeda, 2015; Yan et al., 2016).
The questionnaire consisted of a total of 13 questions. Apart from the name, test takers
were asked the following set of questions for both the YN and the translation test:
questions as to whether the introduction and purpose of the test were
clear/comprehensive, whether the tests matched the test takers’ DSGS skills, whether
the length of the tests was appropriate, whether the test was a good means for checking
DSGS learning, and whether one of the two test formats (YN test vs. translation test)
was preferable, including the possibility to leave a written comment. Filling out this
questionnaire took approximately five minutes.
3.1.6 Consent forms and information about the project
Two consent forms were developed, one for the test takers (Appendix H1 and
H2) and a slightly different version for the sign language instructors and raters
(Appendix I1 and I2). Both consent forms were based on templates developed at
Lancaster University and approved by the director of program at Lancaster University.
The consent form of the test takers also included the possibility of agreeing to the data
being shared in the larger context of the SMILE project after completion of the study.
Additionally, also based on a template from Lancaster University, a longer text with
some background information about the dissertation project was prepared, in some
sections with separate information for the test takers and the sign language teachers and
raters (Appendix J1 and J2). A fact sheet describing the SMILE project was also
provided (Appendix K1 and K2). All documents were created in or translated into
German by the researcher.
3.2 Recruiting Study Participants: Test Takers and Raters
For this study, the primary characteristic of the test takers (both in the pilot and the main study) was that they were beginning adult learners of DSGS as an L2/foreign language, i.e., roughly having DSGS skills at the A1 level according to the
CEFR. The researcher sent out an e-mail message as a call for participation (Appendix
L1 and L2) through different personal and professional networks (e.g., schools for the
Deaf, Swiss Deaf Association, professional organization of sign language teachers,
personal contacts to sign language teachers, own work context) and also posted the call
via different social media channels (Facebook, Twitter). In the call, the purpose of the
study was briefly explained and a link to a Doodle survey (http://www.doodle.com)
provided. This survey was anonymized, meaning test takers who signed up could not
see who else had signed up, only the time slots that were already taken. The recruitment
of test takers for the pilot study (N = 3) and the main study (N = 20) took place between
April and May 2017. The pilot study took place in the beginning of June 2017, the main
study between the end of June and the end of July 2017.
The two raters, who are also members of the SMILE project team, were recruited
by the researcher directly.
3.3 Pilot Studies
Two pilot studies were conducted: Pilot Study 1 addressed the test takers and Pilot Study 2 the raters.
3.3.1 Pilot study 1
3.3.1.1 Participants
There were three participants in the pilot study, two of whom were female and one male. The age range was 26 to 50 years (M = 36.3). Two of the test takers reported having a spoken Swiss German dialect as their L1 and having learned DSGS as adults; one listed as L1 a combination of Swiss German, Signed German (i.e., using signs while speaking at the same time and following the syntax of German), and DSGS. The criterion defined above regarding the DSGS level was not applied here, since the goal of this pilot study was solely to test the procedure and to gain feedback from the test takers. For the same reason, no additional background variables are reported here.
3.3.1.2 Procedure
The goal of the pilot study with the test takers as participants was twofold: (1)
to test the entire procedure and (2) to gain feedback on the different test instruments.
The pilot study session took place in a quiet room during normal working hours at the
researcher’s workplace. The researcher was present throughout the session. The session,
which took approximately one hour, included the following steps in the given order:
(1) Getting information about the project: The test takers received background
information on the MA dissertation project, were given time to fill out the
consent forms, read through the information sheets on the dissertation project
and the SMILE project, and ask questions;
(2) Filling out the background questionnaire online on a laptop;
(3) Taking the YN test online on a laptop;
(4) Taking the translation test (this included video-recording the test takers for
later analysis);
(5) Filling out the feedback questionnaire online on a laptop;
(6) Giving additional feedback: The researcher asked the test takers for
additional (general) feedback. The researcher also enquired about strategies
applied in cases where test takers did not understand a sign in the YN test.
3.3.1.3 Results
The results of Pilot Study 1 and, where applicable, changes made for the main
study are described below.
(1) YN test: Based on the oral feedback from the test takers, the wording of the
YN test was changed. In the pilot version, test takers had been asked whether
they knew the meaning and the phonological form of the sign (both in the
introduction/instructions and for each item). Since it was crucial to find out
whether the nonsense signs worked in the test, test takers of the pilot study
were asked about strategies they applied in cases where they did not
comprehend a sign. One test taker reported having replied “Nein” (‘no’) to questions asking whether he/she knew the nonsense signs; the other two reported that if they did not understand a sign as a whole, they looked at its mouthing
(recall from Section 3.1.2 that the nonsense signs had German mouthings) to
retrieve its meaning. In the revised version, both in the
introduction/instructions and for each item, the question was rephrased such
that knowing a sign means that the sign in its entirety must be familiar, i.e.,
both its manual components and its mouthing.
(2) Background questionnaire: The test takers reported that it would be good to
have a more detailed scale for assessing one’s own DSGS skills, and to have
the possibility of leaving a comment for the question related to language
learning. Both feedback items were included in a revised version of the
questionnaire. As for the self-assessment of DSGS skills, the German
translation of a ten-point scale from LEAP-Q, a standardized questionnaire
to evaluate language skills of bilinguals (Marian et al., 2007), was included.
(3) No feedback was delivered on the translation test or the feedback
questionnaire.
3.3.2 Pilot study 2
3.3.2.1 Participants
The two raters also filled out a background questionnaire. They are both female,
45 (R2) and 54 (R1) years old. Both raters acquired DSGS from birth through Deaf
family members and use DSGS both in their private and work context. They have been
living in the German-speaking part of Switzerland since birth and use mainly the DSGS
dialect of Zurich but also understand the other dialects. Both are trained sign language
instructors and have been teaching different groups of learners (beginning learners as
well as sign language interpreting students) for over 30 (R1) and nearly 20 years (R2),
respectively. R1 is a lecturer in the sign language interpreting program at the University
of Applied Sciences for Special Needs Education (HfH) and has also been involved in
many different DSGS research projects over the last 20 years. R2 also works as a research assistant at the HfH and for the Swiss Deaf Association.
3.3.2.2 Procedure
The goal of the pilot study with the raters as participants was to gather feedback
on the use of the scoring instrument for the translation test. The raters received the video
recordings of the test takers (translation test) and were asked to score them and provide
feedback for improving the scoring instrument. Prior to the scoring, the raters received
a short rater training on the use of the scoring instrument. Both raters were familiar with
the criteria of the scoring instrument from their use of it in the SMILE project.
3.3.2.3 Results
The two raters provided some feedback on the scoring instrument, most of which was related to its wording. For example, they suggested changing the wording “correct/not correct” to “Is the sign correct?”. The scoring time varied between 15 and 20 minutes per test taker.
3.4 Main Study
Following the pilot study, the procedure and instruments were revised for the
main study.
3.4.1 Participants
Of the 20 test takers, 5 were male and 15 female. The test takers were between
24 and 55 (M = 39.3) years old at the time of testing. Nineteen of the 20 test takers were hearing; one had a cochlear implant and had acquired German as her first language but was also learning DSGS as an adult. The majority of the test takers had a
spoken language (e.g., a Swiss German dialect or High German; n = 18) as their L1.
Two participants reported having grown up with two spoken languages. All participants
had learned DSGS as adults (range: 18-53 years old, M = 35.4).
The independent variables of this study are (1) self-assessment of DSGS skills
(reception and production), (2) number of DSGS courses attended, and (3) contexts of
DSGS learning (whether only in courses or also in other contexts). The first variable,
self-assessment, was measured on a 10-point scale, ranging from “very low” (1) to
“perfect” (10). The test takers assessed their receptive DSGS skills as ranging between
1 and 8 (M = 4.30, SD = 2.36) and their productive skills as ranging between 1 and 7 (M
= 4.10, SD = 2.15). Regarding the second variable, the test takers reported having
attended different numbers of courses, ranging from one beginning course to a total of
seven different courses (n = 17, 3 missing values; Table 2). Each course amounted to
approximately 30 contact hours in class.
Table 2
Number of Courses Attended by Test Takers (n = 17, 3 Missing Values)

Number of Courses | Frequency | Percent
1 | 3 | 17.6
2 | 1 | 5.9
3 | 5 | 29.4
4 | 2 | 11.8
5 | 2 | 11.8
6 | 2 | 11.8
7 | 2 | 11.8
Total | 17 | 100.0
As a third variable, test takers had been asked where they had learned DSGS.
The majority had learned DSGS through courses only (n = 13); others (n = 7) had
learned the language through a combination of courses, work, vocational
training/studying, and friends. For the data analysis, the responses given were coded
into two categories, “sign language courses only” and “sign language courses and other
contexts (e.g., work, Deaf friends)”.
3.4.2 Procedure
The general procedure was identical to that of the pilot study and will therefore
not be repeated here. The changes to the instruments were discussed in Section 3.3.1.3.
3.5 Data Analysis
The data collected for this project comprise:
(1) Test scores of the YN test (interval-level data)
(2) Test scores of the translation test (interval-level data)
(3) Questionnaire data (nominal- and ordinal-level data)
(4) Feedback questionnaire (mostly ordinal-level data)
To answer the research questions as laid out in Section 2.6, the statistical
procedures described below were applied.
3.5.1 Statistical assumptions
In a first step, the researcher checked whether the test scores (YN and translation
test) were normally distributed. Bachman (2004) suggests as a rule of thumb dividing
the skewness and kurtosis value by their respective standard error to check if the data
are normally distributed. If the value is between +2 and -2, the data can be considered
normally distributed, which allows for the use of parametric statistical methods for
further analysis, provided the requirements of parametric methods regarding the level
of measurement of the data are also met (Green, 2013). Analysis of the scores of the YN
and translation test showed all values to be within the range of +/- 2; hence, the data was
assumed to be normally distributed. Because of this, and because the data were of
interval level, parametric statistical procedures were applied (Appendix M1). The
normal distribution of the three sets of score distributions was further confirmed with a
test of normality using the Kolmogorov-Smirnov statistics (Pallant, 2016). The results
were not statistically significant, indicating that the samples were normally distributed
(Appendix M2). A visual inspection of the Normal Q-Q Plot confirmed these findings
(Appendix M3). No outliers were reported in the three score distributions.
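As an illustration of these checks, the following sketch (a minimal Python example, not the SPSS procedure actually used) applies Bachman's rule of thumb and a Kolmogorov-Smirnov test to a vector of scores. The simple standard-error approximations sqrt(6/n) and sqrt(24/n) are assumptions of this sketch, and the K-S test in SPSS additionally applies the Lilliefors correction.

```python
import numpy as np
from scipy import stats

def normality_checks(scores):
    """Rough normality screening as described in Section 3.5.1 (sketch only)."""
    x = np.asarray(scores, dtype=float)
    n = len(x)

    # Approximate standard errors of skewness and kurtosis (assumed formulas)
    se_skew = np.sqrt(6.0 / n)
    se_kurt = np.sqrt(24.0 / n)

    skew_ratio = stats.skew(x) / se_skew      # should lie between -2 and +2
    kurt_ratio = stats.kurtosis(x) / se_kurt  # excess kurtosis, same rule

    # Kolmogorov-Smirnov test against a normal with the sample mean and SD
    ks_stat, ks_p = stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1)))

    return {"skew_ratio": skew_ratio, "kurt_ratio": kurt_ratio,
            "ks_stat": ks_stat, "ks_p": ks_p}

# Example with made-up scores (not the study data)
print(normality_checks([17, 35, 48, 52, 60, 66, 70, 75, 82, 90, 95, 98]))
```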
A statistical significance level of .05 (2-tailed) was used, as is often applied in
social sciences (Dörnyei, 2007; Pallant, 2016).
For the YN test, the adjusted scores (“true scores”) served as the basis for all
analyses. For all analyses that involve the YN test score, one study participant was
excluded, as she/he had to terminate the YN test after the first 40 items due to technical
difficulties. The scores of the translation tests of Rater 1 and Rater 2 were treated
separately.
Nonparametric statistics were chosen for investigating inter-rater reliability of
the translation test, which was nominal level data (Gwet, 2014; Pallant, 2016), and to
compute correlations that involved the ordinal-level data described above (Pallant,
2016).
In addition, it was decided to follow Cohen’s (1992) scale for determining the
strength of the relationship of a correlation: (1) .10 to .29 as small, (2) .30 to .49 as
medium, and (3) .50 to 1.0 as large. Furthermore, the coefficient of determination, i.e.,
how much shared variance the two variables have in common, was calculated by
squaring the r or rs value (Pallant, 2016).
The analysis of effect size for a paired-samples t-test followed Cohen’s (1988)
scale of .01 as small effect, .06 as moderate effect, and .14 as large effect by applying
the eta squared formula (cited in Pallant, 2016, p. 253).
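A minimal sketch of this effect-size calculation is given below, assuming scipy and made-up ratings rather than the study data; eta squared for a paired-samples t-test is computed as t² / (t² + (N − 1)).

```python
import numpy as np
from scipy import stats

def paired_t_with_eta_squared(rater1_scores, rater2_scores):
    """Paired-samples t-test plus eta squared effect size (sketch, not the SPSS output)."""
    r1 = np.asarray(rater1_scores, dtype=float)
    r2 = np.asarray(rater2_scores, dtype=float)

    t, p = stats.ttest_rel(r1, r2)
    n = len(r1)

    # Eta squared for a paired t-test: t^2 / (t^2 + (N - 1))
    eta_sq = t ** 2 / (t ** 2 + (n - 1))
    return t, p, eta_sq

# Example with made-up ratings for five test takers (not the study data)
t, p, eta_sq = paired_t_with_eta_squared([50, 60, 44, 72, 38], [45, 55, 40, 70, 33])
print(round(t, 3), round(p, 3), round(eta_sq, 3))
```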
3.5.2 Descriptive statistics of the test scores
In the pilot phase of test development, it is important to look at the overall
performance of a test (Alderson et al., 1995). Therefore, in a first step, the descriptive
statistics of the YN test and translation test scores are reported.
3.5.3 Item analysis of the test scores
Item analysis is one of the first statistical analyses applied in test development
(Green, 2013). The primary goal of the item analysis for the work reported here was to
obtain information on (1) the level of difficulty of the test items (facility value or FV)
and (2) the discrimination of the items, i.e., how well the items differentiated between
test takers with different levels of ability (Alderson et al., 1995; Osterlind, 2001).
The FV was investigated by calculating the item mean for each individual item.
The highest possible value of FV is +1.0, indicating that 100% of all test takers
performed correctly on the respective item; the lowest value is 0, indicating that no test
taker performed correctly on the respective item. Bachman (2004) suggests as a rule of
thumb selecting items with an FV between .20 and .80. Since the sample of the study at
hand is quite small and heterogeneous (as can be seen from the range of the raw scores),
items with FVs from .20 to .90 were retained in the item set in order not to reduce the set too much
at this stage of test development.
Discrimination of items can be investigated by calculating the corrected item-
total correlation (CITC). The CITC values range on a scale from +1.0 to -1.0. A value
of +1.0 indicates that test takers with a high overall score are likely to get that particular
item right and test takers with a low overall score are likely to get that particular item
wrong (Lienert & Raatz, 1998; Osterlind, 2001; Rust & Golombok, 2000). A negative
value indicates that better test takers are likely to answer an item incorrectly, whereas
test takers with lower scores are likely to get it right (Green, 2013). There is no definitive
threshold for the CITC value with respect to item selection; some authors suggest a
CITC of >.30 (Carr, 2011; Green, 2013) or >.40 (Alderson et al., 1995). Popham (2000)
provides an overview regarding different levels of discrimination, according to which
even items with a CITC value between .30 and .39 are “reasonably good but possibly subject to improvement” (p. 324). For the purpose of this study, a CITC value of >.30 was defined as the threshold. Items that fell outside the defined range for FV or CITC were removed.
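To make the two indices concrete, the following sketch (Python with pandas, illustrative only; the actual analyses were run in SPSS) computes FV as the item mean and CITC as the correlation of each item with the total score excluding that item, and flags items falling outside the thresholds just defined.

```python
import numpy as np
import pandas as pd

def item_analysis(responses, fv_range=(0.20, 0.90), citc_threshold=0.30):
    """Facility values and corrected item-total correlations for a 0/1 item matrix.

    responses -- DataFrame with one row per test taker and one 0/1 column per item
    Returns a DataFrame with FV, CITC, and a flag for items outside the criteria.
    """
    fv = responses.mean()                  # proportion correct per item
    total = responses.sum(axis=1)

    citc = {}
    for item in responses.columns:
        rest = total - responses[item]     # total score without the item itself
        if responses[item].std(ddof=0) == 0:
            citc[item] = float("nan")      # constant item: CITC undefined
        else:
            citc[item] = np.corrcoef(responses[item], rest)[0, 1]
    citc = pd.Series(citc)

    flagged = (fv < fv_range[0]) | (fv > fv_range[1]) | (citc < citc_threshold)
    return pd.DataFrame({"FV": fv, "CITC": citc, "remove": flagged})

# Example with a small made-up 0/1 matrix (6 test takers, 4 items)
data = pd.DataFrame({
    "item1": [1, 1, 1, 0, 1, 0],
    "item2": [1, 1, 1, 1, 1, 1],   # FV = 1.0 -> flagged for removal
    "item3": [0, 1, 1, 0, 1, 0],
    "item4": [0, 0, 1, 0, 1, 0],
})
print(item_analysis(data))
```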
3.5.4 Evidence of reliability in the two tests
For the purpose of this study, the internal consistency (assessed with Cronbach’s
Alpha) was investigated. For the YN test, Cronbach’s Alpha was calculated for the
scores of the real signs and the nonsense signs (e.g., Mochida & Harrington, 2006;
Stubbe, 2012). The “no sign produced” category of the translation test was recoded into
“wrong” for this analysis.
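A minimal sketch of this computation is shown below, assuming a 0/1 response matrix rather than the SPSS output actually used; for dichotomous items, the Cronbach's Alpha formula reduces to KR-20 (cf. Section 4.3.1).

```python
import numpy as np

def cronbach_alpha(item_matrix):
    """Cronbach's Alpha for a persons-by-items 0/1 matrix (equivalent to KR-20
    for dichotomous items). Rows = test takers, columns = items. Sketch only."""
    x = np.asarray(item_matrix, dtype=float)
    k = x.shape[1]                              # number of items
    item_variances = x.var(axis=0, ddof=1)      # variance of each item
    total_variance = x.sum(axis=1).var(ddof=1)  # variance of the total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Example with a made-up 0/1 response matrix (5 test takers, 4 items)
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(responses), 3))
```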
In order to investigate inter-rater reliability of the two independent ratings,
Cohen’s Kappa was applied due to the nominal-level nature of the data (Cohen, 1960;
Gwet, 2014). Cohen’s Kappa is an “estimate of the coding agreement that occurred
beyond what would be expected to occur by chance” (Brown, 2016, p. 143).
Occurrences of “no sign produced” were treated as missing values in order to get a
conservative picture of the agreement between the two raters (recoding such zero
productions as “wrong” would inflate the agreement value). Landis and Koch (1977)
propose a benchmark scale to determine the strength of agreement of the Kappa statistics
(Table 3).
Table 3
Landis and Koch Kappa Benchmark Scale (from Gwet, 2014, p. 166)

Kappa Statistics | Strength of Agreement
< .0 | poor
.0 to .20 | slight
.21 to .40 | fair
.41 to .60 | moderate
.61 to .80 | substantial
.81 to 1.0 | almost perfect
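For illustration, the following sketch computes Cohen's Kappa from two raters' nominal decisions and maps the result onto the Landis and Koch labels in Table 3. It is a hand-rolled example with made-up ratings, not the procedure run in SPSS for this study.

```python
def cohen_kappa(ratings1, ratings2):
    """Cohen's Kappa for two raters' nominal ratings of the same items (sketch only)."""
    assert len(ratings1) == len(ratings2)
    n = len(ratings1)
    categories = set(ratings1) | set(ratings2)

    # Observed agreement
    p_o = sum(a == b for a, b in zip(ratings1, ratings2)) / n

    # Expected chance agreement, from the two raters' marginal proportions
    p_e = sum((ratings1.count(c) / n) * (ratings2.count(c) / n) for c in categories)

    return (p_o - p_e) / (1 - p_e)

def landis_koch(kappa):
    """Verbal label for a Kappa value, following Table 3 (Landis & Koch, 1977)."""
    if kappa < 0.0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "almost perfect")]:
        if kappa <= upper:
            return label
    return "almost perfect"

# Example with made-up right (1) / wrong (0) decisions for ten items
r1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
r2 = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
k = cohen_kappa(r1, r2)
print(round(k, 3), landis_koch(k))
```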
Also investigated was rater severity (or strictness; e.g., McNamara, 1996). To
investigate whether the two raters differed in their average rating behavior, their ratings
across the 20 test takers were compared. For this purpose, a paired-samples t-test was
applied (e.g., Pallant, 2016).
3.5.5 Evidence of validity in the two tests
There are different kinds of validity. For the current study, content and face
validity were investigated.
Content validity was investigated based on the fact that the items were sampled
from a pool of pre-existing DSGS teaching materials. This will be elaborated in the
discussion (Section 5.3.1). To investigate face validity, the test takers’ feedback was
evaluated. The feedback will be presented mostly descriptively, but also set in relation
to their test scores.
3.5.6 Variables contributing to the performance of the test takers in the
Yes/No Test and Translation Test
For the YN test, the researcher followed Stubbe's (2015) proposal to adjust the test scores by calculating the true score, i.e., the hit rate minus the false alarm rate.
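A minimal sketch of this adjustment, with hypothetical counts rather than the study data, is given below; note that it returns the difference of two proportions, and any rescaling of that value to the 98-item score metric reported later is not shown here.

```python
def adjusted_yn_score(yes_on_real, n_real, yes_on_nonsense, n_nonsense):
    """True score for a Yes/No test: hit rate minus false alarm rate (Stubbe, 2015).

    yes_on_real     -- number of real signs the test taker claimed to know
    n_real          -- total number of real signs in the test
    yes_on_nonsense -- number of nonsense signs the test taker claimed to know
    n_nonsense      -- total number of nonsense signs in the test
    """
    hit_rate = yes_on_real / n_real
    false_alarm_rate = yes_on_nonsense / n_nonsense
    return hit_rate - false_alarm_rate

# Example: 70 of 98 real signs checked "yes", 5 of 25 nonsense signs checked "yes"
print(round(adjusted_yn_score(70, 98, 5, 25), 3))   # 0.714 - 0.200 = 0.514
```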
Due to the absence of standardized vocabulary lists for DSGS comparable to those used for validated vocabulary tests for English (e.g., the Vocabulary Levels Test; Nation, 2001) that could serve to establish concurrent validity (e.g., Davies et al., 1999), external variables obtained from the background questionnaire were identified as potentially contributing to the test takers’ performance. For this reason, the YN test
scores and the translation test scores (separately for Rater 1 and Rater 2) were set in
relation to three variables from the questionnaire. Depending on the level of
measurement, a Pearson product-moment correlation or a Spearman rank order
correlation was applied. If a correlation was statistically significant, its strength as well
as the shared variance was evaluated (Pallant, 2016). The three variables are the
following:
(1) DSGS self-assessment scores: The receptive (YN test) and productive (translation test) self-assessment scores were measured on an ordinal (10-point) scale. Here, a Spearman rank-order correlation was applied.
(2) DSGS courses attended: The test takers’ responses were coded into seven
different categories, ranging from one to seven courses attended. Here, too,
a Spearman rank-order correlation coefficient was computed.
(3) DSGS learning contexts: The test takers’ responses were coded into two
variables, as mentioned before, “DSGS courses only” and “DSGS courses
and other contexts (work, training, friends)” (nominal data). The latter
category implies more opportunities for DSGS learning. To investigate
whether this variable contributed to the performance of the test takers, a
Pearson product-moment correlation was calculated (Pallant, 2016).
3.5.7 Comparison of the performances of the test takers on both tests
To investigate whether there was a relationship between the self-reported
knowledge as present in the YN test and the verified scores of the translation test
(interval data), a correlation was computed using a Pearson product-moment correlation.
The goal here was to see to what degree the self-report of the YN test correlated with
the translation test scores.
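The following sketch illustrates, with made-up data, how the two correlation types described in Sections 3.5.6 and 3.5.7 and the shared variance (r² or rs²) can be computed; scipy is an assumption of this sketch, as the actual analyses were run in SPSS.

```python
from scipy import stats

# Made-up scores for illustration (not the study data)
yn_scores          = [30, 45, 50, 62, 70, 78, 85, 90]
translation_scores = [22, 40, 48, 55, 66, 70, 80, 84]
courses_attended   = [1, 2, 3, 3, 4, 5, 6, 7]        # ordinal-level variable

# Interval-level data: Pearson product-moment correlation (Section 3.5.7)
r, p = stats.pearsonr(yn_scores, translation_scores)
print("Pearson r =", round(r, 3), "p =", round(p, 3),
      "shared variance =", round(r ** 2, 3))

# Ordinal-level data: Spearman rank-order correlation (Section 3.5.6)
rs, p_s = stats.spearmanr(translation_scores, courses_attended)
print("Spearman rs =", round(rs, 3), "p =", round(p_s, 3),
      "shared variance =", round(rs ** 2, 3))
```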
3.5.8 Nonsense signs of the Yes/No test
Central to the evaluation of the YN test was also to check how the test takers
performed on the nonsense signs. Here, only the hit rate and the false alarm rate will be
reported descriptively.
4. Results
This section presents the results of the main study described in Section 3.4 (recall that the results of the pilot study were presented in Section 3.3). In Section 4.1, the descriptive statistics are reported for both tests; Section 4.2 introduces the results of the item analysis; Section 4.3 presents evidence of reliability; Section 4.4 discusses validity issues of the two tests; Sections 4.5 and 4.6 discuss the relationship between external variables and the test scores; Section 4.7 reports on the relationship between the YN test scores and the translation test scores; and, finally, Section 4.8 presents the results for the nonsense signs in the YN test.
4.1 Descriptive Statistics of the Test Scores
4.1.1 Descriptive statistics of the YN Test scores
The maximum possible score of the YN test is 98. The difference (range: 81)
between the minimum and the maximum of the actual test scores is quite large
(minimum: 17; maximum: 98; Appendix N1), which can be explained by the
heterogeneous nature of the participants. Since the test scores represent interval-level
data, the three measures mean, median, and mode all present meaningful information.
The mean (M = 64.82), the average score of all test takers, and the median (Mdn = 66.4),
the middle-ranked score of the test takers, are close to each other. The mode (Mo = 90)
denotes the most frequently occurring score among the test takers (Green, 2013), which
is quite high here. The standard deviation (SD = 24.15) describes how the test takers’
scores vary from the mean and is used to indicate the distribution of the scores (Alderson
et al., 1995; Green, 2013). The SD, too, is rather high (e.g., Field, 2009). (See also
Appendix N2 for the histogram and Appendix N3 for the individual test scores of the test takers).
4.1.2 Descriptive statistics of the Translation Test scores
The descriptive statistics of the DSGS vocabulary translation test (maximum
possible score: 98) are presented here separately for Rater 1 and 2 (Appendix N4). For
Rater 1, the difference (range: 82) between the minimum and the maximum is quite large
(minimum: 12; maximum: 94), which, as in the case of the YN test, can be explained
by the heterogeneous nature of the participants. The mean (M = 53) and the median
(Mdn = 51) are close to each other. There are four different modes; the lowest value is
44, which is not very high. The standard deviation (SD = 22.96) is high. (See also
Appendix N5 for the Histograms.)
For Rater 2, the difference (range: 76) between the minimum and the maximum
score is less pronounced than for Rater 1 (minimum: 6; maximum: 82). As for Rater 1,
the mean (M = 47.70) and the median (Mdn = 49.50) are close to each other. The lowest
value of the mode is 39 (Mo = 39), which is not very high, but here, too, there are four
modes. The standard deviation (SD = 20.73) is slightly lower than for Rater 1. Looking
at the distribution of scores, it can be seen that Rater 1 assigned higher scores than Rater
2. This will be discussed in more detail below.
4.2 Item Analysis of the Test Scores
In the following subsections, the results of the item analysis will be presented
separately for the YN test and the translation test.
4.2.1 Item analysis of the YN Test scores
For the item analysis, only the actual signs (i.e., not the nonsense signs) were
evaluated. From the test scores of the 19 test takers, the data of one additional test taker were removed automatically by SPSS owing to one missing value that was due to technical problems of test delivery (see Section 3.1.2) (n = 18). The FV of the items ranged from .22 to 1.00, indicating a broad range of item difficulty. Four items were checked as “yes” by all test takers. These items were removed from the test, as were items with an FV of above .90, resulting in 23 items being removed.
As for the CITC, 13 items were not within the defined range (see Section 3.5.3); 10 of them had already been removed because their FV was too high. This gave a total of 26 items to be omitted. The results are presented in Appendix O1.
4.2.2 Item analysis of the Translation Test scores
For the item analysis of the translation test scores, separately for Rater 1 and 2,
“no sign produced” occurrences were recoded into “wrong”.
The FV of the items evaluated by Rater 1 ranged from .15 to 1.00. Applying the same criteria as above (.20 to .90), four items called for removal, two with an FV of .15 and two with an FV of 1.00. As for the CITC threshold of .30, 16 items would need to be removed; four of these had also been flagged by the FV criterion. In total, 16 items were candidates to be removed from the test (Appendix O2).
The facility value of the items evaluated by Rater 2 ranged from .00 to .90. Applying the same criteria as above, eight items would need to be removed, all with FVs between .00 and .15. Looking at the CITC values and the threshold of .30, 24 items would need to be removed; six of these had also been flagged by the FV criterion. In total, 26 items were identified which did not meet the FV and/or CITC criterion (Appendix O2).
The results of the item analysis of the YN test scores and the translation test
scores will be discussed in Section 5.1.
4.3 Evidence of Reliability in the Two Tests
For this study, the internal consistency of the two tests and the inter-rater
reliability of the translation test were investigated.
4.3.1 Internal consistency of the two tests
Since the data under consideration are of binary nature, the Kuder-Richardson
20 formula was applied (KR-20; Bühner, 2006; Lienert & Raatz, 1998). The preset
Alpha model for calculating Cronbach’s Alpha was applied using the KR-20 formula
with SPSS (Bühner, 2006; Table 4). One property of Cronbach’s Alpha is that it
increases with the number of items (Bortz & Döring, 2005).
Table 4
Cronbach’s Alpha in the YN Test and the Translation Test Scores

Test | Cronbach’s Alpha
YN Test: Real signs, 98 items (n = 18*) | .980
YN Test: Nonsense signs, 25 items (N = 19) | .760
Translation Test, Rater 1, 98 items (N = 20) | .970
Translation Test, Rater 2, 98 items (N = 17*) | .961
*Listwise deletion based on all variables in the procedure
For the YN test, the Cronbach’s Alpha value for the real signs can be considered high (α = .980) and, for the nonsense signs, acceptable (α = .760). In the data set of the real signs, one additional test taker was removed due to a missing value. The Cronbach’s Alpha for the translation test can also be considered high, with a small difference between Rater 1 (α = .970) and Rater 2 (α = .961).
4.3.2 Inter-rater reliability of the Translation Test
Cohen’s Kappa was applied to investigate inter-rater reliability (Gwet, 2014). In
a first step, the researcher checked whether the raters were consistent in marking zero
productions of signs as such. There were 54 occurrences of “no sign produced” (out of
1,960: 98 items
$
20 test takers) across 14 test takers that were not marked as such by
both Rater 1 and 2. In a first step, the researcher reanalyzed the video recordings of these
54 occurrences. An exchange between the two raters led to the following solution: The
researcher and Rater 2 had understood “no sign produced” to mean something which
occurred if the test taker did not raise his or her hands at all, while Rater 1 also checked
this category when (1) a test taker attempted to produce something and stopped mid-
way in the air or (2) produced some form of gesture. It was agreed to categorize these
two latter kinds of production as “wrong” (instead of “no sign produced”). There were
no occurrences of items not rated in the data.
Of the 1,960 ratings, 20.6% (n = 403) were missing values (i.e., “no sign
produced”, see Section 3.5.4 for the decision to recode these instances as missing
values), the remaining 79.4% (n = 1,557) were either “right” or “wrong”. With two
raters and two possible decisions, there are four options: (1) cases for which both raters
agreed on “right”, (2) cases for which both raters agreed on “wrong”, (3) cases for which
Rater 1 assigned “right” and Rater 2, “wrong”, and (4) cases for which Rater 2 assigned
“right” and Rater 1, “wrong”. This is visualized in a four-cell contingency table (Table
5).
Table 5
Agreement and Disagreement of the Two Raters on the Translation Test

Rater 1 \ Rater 2 | 0 (wrong) | 1 (right) | Total
0 (wrong) | 406 | 79 | 485
1 (right) | 197 | 875 | 1,072
Total | 603 | 954 | 1,557
Brown (2016) recommends reporting the agreement coefficient (percentage of agreed ratings) along with Cohen’s Kappa (κ). The agreement coefficient was .823, which means that the two raters agreed in 82.3% of all cases as to whether a sign was produced correctly or incorrectly and, consequently, had 17.7% disagreement. The Kappa coefficient ranges from .00 to +1.00 (in extreme cases, it can also be negative, namely when agreement is smaller than expected by chance). The agreement between the two raters on the translation test was κ = .613, p < .001, which is considered “substantial” according to Landis and Koch (1977).
By recoding all missing data as “wrong”, as will be done in an authentic testing scenario in the future, both the agreement coefficient and the Kappa value increase, namely to .86 (86% agreement between coders) and κ = .719, p < .001, respectively. The Kappa value is still within the category of “substantial” strength of agreement between the raters (Landis & Koch, 1977).
To investigate the raters’ strictness, a paired-samples t-test was applied. Using
the raw scores as the basis for the comparison, there was a statistically significant
difference between the ratings of Rater 1 (M = 53.6, SD = 22.96) and Rater 2 (M = 47.7,
SD = 20.73), t(19) = 5.871, p <.001 (2-tailed). The mean difference was 5.9 with a 95%
confidence interval ranging from 3.79 to 8.0. The effect size was calculated using the
eta squared formula (Pallant, 2016):

η² = t² / (t² + N − 1)

This resulted in a value of .644, which qualifies as a “strong” effect according to Cohen
(1988, in Pallant, 2016, p. 253), indicating a large magnitude of the difference between
the raw scores of Rater 1 and Rater 2. More precisely, Rater 1 evaluated the test takers
less strictly than did Rater 2.
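As a minimal illustration of this calculation, the formula above can be applied directly to the reported t value and sample size; the third decimal differs slightly from the reported .644 because t is given rounded to three decimals.

```python
# Illustrative recomputation of the reported effect size for the paired-samples t-test.
def eta_squared(t, n):
    """Eta squared for a paired-samples t-test: t^2 / (t^2 + N - 1) (Pallant, 2016)."""
    return t ** 2 / (t ** 2 + n - 1)

print(round(eta_squared(5.871, 20), 3))   # 0.645, a "strong" effect per Cohen (1988)
```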
4.4 Feedback from Test Takers
The feedback from the test takers will be presented mainly descriptively but will
also be set in relation to the test takers’ scores. Test takers received the following set of
four statements and two questions relating to the YN test and the translation test:
(1) The wording of the test instruction is clear.
(2) The wording of what the test intends to measure is clear.
(3) The test matches my DSGS skills well.
(4) The length of the test is appropriate.
(5) How do you judge the potential of this test to evaluate your own DSGS
vocabulary learning?
(6) Which of the two tests do you prefer?
Statements (1) to (4) were rated on a four-point Likert scale (“strongly disagree”,
“disagree”, “agree”, “strongly agree”). Question (5) was answered with “negative”,
“rather negative”, “rather positive”, or “positive”. Question (6) provided the possible
responses “YN test”, “translation test”, “no preference”, and “don’t know”.
Additionally, test takers were given the possibility to leave a written comment. The
results are displayed in Appendix P.
Descriptively, the majority of the test takers agreed or strongly agreed that the
wordings as well as the intended goals of both tests were clear. For the third statement,
dealing with the question of whether the levels of the two tests matched the test takers’
DSGS skills, there was also more agreement than disagreement, something that will be
set in relation to the test scores (see below). As for the lengths of both tests (recall that
the YN test took approximately 20 to 25 minutes to be completed, the translation test,
10 minutes), most of the test takers agreed or strongly agreed that the length was
appropriate.
The majority of test takers judged the potential of either test to evaluate DSGS vocabulary learning as “positive” or “rather positive”. Asked about a preference, eight test takers had no preference, eight preferred the translation test, and four the YN test.
Table 6
Correlation of Statement 3 “The test matches my DSGS skills well” and Test Scores, by Test

                                      rs       p
YN Test (N = 19)                     .557    .003*
Translation Test, Rater 1 (N = 20)   .484    .031*
Translation Test, Rater 2 (N = 20)   .546    .013*

*statistically significant at the .05 level (2-tailed)
Of interest was also to see how the self-reported judgment as to the extent to
which the two tests matched the test takers’ own DSGS skills (Statement 3 above) can
be set in relation to the test scores (Table 6). A Spearman rank-correlation coefficient
was applied to answer this question.
The correlation between the test takers’ judgment of whether the test level
matched their own DSGS skills and the YN test scores can be considered strong (>.50),
indicating that the more test takers agreed that their DSGS skills matched the level of
the test, the higher their YN test scores were. The shared variance between these
variables was .31 or 31%. The correlation between the test takers’ judgment of whether
the translation test level matched their DSGS skills was moderate and strong (Rater 1
and Rater 2, respectively). This means that the more the test takers agreed that their
own DSGS skills matched the translation test level, the higher their scores were, both
for Rater 1 and Rater 2. The correlation between the test takers’ judgment and the ratings
of the translation test with Rater 1 had .234 or 23.4% shared variance, with Rater 2, .298
or 29.8%.
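The shared-variance figures in this section are simply the squared correlation coefficients. The following minimal sketch, using only the coefficients from Table 6, reproduces the percentages cited above; it is illustrative only.

```python
# Shared variance is the squared Spearman coefficient (r_s^2).
for label, rs in [("YN test", .557),
                  ("Translation test, Rater 1", .484),
                  ("Translation test, Rater 2", .546)]:
    print(f"{label}: r_s = {rs}, shared variance = {rs ** 2:.3f}")
# -> .310, .234, .298, i.e. the 31%, 23.4%, and 29.8% cited in the text
```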
4.5 Evidence Contributing to the Vocabulary Tests’ Validity: Performance of the
Test Takers on the Yes/No Test in Relation to External Variables
4.5.1 Test takers’ self-assessment in relation to YN Test scores
To investigate whether the self-attributed DSGS level (N = 19) contributed to
the YN test scores, a Spearman rank-order correlation was computed owing to the level
of measurement (self-assessment: ordinal data). For the self-assessment, a ten-point
scale was used, “1” representing the lowest and “10” the highest value. The test takers
assessed their receptive DSGS skills ranging from 1 to 8 (M = 4.30, SD = 2.36).
The Spearman rho value, rs = .241, N = 19, p = .319 (2-tailed), indicated no
statistically significant relationship between the YN test scores and the receptive DSGS
self-rating.
4.5.2 Test takers’ number of courses attended in relation to YN Test scores
To investigate whether the number of courses attended, as coded into seven
different categories (ordinal data), contributed to the YN test scores (n = 17), a
Spearman rank-order correlation coefficient was calculated.
The results yielded a statistically significant relationship between the number of
courses attended and the YN test scores (rs = .528, n = 17, p = .036, 2-tailed; see
Scatterplot Appendix Q1). In other words, the more courses the test takers had attended,
the higher their scores on the YN test were. The amount of shared variance was .278 or
28% between the two variables, which was not high. The strength of the correlation can
be considered as strong (>.50).
4.5.3 Test takers’ contexts of learning in relation to YN Test scores
In order to investigate whether the contexts in which test takers (N = 19) learned
DSGS (nominal data) contributed to the YN test scores, a Pearson product-moment
correlation was applied.
With a resulting value of r = .038, N = 19, p = .877 (2-tailed), there was no
statistically significant relationship between the context of learning and the scores of the
YN test.
4.6 Evidence Contributing to the Vocabulary Tests’ Validity: Performance of the
Test Takers on the Translation Test in Relation to External Variables
4.6.1 Test takers’ self-assessment in relation to Translation Test scores
In order to investigate whether the self-rating of the productive DSGS skills
contributed to the translation test scores as obtained from Rater 1 and 2, a Spearman
rank-order correlation coefficient was calculated (Table 7). The test takers assessed their
productive skills with values ranging between 1 and 7 (1 to 10 possible) (M = 4.10, SD
= 2.15).
Table 7
Correlation Translation Test Scores and Self-Assessment, by Rater (N = 20)

Rater      Spearman’s rho (rs)     p
Rater 1          .384            .095
Rater 2          .409            .074
Here, too, the correlation of translation test scores from Rater 1 and Rater 2 and
the self-assessment of the test takers’ DSGS skills was not statistically significant.
4.6.2 Test takers’ number of courses attended in relation to Translation Test
scores
To investigate if the number of courses attended can contribute to the translation
test scores, a Spearman rank-order correlation coefficient was calculated (Table 8).
Table 8
Correlation Translation Test Scores and Number of Courses Attended, by Rater (n = 17)

Rater      Spearman’s rho (rs)     p
Rater 1          .581            .014*
Rater 2          .601            .011*

*statistically significant at .05 level, 2-tailed.
Both with Rater 1 and Rater 2, there is a statistically significant relationship
between the variables “number of courses attended” and “translation test scores” (see
Scatterplots Appendix Q2 and Q3). This means that the more courses the test takers had
taken, the higher their scores were. The correlation was slightly higher with Rater 2 than
with Rater 1.
Both correlations can be considered strong (>.50). The shared variance of the
two variables with Rater 1 was .337 or 33.7%, and with Rater 2, .361 or 36.1%.
4.6.3 Test takers’ contexts of learning in relation to Translation Test scores
To investigate whether the contexts in which test takers learned DSGS
contributed to the translation test scores, a Pearson product-moment correlation was
computed (Table 9).
Table 9
Correlation Translation Test Scores and Contexts of Learning, by Rater (N = 20)

Rater      Pearson’s r      p
Rater 1       .135        .571
Rater 2       .151        .525
Similar to the YN test scores, there was no statistically significant correlation
between the translation test scores obtained from Rater 1 and 2 and the contexts in which
DSGS was learned.
4.7 Comparison of the Performances of the Test Takers on both Vocabulary Tests
To investigate whether there was a relationship between the YN test scores (self-assessment) and the scores of the translation test, a Pearson product-moment correlation was calculated. The mean scores and percentages of the scores
of the YN test and the translation test by rater are reported below (Table 10).
Table 10
Mean Scores and Percentages of YN Test and Translation Test (max. Score on Both Tests = 98)

                                      Score       %
YN Test (N = 19)                      64.82    66.14
Translation Test, Rater 1 (N = 20)    53.60    54.69
Translation Test, Rater 2 (N = 20)    47.70    48.67
The correlation between the YN test score and the translation test score is given
below in Table 11 (see also Scatterplots Appendix R1 and R2).
Table 11
Correlation YN Test Scores and Translation Test Scores, by Rater (N = 19)

Rater      Pearson’s r (r)      p
Rater 1         .811          .001*
Rater 2         .867          .001*

*statistically significant at .001 level, 2-tailed.
Both correlations were strong (>.50), i.e., the higher test takers scored on one of
the tests, the higher they scored on the other test. The shared variance of the two
variables was .657 or 65.7% for Rater 1 and .752 or 75.2% for Rater 2.
4.8 Nonsense Signs of the YN Test: False Alarm Rate
The frequency distribution of the nonsense signs was analyzed (Appendix S).
Seven test takers did not claim to know any of the nonsense signs (correct rejections);
the remaining test takers claimed to know between one and ten nonsense signs (false
alarms).
The hit rates of the real items (N = 98) and the nonsense signs (N = 25) are
displayed below. The false alarm rate of 9.26% (ranging from 0 to 10 signs) was quite
high (Table 12).
Table 12
Hits and False Alarms of the YN Test (N = 19)

                    Hits (max. Score = 98)    False Alarms (max. Score = 25)
M Scores                   64.82                        2.32
Rate (in %)                66.12%                       9.26%
Range of Scores            17-98                        0-10
SD                         24.15                        2.65
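For illustration, the rates in Table 12 follow directly from the mean scores and the respective numbers of items; the sketch below recomputes them. The small deviations from the reported percentages are due to rounding of the means.

```python
# Minimal recomputation of the hit and false alarm rates from the Table 12 means.
mean_hits, n_real_items = 64.82, 98
mean_false_alarms, n_nonsense_signs = 2.32, 25

print(f"hit rate:         {mean_hits / n_real_items:.2%}")              # 66.14%
print(f"false alarm rate: {mean_false_alarms / n_nonsense_signs:.2%}")  # 9.28%
```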
5. Discussion
This section discusses the results presented in Section 4 in relation to the research questions. Section 5.1 discusses the results of the item analysis (Research Question 1a); Section 5.2, evidence of reliability of the two tests (Research Question 1b); Section 5.3, validity issues (Research Question 1c); Section 5.4, the performance of the DSGS learners on the two tests (Research Question 2); and Section 5.5, the efficiency of the nonsense signs in the YN test (Research Question 3).
5.1 Item Analysis of the YN Test and Translation Test
By conducting an item analysis for the YN test and the translation test (see
Section 4.2), the researcher identified items that were either too easy or too difficult (i.e., with FV values outside the range of .20–.90) and/or that did not differentiate well between test takers with higher or lower scores (i.e., with CITC values below .30) (e.g., Bachman, 2004; Carr, 2011). The results of the item analysis
can be found in Appendix O1 and O2.
In a first step, the FV and CITC criteria were applied separately for Rater 1 and
Rater 2. This resulted in a different number of items that were candidates for removal:
16 for Rater 1 and 26 for Rater 2. As the translation test is still under development, the researcher decided to consider for removal only those items for which at least one of the criteria defined above was met by both raters. This resulted in a total of twelve items (Table 13).
Table 13
Items to be Removed from the Translation Test and YN Test

Items (DSGS gloss)    Translation        FV R1   CITC R1   FV R2   CITC R2   FV YN Test   CITC YN Test
P1 WIDERSPRUCH        ‘contradiction’    .15     .281      .15     .274      N/A          N/A
I7 SCHÜTZEN           ‘to protect’       .2      .266      .15     .12       N/A          N/A
I14 TELEFONIEREN*     ‘to call’          1.0     .0        .55     .277      1.0          .0
I25 SAMMELN           ‘to collect’       .55     .26       .45     .194      N/A          N/A
I27 BLAU*             ‘blue’             .9      .242      .75     .159      1            0
I28 FREUND*           ‘friend’           .9      .122      .8      -.095     .94          .254
I33 EI*               ‘egg’              .65     .05       .5      -.079     .56          -.004
I36 MONAT             ‘month’            .6      .22       .6      .207      N/A          N/A
I43 VERKAUF           ‘sale’             .35     .246      .15     -.143     N/A          N/A
I52 FRAGEN*           ‘to ask’           .85     .171      .85     .185      .94          .116
I61 FARBE*            ‘color’            .8      .13       .8      .171      .94          -.203
I84 ABEND*            ‘evening’          1.0     .0        .9      -.028     1.0          .0

*Items where at least one of the two criteria occurred across the translation test scores (Rater 1 and Rater 2) and the YN test scores
For only one item (Practice Item 1, WIDERSPRUCH (‘contradiction’)) were both the FV and the CITC values outside the defined ranges across both raters. The remaining items were either too easy (e.g., Rater 1: Item 14, TELEFONIEREN (‘to call’)) or too difficult (e.g., both raters: Item 7, SCHÜTZEN (‘to protect’)), or they did not discriminate well between test takers with different levels of achieved scores (CITC < .30).
In a second step, the previously identified 12 signs from the translation test were checked against the 26 YN test items that were candidates for removal. Given the
developmental stage of both tests and the fact that they should have the same set of
items, the researcher decided to remove only items that met at least one of the two
criteria in both the translation and the YN test scores. This resulted in a total number of
seven items, marked with an asterisk in Table 13. For future versions of both tests, the
order of the items will be changed according to the results of the level of difficulty
(balanced between Rater 1 and 2 in the translation task and checked with the FV of the
YN test scores), starting with easier items. One issue that came up during the item analysis was the different rating behavior of the two raters, i.e., different FV values for some items. For example, the item TELEFONIEREN (‘to call’)
exhibited an FV value of 1.0 for Rater 1 and an FV value of .55 for Rater 2. This is an
important first indication of a difference in rating behavior, though only on the item
level and descriptively.
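For readers less familiar with these two item statistics, the following generic sketch shows how facility values and corrected item-total correlations can be computed for a dichotomously scored item matrix. It is an illustration of the criteria stated above, not the software routine used in this study, and the function names and thresholds are chosen for this example only.

```python
# Generic item-analysis sketch for a 0/1 score matrix (rows = test takers, columns = items).
import numpy as np

def item_analysis(scores):
    scores = np.asarray(scores, dtype=float)
    fv = scores.mean(axis=0)  # facility value: proportion of correct responses per item
    citc = np.empty(scores.shape[1])
    for i in range(scores.shape[1]):
        rest = scores.sum(axis=1) - scores[:, i]          # total score without item i
        citc[i] = np.corrcoef(scores[:, i], rest)[0, 1]   # corrected item-total correlation
    # Items answered identically by everyone (FV of 0 or 1) have no variance, so their
    # CITC is undefined (NaN); such items are still caught by the FV criterion below.
    return fv, citc

def removal_candidates(fv, citc, fv_range=(0.20, 0.90), citc_min=0.30):
    """Flag items outside the FV range or below the CITC threshold (the criteria above)."""
    return [i for i, (f, c) in enumerate(zip(fv, citc))
            if not (fv_range[0] <= f <= fv_range[1]) or c < citc_min]
```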
5.2 Evidence of Reliability in the Two Tests
5.2.1 Measure of internal consistency: Cronbach’s Alpha
Concerning the internal consistency of the two tests, the reported Cronbach’s Alpha values for the real signs of the YN test (α = .980) and the translation test for Rater 1 and Rater 2 (α = .970 and α = .961, respectively) indicate a high internal consistency. While Cronbach’s Alpha values of .37 to .45 have been reported for pseudowords in YN tests for spoken languages (e.g., Mochida & Harrington, 2006, p. 85), the Cronbach’s Alpha of the nonsense signs in the YN test at hand had an acceptable value (α = .760), which might be due to the smaller number of items (25 nonsense signs).
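As a point of reference for the formula behind these values, the following minimal sketch shows the standard Cronbach’s Alpha computation for a dichotomously scored item matrix; the values reported above were obtained with statistical software, not with this snippet.

```python
# Minimal Cronbach's Alpha sketch for a 0/1 score matrix (rows = test takers, columns = items).
import numpy as np

def cronbach_alpha(scores):
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                               # number of items
    item_variances = scores.var(axis=0, ddof=1)       # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
```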
5.2.2 Inter-rater reliability: Cohen’s Kappa
A result of investigating inter-rater reliability for the scoring of the translation test was that the overall agreement between Rater 1 and Rater 2 was κ = .613, p < .001. Here, the researcher chose a more conservative approach to get a realistic picture of the agreement between the two raters by coding “no sign produced” instances as “missing values” (instead of as “wrong”). According to the strength-of-agreement benchmark
scale of Landis and Koch (1977), the agreement can be considered “substantial”. This
is acceptable for the purpose of this study; however, it is desirable to have a higher
agreement value in future studies. The results of the paired-samples t-test (Section 4.3.2)
confirmed what was already obvious from the item analysis (facility values of individual
items), i.e., that Rater 1 scored less strictly than Rater 2 on average. It was not possible
to determine the precise sources of the disagreement (and also not possible to include a
third rater), but possible explanations could be (1) the lack of a more intense rater
training (e.g., Fulcher, 2014; McNamara, 1996) and/or (2) insufficient clarity as to the
concept of correctness of a sign, which also touches on the ongoing discussion of what
acceptable variants are (Ebling et al., submitted). Both issues should be investigated
further in the future by applying a multi-faceted Rasch analysis (e.g., McNamara, 1996).
At first glance, a “correct” or “incorrect” decision on the single-sign level does
not seem to pose a great challenge, as judging the correctness of a signed utterance
would. However, as mentioned earlier, there is to date no consensus as to what an
acceptable phonetic variant of a sign is. This might be influenced by the fact that sign
languages are not standardized languages (Adam, 2015). The previously mentioned
study of Ebling and colleagues (submitted) on acceptable variants in DSGS will help to
shed more light on this topic. This will in turn also help to get a clearer definition of the
proposed criteria of correctness (Bachman & Palmer, 1996).
In the future, one would also need to decide whether a combined test score
should be reported if a certain minimum of agreement between two or more raters has
been reached, and how the scores will be communicated to the test takers. Intra-rater
reliability should also be investigated in the future.
5.3 Evidence of Validity in the Two Tests
5.3.1 Content validity
Regarding Research Question 1c, the researcher argues that content validity was
established through the way the items were sampled. The list of lexical DSGS signs,
compiled within the SMILE project, was sampled from well-established DSGS teaching
materials (Boyes Braem, 2004a, 2004b, 2005a, 2005b) that are used for beginning
learners in different DSGS learning contexts in German Switzerland, and to which
specific linguistic criteria were applied to reduce the number of items from around 3,800
to 98 (Boyes Braem et al., 2016). This provided a solid basis for developing the two
tests here. Even though the aspects of frequency and part of speech could not be
accounted for due to the lack of a corpus and sufficient research for DSGS, respectively,
the list used represents signs that beginning learners are exposed to in their DSGS
classes. However, the absence of a DSGS corpus poses a big problem in the long term,
when the items will eventually need to be modified or replaced. A next step would be
to have a larger group of experienced Deaf sign language instructors rate a larger set of items for level of difficulty, in order to enable future sampling of items from different levels of difficulty.
5.3.2 Face validity
Face validity involves, for example, holistic feedback from test takers or other
user groups (Alderson et al., 1995). The test takers’ feedback from this study can thus
be used as a source of face validity (e.g., Sato & Ikeda, 2015). In the present study, test
takers delivered very positive feedback on the two tests overall. Of special interest was
the question of whether the test takers also saw the level of the test as matching their
DSGS skills (which was also correlated with their test scores). Even though the feedback
from the test takers (Appendix P) was not very detailed, it supports the usefulness of
both tests, and the obtained data contributes to an overall validity of both vocabulary
tests. Further investigation of the test takers’ preference for one test over the other would be
of interest.
5.4 Test Performances on the Two Tests
5.4.1 External variables contributing to the test takers’ performance
As pointed out in the introduction, due to the lack of an existing DSGS
vocabulary test for adult learners that would allow for investigating concurrent validity,
external variables (self-assessment of DSGS skills, number of DSGS courses attended,
and DSGS learning contexts) were identified that might explain or contribute to the
overall test scores. Of these three variables, only the variable “number of courses
attended” exhibited a statistically significant correlation with the scores of both tests. A
possible explanation for the fact that the other two variables did not correlate with the
test scores is that a self-assessment is too difficult to undertake for the test takers due to
the lack of a reference. For example, “very good” DSGS skills might mean something
different for a beginning learner of DSGS than for a trained sign language interpreter.
The self-assessment rating might also be influenced by the level of self-criticism of the
learner. As for the learning context variable, it is one that is hard to quantify, i.e., what
exactly does using DSGS with friends or at work imply in terms of language exposure?
The number of courses is perhaps the most appropriate variable to correlate with the test
scores in that it is easy to quantify and also contains the notion of language exposure.
In addition, the applied statistics were only correlational in nature, which means it is uncertain whether the number of courses attended really functions as a predictor of the test scores. Nevertheless, the correlation between the number of courses attended and the test scores represents evidence that an external variable can be used as an argument
for concurrent validity. This method of approaching concurrent validity has previously
been applied in sign language testing by Haug (2011) and Mann (2006).
For future research, as long as there is no other test available, it would be possible
to include more items from the LEAP-Q questionnaire (Marian et al., 2007), which
contains more specific questions about language and makes quantification easier to
achieve.
5.4.2 Comparing test scores of both tests
The correlation of the translation test scores and the YN test scores showed promising results, with strong coefficients (Rater 1: r = .811; Rater 2: r = .867) and a shared variance of 65.7% and 75.2%, respectively, between the two variables. These
results are comparable to studies on English that correlated YN test scores with scores
of existing vocabulary tests, both with correlations (e.g., Anderson & Freebody, 1983;
Mochida & Harrington, 2006) and by computing the shared variance (Mochida &
Harrington, 2006). However, there are also studies for spoken language that provide a
different picture, i.e., weaker correlations (e.g., Cameron, 2002; Eyckmans, 2004).
5.5 Effectiveness of the Nonsense Signs
Based on the results, the false alarm rate at 9.26% (range: 0-10) was somewhere
in-between what has been reported for spoken language YN tests in the literature, from
a little over 4% (Stubbe, 2012) up to 25% (Eyckmans, 2004). In the future, it will be
interesting to investigate whether the false alarm rate is a function of over- or
underestimation with respect to the translation scores of the test takers.
The literature for spoken language YN tests provided a framework for
developing the YN test for DSGS, containing (1) the concept of pseudowords (e.g.,
Beeckmans et al., 2001), (2) suggestions for adjusting the YN raw scores by means of a
correction formula (e.g., Stubbe, 2015), and (3) a reference regarding the false alarm
rate (e.g., Stubbe, 2012). However, there is one crucial difference between nonsense
signs and pseudowords: In DSGS, the form and the meaning of a sign are composed of
both manual and non-manual components that are produced at the same time (Boyes
Braem, 1995), i.e., two different visual channels are used simultaneously to produce a
linguistic symbol. This use of simultaneity on different linguistic levels in sign
languages is quite different from how spoken languages are structured, especially at the
level of the isolated lexical item. Test takers in the pilot study reported that they applied
different strategies in cases where they did not understand a sign as a whole (i.e.,
information from all channels). It was only when the manual form was unknown to them
that they tried to retrieve meaning from the mouthing. Because of this, it is clear from
this study that more research is needed to determine whether the nonsense signs really
work the way they should for a YN test. One could pursue this question by letting L1
users of DSGS judge whether nonsense signs could potentially be DSGS signs under
the following conditions: (1) with no mouthing, (2) accompanied by a German
mouthing, and (3) accompanied by a mouthing from the original sign language (e.g., the
BSL sign KNOW and the English mouthing /know/). For future YN test uses, one could also experiment with stating explicitly in the instructions that some of the signs in the test are not actual DSGS signs.
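To make point (2) above more concrete, the sketch below applies one widely cited adjustment from the spoken-language YN literature, the correction for guessing (cf. Anderson & Freebody, 1983; Huibregtse et al., 2002), to the mean rates reported in Table 12. It is an illustration only; whether this particular formula is appropriate for nonsense signs in DSGS is precisely the open question raised here.

```python
# Illustrative only: the "correction for guessing" applied to the mean rates of this study.
# Other adjustment formulas exist (e.g., Huibregtse et al., 2002), and their suitability
# for DSGS nonsense signs remains to be investigated.
hit_rate, false_alarm_rate = 0.6612, 0.0926

cfg_score = (hit_rate - false_alarm_rate) / (1 - false_alarm_rate)
print(f"adjusted proportion of signs known: {cfg_score:.3f}")   # ~ .63
```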
6. Conclusion and Implications for Future Research
The goal of this project was to develop and evaluate two DSGS vocabulary size
tests framed as discrete, selective, and context-independent tests, one relying on a self-
report format (YN test) and one on a verifiable format (translation test), with both tests
to be used in the future in different DSGS learning contexts as a placement and/or
diagnostic instrument for beginning adult learners in German Switzerland. Developing
and evaluating a test of vocabulary knowledge for a language like DSGS that is under-
documented and under-resourced poses a number of methodological challenges, which
have been discussed in this dissertation. Despite these constraints, it was possible to
develop and evaluate the two tests.
The research questions laid out at the beginning of the dissertation were
answered and now serve as the basis for future research and application.
A limitation of the work reported in this dissertation was the sample size of the
main study, which at 20 was too small to allow for generalization of the findings. A
third rater would have been preferable to have more data at hand, and the rater training
was rather short. In addition, the effectiveness of the nonsense signs could not be
assessed exhaustively. Due to time constraints, it was also not possible to run a multi-
faceted Rasch analysis, which could have shed further light on the raters’ behavior.
The results of this study brought up further questions, which will be addressed
in the future. For example, which strategies do raters apply when they rate a sign? How
can disagreement between raters be solved? Another strand of research could focus on
the YN test, starting with developing and testing nonsense signs. Which of the suggested
formulas for adjusting the YN test scores for spoken language tests could be used? Or,
on more theoretical grounds: Is it possible to define something similar to word families
for sign languages? How could this be linked to DSGS vocabulary assessment? In the
future, how can new items for a DSGS vocabulary test be sampled from the existing
teaching materials, as long as there is no DSGS corpus available? And, relating to
validation: Can further evidence of the overall validity of both tests be collected?
In sum, this was the first study in the field of sign language assessment that
explicitly addressed vocabulary assessment for adult learners of sign language not only
in Switzerland but internationally. Studies from the field of spoken language assessment
provided a framework for the development of some aspects of the two tests, and were
supplemented by studies from sign language linguistics. The results of this study
complement studies from the larger field of spoken language assessment and will
contribute to future research in sign language testing and assessment and its application.
7. References
Adam, R. (2015). Standardization of sign languages. Sign Language Studies, 15(4),
432–445. https://doi.org/10.1353/sls.2015.0015
Alderson, C., Clapham, C., & Wall, D. (1995). Language test construction and
evaluation. Cambridge: Cambridge University Press.
Anderson, D., & Reilly, J. (2002). The MacArthur Communicative Development
Inventory: Normative data for American Sign Language. Journal of Deaf Studies
and Deaf Education, 7(2), 83–106. https://doi.org/10.1093/deafed/7.2.83
Anderson, R. C., & Freebody, P. (1983). Reading comprehension and the assessment
and acquisition of word knowledge. In B. Hudson (Ed.), Advances in
reading/language research: A research annual (pp. 231–256). Greenwich,
CT: JAI Press.
Arendsen, J. (2009). Seeing signs: On the appearance of manual movements in gestures.
Technische Universiteit Delft, Delft, The Netherlands.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford:
Oxford University Press.
Bachman, L. F. (2004). Statistical analysis for language assessment. Cambridge:
Cambridge University Press.
Bachman, L. F., & Palmer, A. (1996). Language testing in practice. Oxford: Oxford
University Press.
Battison, R. (1978). Lexical borrowing in American Sign Language. Silver Spring, MD:
Linstok Press.
Beeckmans, R., Eyckmans, J., Janssens, V., Dufranne, M., & Van der Velde, H. (2001).
Examining the Yes/No vocabulary test: Some methodological issues in theory
and practice. Language Testing, 18(3), 235–274.
https://doi.org/10.1177/026553220101800301
Beglar, D., & Nation, P. (2013). Assessing vocabulary. In A. J. Kunnan (Ed.), The
companion to language assessment (pp. 172–184). Hoboken, NJ: John Wiley &
Sons, Inc.
Bochner, J. H., Samar, V. J., Hauser, P. C., Garrison, W. M., Searls, J. M., & Sanders,
C. A. (2016). Validity of the American Sign Language Discrimination Test.
Language Testing, 33(4), 473–495. https://doi.org/10.1177/0265532215590849
Bortz, J., & Döring, N. (2005). Forschungsmethoden und Evaluation für Human- und
Sozialwissenschaftler [Research methods and evaluation for humanities and
social sciences] (3rd ed.). Heidelberg: Springer.
Boyes Braem, P. (1984). Studying Swiss German Sign Language dialects. In F. Loncke,
P. Boyes Braem, & Y. Lebrun (Eds.), Recent research on European sign
languages (pp. 93–103). Lisse, Netherlands: Swets & Zeitlinger.
Boyes Braem, P. (1995). Einführung in die Gebärdensprache und ihre Erforschung
[Introduction into sign language and sign language research] (Vol. 11).
Hamburg: Signum.
Boyes Braem, P. (2001). A multimedia bilingual database for the lexicon of Swiss
German Sign Language. Sign Language & Linguistics, 4(1/2), 133–143.
https://doi.org/10.1075/sll.4.12.10boy
Boyes Braem, P. (2004a). Gebärdensprachkurs Deutschschweiz, Stufe 1. Linguistischer
Kommentar [Sign language course DSGS, level 1. Linguistic comments].
Zürich: GS-Media/Schweizerischer Gehörlosenbund SGB.
Boyes Braem, P. (2004b). Gebärdensprachkurs Deutschschweiz, Stufe 2. Linguistischer
Kommentar [Sign language course DSGS, level 2. Linguistic comments].
Zürich: GS-Media/Schweizerischer Gehörlosenbund SGB.
Boyes Braem, P. (2005a). Gebärdensprachkurs Deutschschweiz, Stufe 3. Linguistischer
Kommentar [Sign language course DSGS, level 3. Linguistic comments].
Zürich: GS-Media/Schweizerischer Gehörlosenbund SGB.
Boyes Braem, P. (2005b). Gebärdensprachkurs Deutschschweiz, Stufe 4. Linguistischer
Kommentar [Sign language course DSGS, level 4. Linguistic comments].
Zürich: GS-Media/Schweizerischer Gehörlosenbund SGB.
Boyes Braem, P. (2012a). Overview of research on the signed languages of the deaf.
Lecture, University of Basel.
Boyes Braem, P. (2012b). Evolving methods for written representations of signed
languages of the deaf. In A. Ender, A. Leemann, & B. Waelchli (Eds.), Methods
in contemporary linguistics (pp. 411–438). Berlin: De Gruyter Mouton.
Boyes Braem, P., Ebling, S., Tissi, K., Sidler-Miserez, S., & Haug, T. (2016). Item
reduction process for SMILE project. Unpublished document, HfH.
Boyes Braem, P., Haug, T., & Shores, P. (2012). Gebärdenspracharbeit in der Schweiz:
Rückblick und Ausblick [Sign language research and application in Switzerland:
Retrospection and outlook]. Das Zeichen, 90, 58–74.
Boyes Braem, P., & Rathmann, C. (2010). Transmission of sign languages in Northern
Europe. In D. Brentari (Ed.), Cambridge language surveys: Sign languages (pp.
19–45). Cambridge: Cambridge University Press.
Brown, A. (1993). The role of test-taker feedback in the test development process: test-
takers’ reactions to a tape-mediated test of proficiency in spoken Japanese.
Language Testing, 10(3), 277–301.
https://doi.org/10.1177/026553229301000305
Brown, J. D. (2016). Statistics corner: Questions and answers about language testing
statistics. Tokyo: JALT Testing and Evaluation Special Interest Group.
Brown, D. H., & Abeywickrama, P. (2010). Language assessment: Principles and
classroom practice. White Plains, NY: Pearson Longman.
Brown, J. D., & Hudson, T. (2002). Criterion-referenced language testing. Cambridge:
Cambridge University Press.
Bühner, M. (2006). Einführung in die Test- und Fragebogenkonstruktion [Introduction
into the construction of tests and questionnaires] (2nd ed.). Munich: Pearson
Studium.
Caccamise, F., & Samar, V. (2009). Sign Language Proficiency Interview (SLPI):
Prenegotiation interrater reliability and rater validity. Contemporary Issues in
Communication Science and Disorders, 36, 36–47.
Cameron, L. (2002). Measuring vocabulary size in English as an additional language.
Language Teaching Research, 6(2), 145–173.
https://doi.org/10.1191/1362168802lr103oa
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to
second language teaching and testing. Applied Linguistics, 1(1), 1–47.
https://doi.org/10.1093/applin/I.1.1
Carr, N. T. (2011). Designing and analyzing language tests. Oxford: Oxford University
Press.
Caselli, N. K., Sehyr, Z. S., Cohen-Goldberg, A. M., & Emmorey, K. (2017). ASL-
LEX: A lexical database of American Sign Language. Behavior Research
Methods, 49(2), 784–801. https://doi.org/10.3758/s13428-016-0742-0
Chen Pichler, D. (2010). Sources of handshape errors in first-time signers of ASL. In G.
Mathur & D. J. Napoli (Eds.), Deaf around the world (pp. 96–121). Oxford:
Oxford University Press.
Chen Pichler, D. (2013, 23.03). L2 acquisition of sign language. Paper presented at the
First Symposium on Sign Language Acquisition, Lisbon, Portugal.
Cohen, J. (1960). Coefficient of agreement for nominal scales. Educational and
Psychological Measurement, 20(1), 37–46.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.).
Hillsdale, NJ: Erlbaum.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–159.
Council of Europe (2009). Common European Framework of Reference for languages:
learning, teaching, assessment. Cambridge; Strasbourg: Cambridge University
Press; Council of Europe.
Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., & McNamara, T. (1999).
Dictionary of language testing. Cambridge: Cambridge University Press.
Dörnyei, Z. (2007). Research methods in applied linguistics. Oxford; New York:
Oxford University Press.
Ebbinghaus, H., & Heßmann, J. (2000). Leben im Kommunikationskonflikt: Zur
Ungleichsprachigkeit Hörender und Gehörloser [Living in a conflict of
communication: Differences in the languages between deaf and hearing people].
In E. Hess-Lüttich & H. W. Schmitz (Eds.), Botschaften verstehen:
Kommunikationstheorie und Zeichenpraxis. Festschrift für Helmut Richter (pp.
47–66). Frankfurt am Main: Peter Lang.
Ebling, S. (2016). Automatic translation from German to synthesized Swiss German
Sign Language (Dissertation). Universität Zürich, Zürich. Retrieved from
http://www.cl.uzh.ch/dam/jcr:8c0f6d30-05dc-4e31-9324-0ed7ef74214b/
ebling_diss.pdf
Ebling, S., Camgöz, N. C., Boyes Braem, P., Tissi, K., Sidler-Miserez, S., Hatfield,
Bowden, R. (submitted). SMILE sign language recognition data set. 11th
Language Resources and Evaluation Conference (LREC 2018).
Efthimiou, E., Fotinea, S. E., Vogler, C., Hanke, T., Glauert, J., Bowden, R., … Segouat,
J. (2009). Sign language recognition, generation, and modelling: A research
effort with applications in deaf communication. In C. Stephanidis (Ed.),
Universal access in human-computer interaction. Addressing diversity, lecture
notes in computer science (Vol. 5614, pp. 21–30). Berlin: Springer.
Elder, C., Iwashita, N., & McNamara, T. (2002). Estimating the difficulty of oral
proficiency tasks: what does the test-taker have to offer? Language Testing,
19(4), 347–368. https://doi.org/10.1191/0265532202lt235oa
Emmorey, K. (2011). The neural systems underlying sign language. In M. Marschark
& P. E. Spencer (Eds.), The Oxford handbook of deaf studies, language, and
education (Vol. 1, 2nd ed.). Oxford: Oxford University Press.
Emmorey, K., & Herzig, M. (2003). Categorical versus gradient properties of classifier
constructions in ASL. In K. Emmorey (Ed.), Perspectives on classifier
constructions in sign languages (pp. 221–246). Mahwah, NJ: Erlbaum.
Enns, C., Haug, T., Herman, R., Hoffmeister, R. J., Mann, W., & Mcquarrie, L. (2016).
Exploring signed language assessment tools in Europe and North America. In
M. Marschark, V. Lampropoulou, & E. K. Skordilis (Eds.), Diversity in deaf
education (pp. 171–218). Oxford; New York: Oxford University Press.
Erlenkamp, S. (2001). Lexikalische Klassen und syntaktische Kategorien in der
Deutschen Gebärdensprache: Warum das Vorhandensein von Verben nicht
unbedingt Nomen erfordert [On lexical classes and syntactic categories in
German Sign Language: Why verbs do not require nouns]. In H. Leuninger &
K. Wempe (Eds.), Gebärdensprachlinguistik 2000 Theorie und Anwendung:
Vorträge vom Symposium Gebärdensprachforschung im Deutschsprachigen
Raum, Frankfurt a.M., 11.-13. Juni 1999 (Vol. 37, pp. 67–91). Hamburg:
Signum.
Eyckmans, J. (2004). Measuring receptive vocabulary size: reliability and validity of
the yes/no vocabulary test for French-speaking learners of Dutch. LOT, Utrecht.
Fehrmann, G., Huber, W., Jäger, L., Sieprath, H., & Werth, I. (1995a). Aufbau des
Aachener Tests zur Basiskompetenz in Deutscher Gebärdensprache (ATG)
[Design of the Aachen test of basic competence of German Sign Language
(ATG)]. Aachen: RWTH-Aachen, Germanistisches Institut & Neurologische
Klinik, Projekt DESIRE.
Fehrmann, G., Huber, W., Jäger, L., Sieprath, H., & Werth, I. (1995b). Linguistische
Konzeption des Aachener Tests zur Basiskompetenz in Deutscher
Gebärdensprache (ATG) Linguistic conception of the Aachen test of basic
competence of German Sign Language (ATG)]. Aachen: RWTH-Aachen,
Germanistisches Institut & Neurologische Klinik, Projekt DESIRE.
Fenlon, J., Schembri, A., Rentelis, R., Vinson, D., & Cormier, K. (2014). Using
conversational data to determine lexical frequency in British Sign Language:
The influence of text type. Lingua, 143, 187–202.
https://doi.org/10.1016/j.lingua.2014.02.003
Field, A. P. (2009). Discovering statistics using SPSS: (and sex, drugs and rock ’n’ roll) (3rd ed.). Los Angeles: SAGE Publications.
Fulcher, G. (2014). Testing second language speaking. London: Routledge.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced
resource book. London: Routledge.
Gardner, D., & Davies, M. (2014). A new Academic Vocabulary List. Applied
Linguistics, 35(3), 305–327. https://doi.org/10.1093/applin/amt015
Green, R. (2013). Statistical analyses for language testing. Basingstoke, UK: Palgrave
Macmillan.
Gwet, K. L. (2014). Handbook of inter-rater reliability. Gaithersburg, MD: Advanced
Analytics.
Hanke, T. (2014). Involving the crowd: Complementing corpus data in the process of
dictionary making. Paper presented at the Exploring new ways of harvesting and
generating sign language resources: Legal, technical, and crowd-sourcing issues,
Hamburg, Germany.
Hanke, T. (2016). Towards a visual sign language corpus linguistics. Presented at the
LREC 2016 Workshop on the Representation and Processing of Sign
Languages: Corpus Mining, Portorož, Slovenia.
Hanke, T. (2017). Wörterbuch ohne Wörter? Zum Entstehen eines Wörterbuches der
Deutschen Gebärdensprache [A dictionary without words? On the creation of a
dictionary for German Sign Language]. In Heidelberger Akademie der
Wissenschaften (Ed.), Jahrbuch der Heidelberger Akademie der Wissenschaften
für 2016 (pp. 84–88). Heidelberg: Universitätsverlag Winter.
Harrington, M., & Carey, M. (2009). The on-line Yes/No test as a placement tool.
System, 37(4), 614–626. https://doi.org/10.1016/j.system.2009.09.006
Haug, T. (2005). Review of sign language assessment instruments. Sign Language &
Linguistics, 8(1/2), 61–98. https://doi.org/10.1075/sll.8.1.04hau
Haug, T. (2011). Adaptation and evaluation of a German Sign Language test - A
computer-based receptive skills test for deaf children ages 4–8 years old.
Hamburg: Hamburg University Press. Retrieved from http://hup.sub.uni-
hamburg.de/purl/HamburgUP_Haug_Adaption
Haug, T. (2015). Use of information and communication technologies in sign language
test development: Results of an international survey. Deafness & Education
International, 17(1), 33–48.
https://doi.org/10.1179/1557069X14Y.0000000041
Haug, T., & Mann, W. (2008). Adapting tests of sign language assessment for other sign
languages: A review of linguistic, cultural, and psychometric problems. Journal
of Deaf Studies and Deaf Education, 13(1), 138–147.
https://doi.org/10.1093/deafed/enm027
Hauser, P., Supalla, T., & Bavelier, D. (2008). American Sign Language sentence
reproduction test: Development and implications. In R. Müller de Quadros (Ed.),
Sign languages: Spinning and unraveling the past, present and future (pp. 160–
172). Florianopolis, Brazil: Editora Arara Azul.
Hermans, D., Knoors, H., & Verhoeven, L. (2009). Assessment of sign language
development: The case of deaf children in the Netherlands. Journal of Deaf
Studies and Deaf Education, 15(2), 107–119.
https://doi.org/10.1093/deafed/enp030
Huibregtse, I., Admiraal, W., & Meara, P. (2002). Scores on a yes-no vocabulary test:
Correction for guessing and response style. Language Testing, 19(3), 227–245.
https://doi.org/10.1191/0265532202lt229oa
Johnston, T., & Schembri, A. (1999). On defining lexeme in a signed language. Sign
Language & Linguistics, 2(2), 115–185. https://doi.org/10.1075/sll.2.2.03joh
Johnston, T., & Schembri, A. (2006). Issues in the creation of a digital archive of a
signed language. In L. Barwick & N. Thieberger (Eds.), Sustainable data from
digital fieldwork (pp. 7-16). Sydney: Sydney University Press.
Johnston, T., & Schembri, A. (2007). Australian Sign Language: An introduction to
sign language linguistics. Cambridge: Cambridge University Press.
Klima, E., & Bellugi, U. (1979). Signs of language. Cambridge, MA: Harvard
University Press.
Kline, P. (2000). Handbook of psychological testing (2nd ed.). London: Routledge.
Koizumi, R., & In’nami, Y. (2013). Vocabulary knowledge and speaking proficiency
among second language learners from novice to intermediate levels. Journal of
Language Teaching and Research, 4(5), 900–913.
https://doi.org/10.4304/jltr.4.5.900-913
König, S., Konrad, R., & Langer, G. (2012). Lexikon: Der Wortschatz der DGS
[Lexicon: The vocabulary of German Sign Language]. In H. Eichmann, M.
Hansen, & J. Heßmann (Eds.), Handbuch Deutsche Gebärdensprache:
Sprachwissenschaftliche und anwendungsbezogene Perspektiven (Vol. 50, pp.
111–164). Seedorf: Signum.
Kremmel, B., & Schmitt, N. (2016). Interpreting vocabulary test scores: What do
various item formats tell us about learners’ ability to employ words? Language
Assessment Quarterly, 13(4), 377–392.
https://doi.org/10.1080/15434303.2016.1237516
Landis, J. R., & Koch, G. (1977). The measurement of observer agreement for categorical
data. Biometrics, 33, 159–174.
Langer, G. (2005). Bilderzeugungstechniken in der Deutschen Gebärdensprache
[Image-producing techniques in German Sign Language]. Das Zeichen, 70, 254–
270.
Laufer, B., Elder, C., Hill, K., & Congdon, P. (2004). Size and strength: Do we need
both to measure vocabulary knowledge? Language Testing, 21(2), 202–226.
https://doi.org/10.1191/0265532204lt277oa
Laufer, B., & Goldstein, Z. (2004). Testing vocabulary knowledge: Size, strength, and
computer adaptiveness. Language Learning, 54(3), 399–436.
Laufer, B., & Nation, P. (1999). A vocabulary-size test of controlled productive ability.
Language Testing, 16(1), 33–51. https://doi.org/10.1177/026553229901600103
Leeson, L., & Grehan, C. (2009). A Common European Framework for sign language
curricula? D-Sign(ing) a curriculum aligned to the Common European
Framework of Reference. In M. Mertzani (Ed.), Sign language teaching and
learning - Papers from the 1st symposium in applied sign linguistics (Vol. 1, pp.
21–33). Bristol: Centre for Deaf Studies, University of Bristol.
Leeson, L., Haug, T., Rathmann, C., Van den Bogaerde, B., & Sheneman, N.
(submitted). The implementation of the Common European Framework of
Reference (CEFR) for signed languages in higher education – Results of an
international survey. Manuscript.
Lienert, G., & Raatz, U. (1998). Testaufbau und Testanalyse [Test construction and test
analysis] (6th ed.). Weinheim, Germany: Beltz Psychologie Verlags Union.
Lillo-Martin, D. (1999). Modality effects and modularity in language acquisition: The
acquisition of American Sign Language. In W. Ritchie & T. Bhatia (Eds.),
Handbook of child language acquisition (pp. 531–568). San Diego: Academic
Press.
Mann, W. (2006). Examining German deaf children’s understanding of referential
distinction in written German and German Sign Language (DGS). Unpublished
doctoral dissertation. San Francisco State University & University of California,
Berkeley, San Francisco, CA.
Mann, W., & Marshall, C. (2012). Investigating deaf children’s vocabulary knowledge
in British Sign Language: Vocabulary development in sign language. Language
Learning, 62(4), 1024–1051. https://doi.org/10.1111/j.1467-9922.2011.00670.x
Mann, W., Marshall, C. R., Mason, K., & Morgan, G. (2010). The acquisition of sign
language: The impact of phonetic complexity on phonology. Language Learning
and Development, 6(1), 60–86. https://doi.org/10.1080/15475440903245951
Mann, W., Roy, P., & Morgan, G. (2015). Adaptation of a vocabulary test from British
Sign Language to American Sign Language. Language Testing, 33(1), 3–22.
https://doi.org/10.1177/0265532215575627
Marian, V., Blumenfeld, H. K., & Kaushanskaya, M. (2007). The language experience
and proficiency questionnaire (LEAP-Q): Assessing language profiles in
bilinguals and multilinguals. Journal of Speech, Language, and Hearing
Research, 50, 940–967.
Mayberry, R. I., Hall, M. L., & Zvaigzne, M. (2013). Subjective frequency ratings for
432 ASL signs. Behavior Research Methods. https://doi.org/10.3758/s13428-
013-0370-x
Mayberry, R. I., & Squires, B. (2006). Sign language acquisition. In E. Lieven (Ed.),
Encyclopedia of language and linguistics (2nd ed., pp. 291–296). Oxford:
Elsevier.
McKee, R. (2015). New Zealand Sign Language: A learner’s grammar (1st ed.). Bridget
Williams Books. Retrieved from http://www.bwb.co.nz/books/new-zealand-
sign-language
McNamara, T. F. (1996). Measuring second language performance. London; New
York: Longman.
Meara, P. (1992). EFL vocabulary test. Swansea, UK: Centre for Applied Language
Studies.
Meara, P. (1996). The dimension of lexical competence. In G. Brown, K. Malmkjaer,
& J. Williams (Eds.), Performance and competence in second language
acquisition (pp. 35–53). Cambridge: Cambridge University Press.
Meara, P. (2005). Designing vocabulary tests for English, Spanish and other languages.
In C. S. Butler, M. de los Á. Gómez González, & S. M. Doval-Suárez (Eds.),
The dynamic of language use (Vol. 140, pp. 271–285). Amsterdam: John
Benjamins.
Meara, P., & Buxton, B. (1987). An alternative to multiple choice vocabulary tests.
Language Testing, 4, 142–154. https://doi.org/10.1177/026553228700400202
Meara, P., & Fitzpatrick, T. (2000). Lex30: An improved method of assessing
productive vocabulary in an L2. System, 28, 19–30.
https://doi.org/10.1016/S0346-251X(99)00058-5
Meara, P., & Jones, G. (1988). Vocabulary size as a placement indicator. In P. Grunwell
(Ed.), Applied linguistics in society (pp. 80–87). London: CILT.
Mochida, A., & Harrington, M. (2006). The Yes/No test as a measure of receptive
vocabulary knowledge. Language Testing, 23(1), 73–98.
https://doi.org/10.1191/0265532206lt321oa
Morgan, G., & Woll, B. (Eds.). (2002). Directions in sign language acquisition – Trends
in language acquisition research. Amsterdam: John Benjamins.
Mousavi, S. A. (2012). An encyclopedic dictionary of language testing (5th ed.). Tehran:
Rahnama Press.
Nation, P. (2001). Learning vocabulary in another language. Cambridge; New York:
Cambridge University Press.
Nation, P. (2004). A study of the most frequent word families in the British National
Corpus. In P. Bogaards & B. Laufer (Eds.), Vocabulary in a second language
(pp. 3–13). Amsterdam: John Benjamins.
Nation, P. (2016). Making and using word lists for language learning and testing.
Amsterdam: John Benjamins.
Osterlind, S. J. (2001). Constructing test items: Multiple-choice, constructed-response,
performance, and other formats (2nd ed.). Boston: Kluwer Academic Publishers.
Palfreyman, N., Sagara, K., & Zeshan, U. (2015). Methods in carrying out language
typological research. In E. Orfanidou, B. Woll, & G. Morgan (Eds.), Research
methods in sign language studies: A practical guide (pp. 173–192). Chichester,
UK: Wiley Blackwell.
Pallant, J. (2005). SPSS survival manual: A step by step guide to data analysis using
SPSS for Windows (Version 12). Crows Nest, Australia: Allen & Unwin.
Pallant, J. (2016). SPSS survival manual: A step by step guide to data analysis using
IBM SPSS. Berkshire, UK: Open University Press.
Pellicer-Sánchez, A., & Schmitt, N. (2012). Scoring Yes–No vocabulary tests: Reaction
time vs. nonword approaches. Language Testing, 29(4), 489–509.
https://doi.org/10.1177/0265532212438053
Pfau, R., & Quer, J. (2010). Nonmanuals: Their grammatical and prosodic roles. In D.
Brentari (Ed.), Sign languages (pp. 381–403). Cambridge: Cambridge
University Press.
Pfau, R., Steinbach, M., & Woll, B. (2012). Sign language - An international handbook.
Berlin; Boston: DeGruyter.
Popham, W. J. (2000). Modern educational measurement: Practical guidelines for
educational leaders (3rd ed.). Boston: Allyn and Bacon.
Purpura, J. E. (2008). Assessing communicative language ability: Models and their
components. In E. Shohamy & N. H. Hornberger (Eds.), Language testing and
assessment: Vol. 7. Encyclopedia of language and education (2nd ed., pp. 53–
68). New York: Springer.
Read, J. (1993). The development of a new measure of L2 vocabulary knowledge.
Language Testing, 10(3), 355–371.
https://doi.org/10.1177/026553229301000308
Read, J. A. S. (2000). Assessing vocabulary. Cambridge; New York: Cambridge
University Press.
Read, J. (2007). Second language vocabulary assessment: Current practices and new
directions. International Journal of English Studies, 7(2), 105–125.
Read, J. A. S., & Chapelle, C. A. (2001). A framework for second language vocabulary
assessment. Language Testing, 18(1), 1–32.
https://doi.org/10.1177/026553220101800101
Rosen, R. S. (2004). Beginning L2 production errors in ASL lexical phonology: A
cognitive phonology model. Sign Language & Linguistics, 7(1), 31–61.
https://doi.org/10.1075/sll.7.1.04beg
Rust, J., & Golombok, S. (2000). Modern psychometrics – The science of psychological
assessment (2nd ed.). London; New York: Routledge.
Ryan, D. E. (2014). Consider the candidate: Using test-taker feedback to enhance
quality and validity in language testing. E-TEALS: An e-Journal of Teacher
Education and Applied Language Studies, 5, 1–23.
Sato, T., & Ikeda, N. (2015). Test-taker perception of what test items measure: a
potential impact of face validity on student learning. Language Testing in Asia,
5(1). https://doi.org/10.1186/s40468-015-0019-z
Schmitt, N. (2010). Researching vocabulary. London: Palgrave Macmillan. Retrieved
from http://link.springer.com/10.1057/9780230293977
Schmitt, N. (2014). Size and depth of vocabulary knowledge: What the research shows.
Language Learning, 64(4), 913–951. https://doi.org/10.1111/lang.12077
Schmitt, N., Jiang, X., & Grabe, W. (2011). The percentage of words known in a text
and reading comprehension. The Modern Language Journal, 95(1), 26–43.
https://doi.org/10.1111/j.1540-4781.2011.01146.x
Schmitt, N., Schmitt, D., & Clapham, C. (2001). Developing and exploring the
behaviour of two new versions of the Vocabulary Levels Test. Language
Testing, 18(1), 55–88. https://doi.org/10.1177/026553220101800103
Schmitt, N., & Zimmermann, C. B. (2002). Derivative word forms: What do learners
know? TESOL Quarterly, 36(2), 145–171.
Sevigny, P., & Ramonda, K. (2013). Vocabulary: What should we test? In N. Sonda &
A. Krause (Eds.), JALT2012 Conference Proceedings (pp. 701–711). Tokyo:
JALT.
Sinclair, J. (2005). Corpus and text: Basic principles. In M. Wynne (Ed.), Developing
linguistic corpora: A guide to good practice (pp. 1–16). Oxford: Oxbow Books.
Stewart, J. (2012). A multiple-choice test of active vocabulary knowledge. Vocabulary
Learning and Instruction, 1(1), 53–59. https://doi.org/10.7820/vli.v01.1.stewart
Stokoe, W. C. (1960). Studies in linguistics - Sign language structure: An outline of the
visual communication systems of the American deaf. Buffalo, NY: University of
Buffalo.
Stricker, L. J., Wilder, G. Z., & Rock, D. A. (2004). Attitudes about the computer-based
test of English as a foreign language. Computers in Human Behavior, 20(1), 37–
54. https://doi.org/10.1016/S0747-5632(03)00046-3
Stubbe, R. (2012). Do pseudoword false alarm rates and overestimation rates in yes/no
vocabulary tests change with Japanese university students’ English ability
levels? Language Testing, 29(4), 471–488.
https://doi.org/10.1177/0265532211433033
Stubbe, R. (2015). Replacing translation tests with Yes/No tests. Vocabulary Learning
and Instruction, 4(2), 38–48. https://doi.org/10.7820/vli.v04.2.stubbe
Stubbe, R., & Stewart, J. (2012). Optimizing scoring formulas for yes/no vocabulary
tests with linear models. Shiken Research Bulletin, 16(2), 2–7.
Stubbe, R., Stewart, J., & Pritchard, T. (2010). Examining the effects of pseudowords
in yes/no vocabulary tests for low level learners. Language Education and
Research Center Journal, 5, 1–16.
Sutton-Spence, R., & Woll, B. (1999). The linguistics of British Sign Language: An
introduction. Cambridge: Cambridge University Press.
Taub, S., Galvan, D., Pinar, P., & Mather, S. (2008). Gestures and ASL L2 acquisition.
In R. Müller de Quadros (Ed.), Sign languages: Spinning and unraveling the
past, present and future (pp. 639–651). Florianopolis, Brazil: Editora Arara
Azul.
Thompson, R. L., Emmorey, K., & Kluender, R. (2009). Learning to look: The
acquisition of eye gaze agreement during the production of ASL verbs.
Bilingualism: Language and Cognition, 12(04), 393–409.
https://doi.org/10.1017/S1366728909990277
Thompson, R. L., Vinson, D. P., & Vigliocco, G. (2009). The link between form and
meaning in American Sign Language: Lexical processing effects. Journal of
Experimental Psychology: Learning, Memory, and Cognition, 35(2), 550–557.
https://doi.org/10.1037/a0014547
Valli, C., & Lucas, C. (1995). Linguistics of American Sign Language: An introduction
(2nd ed.). Washington, D.C: Gallaudet University Press.
Vinson, D. P., Cormier, K., Denmark, T., Schembri, A., & Vigliocco, G. (2008). The
British Sign Language (BSL) norms for age of acquisition, familiarity, and
iconicity. Behavior Research Methods, 40(4), 1079–1087.
https://doi.org/10.3758/BRM.40.4.1079
Weir, C. (2005). Language testing and validation. New York: Palgrave Macmillan.
Woll, B. (2013). Second language acquisition of sign language. In C. A. Chapelle (Ed.),
The encyclopedia of applied linguistics. Oxford: Blackwell Publishing Ltd.
Retrieved from http://doi.wiley.com/10.1002/9781405198431.wbeal1050
Woolfe, T., Herman, R., Roy, P., & Woll, B. (2010). Early vocabulary development in
deaf native signers: A British Sign Language adaptation of the communicative
development inventories. Journal of Child Psychology and Psychiatry, 51(3),
322–331. https://doi.org/10.1111/j.1469-7610.2009.02151.x
Yan, X., Thirakunkovit, S. P., Kauper, N. L., & Ginther, A. (2016). What do test-takers
say? Test-taker feedback as input for quality management of a local oral English
proficiency test. In J. Read (Ed.), Post-admission language assessment of
university students (Vol. 6, pp. 113–136). Cham: Springer International
Publishing. https://doi.org/10.1007/978-3-319-39192-2_6
Zeshan, U. (2004a). Hand, head, and face: Negative constructions in sign languages.
Linguistic Typology, 8, 1–58. https://doi.org/10.1515/lity.2004.003
Zeshan, U. (2004b). Interrogative constructions in sign languages: A cross-linguistic
perspective. Language, 80(1), 7–39. https://doi.org/10.1353/lan.2004.0050
Zeshan, U., & Perniss, P. (2008). Possessive and existential constructions: Introduction
and overview. In P. Perniss & U. Zeshan (Eds.), Possessive and existential
constructions in sign languages (pp. 2–31). Nijmegen, The Netherlands: Ishara
Press.