Quality & Quantity: International Journal of Methodology
DOI 10.1007/s11135-013-9894-5
A novel rater agreement methodology for language
transcriptions: evidence from a nonhuman speaker
Allison B. Kaufman · Erin N. Colbert-White · Robert Rosenthal
© Springer Science+Business Media Dordrecht 2013
Abstract The ability to measure agreement between two independent observers is vital to
any observational study. We use a unique situation, the calculation of inter-rater reliability
for transcriptions of a parrot’s speech, to present a novel method of dealing with inter-
rater reliability which we believe can be applied to situations in which speech from human
subjects may be difficult to transcribe. Challenges encountered included (1) a sparse original
agreement matrix which yielded an omnibus measure of inter-rater reliability, (2) “lopsided”
2×2 matrices (i.e. subsets) from the overall matrix and (3) categories used by the transcribers
which could not be pre-determined. Our novel approach involved calculating reliability on
two levels—that of the corpus and that of the above mentioned smaller subsets of data.
Specifically, the technique included the “reverse engineering” of categories, the use of a
“null” category when one rater observed a behavior and the other did not, and the use of
Fisher’s Exact Test to calculate r-equivalent for the smaller paired subset comparisons. We
hope this technique will be useful to those working in similar situations where speech may
be difficult to transcribe, such as with small children.
Keywords  Inter-rater reliability · Rater agreement · Fisher’s Exact Test · r-equivalent · Sparse agreement matrix · Speech transcription
Allison B. Kaufman is now in the Department of Ecology and Evolutionary Biology at The University of
Connecticut.
A. B. Kaufman
California State University, San Bernardino, 5500 University Parkway,
San Bernardino, CA 92407, USA
e-mail: akaufman@csusb.edu
E. N. Colbert-White
Department of Psychology, University of Puget Sound, 1500 N. Warner #1046, Tacoma, WA 98416, USA
e-mail: ecolbertwhite@pugetsound.edu
R. Rosenthal
Department of Psychology, University of California, Riverside,
900 University Ave., Riverside, CA 92521, USA
e-mail: robert.rosenthal@ucr.edu
Table 1  2×2 Agreement matrix showing 57 % rater agreement

              RATER 1
RATER 2      Yes   No
  Yes         57   21
  No          22    0

The associated r value in this example was .27, which is statistically significant in the opposite direction
1 Inter-rater reliability and speech transcriptions
By definition, inter-rater reliability quantifies the degree of match between two independent
observers witnessing the same event, thus providing a measure of how reliable the researcher
is in his or her recordings (Tinsley and Weiss 1975). Traditionally, reliability measurements
are made in one of two situations. In the first situation, such as a social worker counting aggres-
sive behaviors at a playground, observers are tasked with watching subjects and selecting
behaviors they have witnessed from a list of expected behaviors or behaviors of interest. In
the second situation, such as that same social worker scoring the intensity of a fight between
two children, observers make ratings on scales to quantify observed behavior. Unfortunately,
as much as it would be preferable, calculating inter-rater reliability is not so straightforward
for all events and behaviors.
Researchers transcribing language, particularly that of inexperienced speakers like chil-
dren, must be aware of the inherent difficulty of the transcription process. Typically, reliability
between observers is calculated via a symmetrical matrix of the potential options a rater might
code. These categories are distinct, mutually exclusive, and definitive. For example, if the cat-
egories are Red, Green, Blue and Yellow, raters are responsible for observing and coding from
these four choices and only these four choices. However, the speech of a child still in the early
stages of language development is decidedly ambiguous (Van Geert and Van Dijk 2003), and as any
parent can attest, distinguishing a child’s words is a skill developed only with much practice.
In addition to this, language provides a unique rating situation as the set of potential “cat-
egories” to be transcribed (i.e., “coded”) consists of every word or sound raters could possibly
identify. In this way, the number of categories is practically infinite. The idea of a theoretically
infinite set of categories has not yet been addressed, to our knowledge, in methodological liter-
ature. Further, despite its relevance to studies involving human speech, data on inter-rater reli-
ability calculations of speech transcriptions are few (Stockman 2010), and empirical studies of
transcriptions demonstrate significant error between coders (for discussion, see Lindsay and O’Connell 1995; Van Geert and Van Dijk 2003). For example, in a study aimed specifically at testing
inter-rater reliability, Stockman (2010) used percent agreement between raters (which is con-
siderably less stringent and less informative than Cohen’s kappa; Cohen 1960), and still found
very high levels of disagreement. Specifically, only 57 % of overall agreement was achieved
across raters on word boundary location in the spontaneous speech of preschool-aged children.¹ This particularly dismal amount of agreement evidences the difficulty of the task at hand.

¹ Theoretically, in a case such as this, the 57 % agreement can be dramatically inflated. See Table 1 for an example scenario.
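To make the inflation concrete, the following sketch (ours, not part of the original study) recomputes both quantities from the counts in Table 1: percent agreement comes out at a respectable-looking .57, while the correlation between the raters is negative (about -.27).

```python
import math

# 2x2 agreement counts from Table 1 (rows: Rater 2, columns: Rater 1)
yes_yes, yes_no = 57, 21
no_yes, no_no = 22, 0

n = yes_yes + yes_no + no_yes + no_no
percent_agreement = (yes_yes + no_no) / n            # 0.57

# Phi coefficient, i.e. Pearson r computed on the 2x2 table
numerator = yes_yes * no_no - yes_no * no_yes
denominator = math.sqrt((yes_yes + yes_no) * (no_yes + no_no)
                        * (yes_yes + no_yes) * (yes_no + no_no))
r = numerator / denominator                          # about -0.27

print(f"percent agreement = {percent_agreement:.2f}, r = {r:.2f}")
```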
We describe here data potentially more complex than even children’s language, in the
hope that the reliability techniques developed will prove useful in other situations. The
scenario presented involves the vocalizations of an African Grey parrot (Psittacus eritha-
cus) by the name of Cosmo. A previous study involved transcriptions of Cosmo’s vocal-
izations by two raters (ECW and ABK; Colbert-White et al. 2011). ECW coded Cosmo’s
vocalizations from video and ABK coded a portion of those sessions to assess reliability.
Issues of word ambiguity due to low audio clarity, as well as the lack of a pre-determined
coding scheme as discussed above, and the desire to examine reliability at two different
levels of the coding matrix, led us to develop a novel methodology for reliability calcula-
tions which featured the measurement of reliability at both the level of the overall corpus
(i.e. body of text) and smaller subsets of interest. We hope that despite development with
a non-human speaker, the techniques presented here will be helpful to those working with
human subjects.
2 General and specific calculations of reliability
For the analysis of Cosmo’s speech, it was desirable to calculate reliability in two different
areas. Primarily, it was important to calculate reliability over the entire corpus for the pur-
poses of the Colbert-White et al. (2011) manuscript. However, we also sought information on
the coding reliability for smaller subsets of the corpus. These subsets ranged from individual
word occurrences (e.g., hello) to groupings of similar words (e.g., all words that occurred at
the beginning of a phrase). Reliability data on these smaller subsets would allow us to inves-
tigate the specific causes of rater disagreement—for example, if disagreement was higher on
non-word sounds, more training might be required, or if it was higher on words that began
phrases, adjustments to the audio might be made to increase clarity. We will examine each
of these two levels of reliability separately, as the characteristics of each dictated the use of
different methodologies.
An example coding matrix can be found in Table 2. Coding for reliability was done via
one-minute intervals in which the rater transcribed every sound vocalized by Cosmo, along
with the specific time of the vocalization. As Cosmo tends to speak in diverse phrases, this
resulted in reliability matrices that were both large and sparse. In Table 2, the example matrix
contains 23 different categories coded, but many occurred only once in the course of the
minute (see, for example, the word “come”). In addition, categories for coding could not be
specified in advance. These categories were, for all intents and purposes, the words Cosmo
uttered and therefore were only designated categories after they had been coded by one or
more of the raters.
Traditionally, one of the most often used measures of inter-rater reliability is Cohen’s kappa (Cohen 1960). Unfortunately, kappa becomes less informative in situations where the coding matrix is larger than 2 × 2 and df > 1 (as was the case with our data); in these situations, kappa becomes an omnibus statistic and it becomes impossible to determine whether, for example, the raters were equally reliable in all coding options, or whether they were excellent at some coding options and poor at others (Rosenthal 2005; Rosenthal and Rosnow 2008). In some situations, even a focused kappa (i.e., 2 × 2 with df = 1) is less informative than an r value, as only an r value can yield a meaningful coefficient of determination or binomial effect size display (BESD; Rosenthal and Rubin 1982; Rosenthal 2005; Rosenthal and Rosnow 2008). We determined the best way to handle the situation was to use a kappa value for overall reliability at the level of the corpus (see below for reasons), and then to use r values to further examine reliabilities of selected subsets of data.
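For reference, here is a minimal sketch (ours, using hypothetical counts rather than Cosmo data) of the corpus-level kappa computed from a square agreement matrix. Because kappa is a single omnibus value, it cannot show that, in this toy matrix, the raters agree perfectly on two categories but poorly on the third.

```python
import numpy as np

def cohens_kappa(matrix):
    """Cohen's kappa for a square rater-by-rater agreement matrix whose rows
    and columns list the same categories in the same order."""
    m = np.asarray(matrix, dtype=float)
    total = m.sum()
    p_observed = np.trace(m) / total                                # diagonal agreement
    p_chance = (m.sum(axis=1) * m.sum(axis=0)).sum() / total ** 2   # chance agreement
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical 3-category example: perfect agreement on the first two
# categories, poor agreement on the third, yet a single kappa of about .70
print(cohens_kappa([[10, 0, 0],
                    [0, 10, 0],
                    [3, 3, 4]]))
```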
Table 2  Sample reliability coding matrix

Rows are ECW’s codes and columns are ABK’s codes; the columns follow the same order as the rows (a, bye, come, cosmo, DB, dogs, DW, for, go, gonna, good, hello, i, ID, love, MWH, null, NWM, okay, on, walk, we’re, you).

a      3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
bye    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
come   0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cosmo  0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
DB     0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
dogs   0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
DW     0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 2 0 2 0 0 0 0 0
for    0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
go     0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
gonna  0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0
good   0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
hello  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
i      0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
ID     0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
love   0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
MWH    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
null   0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
NWM    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
okay   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0
on     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
walk   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0
we’re  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0
you    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

To illustrate how the matrix is interpreted, rater ECW coded six occurrences of DW [DOG WHINE/WHIMPER], but rater ABK was only in agreement at two of these occurrences (coding the other incidences as MWH [OTHER ONE-NOTE WHISTLE] and NWM [OTHER NON-WHISTLE SOUND]).
3 Measuring reliability at the corpus level
3.1 Other available methods
Many statistics for determining inter-rater reliability are readily available; however, we found none were appropriate for our data. Brennan and Light (1974) developed a test statistic for inter-rater reliability in cases of raters selecting their own categories. The A statistic was
based upon rater classification of the behavior of two children into either the same or different
categories. The benefit of this approach was that the statistic addressed instances in which
raters agreed upon what the behavior was, as well as instances in which raters agreed upon
what the behavior was not, thus circumventing the main pitfall of the percent agreement
statistic (Kaufman and Rosenthal 2009; see Table 2). As Table 1 demonstrates, percent agreement often overestimates the actual agreement between observers, resulting in high percent agreement and an r statistic that is low or even statistically significant in the opposite
direction. Brennan and Light (1974) also assumed the marginal totals to be fixed. This is not
necessarily the case with the data presented here, as oftentimes one rater heard a vocalization
which was not heard by the other rater.
Another statistic, Yule’s Q (Montgomery and Crittenden 1977), was considered because of its common usage when reliability coding matrices are sparse—as in the case of our data. Though a Yule’s Q calculation might appear appropriate on the surface, the statistic was
originally proposed by Montgomery and Crittenden (1977) in a method in which multiple
categories from each rater were reduced to a 2×2 table by grouping perceived subcategories.
For example, in a situation in which four instances of a behavior are coded into one category
by Rater 1 but coded into three smaller, more distinct categories by Rater 2, Montgomery and
Crittenden suggested combining Rater 2’s categories to match the larger category established
by Rater 1. While this may eliminate the problem of different numbers of categories proposed
by the raters, it may also artificially inflate their agreement (as new categories are considered
subcategories of the original, and therefore may be erroneously recorded as agreements). In
addition, when a matrix larger than 2×2 is created this way, the inter-rater reliability statistic
becomes an omnibus statistic and thereby is subject to the same disadvantages as Cohen’s
kappa.
Scott’s π (Scott 1955) and Krippendorff’s α (Krippendorff 1978) are popular in content analysis research. However, Scott’s π does not account for rater bias in coding; that is, a rater’s tendency to prefer to code particular categories over others. This was a salient issue as one rater (ECW) was far more familiar with Cosmo’s everyday vocabulary than the other. Krippendorff’s α is, unfortunately, a very complex calculation which is not readily available
in traditional statistical packages.
We considered two other statistics—Hubert’s statistic and the J-Index—which also measure inter-rater agreement when categories are developed by the raters themselves (also uncharacteristic of our data set, see below for more details; Hubert 1977; Popping 1983, 1984). However, neither of these reliability statistics is commonly applied to sparse matrices larger than 2 × 2 (regardless of the complexity of the dataset).
3.2 Our method
Given the number of imperfect statistics for assessing inter-rater reliability in speech tran-
scriptions at the level of the overall matrix, we elected to use the kappa statistic supplemented
with additional measurements. These additional measurements would serve to confirm that
the reliability found by kappa was due to reliability spread out fairly evenly across all coding
123
Author's personal copy
A. B. Kaufman et al.
categories (i.e. words), as opposed to raters doing an exceptional job of coding some categories
and a poor job of coding others. We calculated reliability in these subsets via r-equivalent
(Rosenthal and Rubin 2003).
4 Measuring reliability at the token level
4.1 The nature of the categories
Cosmo’s vocalizations presented a unique challenge for assessing inter-rater reliability
because, to a naïve transcriber, Cosmo’s vocal repertoire theoretically could have contained
any word in the English language, or any non-word sound, that Cosmo could produce. Logistically, this would have made it impossible to pre-establish categories for coding. In addition, a key focus of the Colbert-White et al. (2011) study was Cosmo’s ability to create and use novel vocalizations. Providing the second observer (i.e., ABK) with Cosmo’s repertoire to use as a coding scheme would have introduced the potential for a priming bias against coding novel vocalizations as novel. For example, if “box” had been provided on an a priori list, ABK, hearing the novel utterance “bach,” may have been primed to hear (and thus record) it as “box.”
As previously mentioned, this meant that coding categories were essentially “reverse engi-
neered” for each minute. That is to say, the coding scheme for a particular minute consisted
of the set of words that had been heard at least once by either of the raters during a particular
minute of the video clip. There were two immediate consequences of this. First, a “null”
category was incorporated for instances in which one rater recorded a vocalization where
the other did not (e.g., in Table 2, ABK, but not ECW, recorded “hello” once). The second
consequence of the reverse-engineered coding scheme was the resulting necessity of a new
template for the calculations presented below, as an existing appropriate statistical package
could not be found. This template was developed in Microsoft Excel and the progression of
calculations is shown in Table 3.
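As a rough sketch of that template logic (ours; the authors worked in Excel, and the names here are illustrative rather than taken from their materials), the per-minute coding scheme is simply the union of everything either rater transcribed, and the agreement matrix tallies the aligned token pairs:

```python
from collections import Counter

def reverse_engineered_scheme(rater1_tokens, rater2_tokens):
    """Both lists are aligned token-for-token, with "null" already inserted
    wherever only one rater heard a vocalization (see the null category
    discussion below)."""
    categories = sorted(set(rater1_tokens) | set(rater2_tokens))
    cell_counts = Counter(zip(rater1_tokens, rater2_tokens))  # sparse matrix cells
    return categories, cell_counts

# Fragment of Table 4: agreement on "good", disagreements at :02, :22 and :24
abk = ["good", "bye", "NWM", "hello"]
ecw = ["good", "ID", "DW", "null"]
print(reverse_engineered_scheme(abk, ecw))
```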
4.2 The null category
The “null” category was used when one rater coded a word and the other rater coded nothing.
In these cases, the word “null” was placed in the transcription as a place holder for the
unheard word. This meant that during analysis the null category, by definition, consisted of
disagreements. When these instances were removed from calculations involving subsets of
data (see below), r values increased by .1–.2. The technique of using a null category is, as
far as we know, a novel contribution to the inter-rater reliability literature within the study of
language. As we were coding from video tape which could be re-watched, the transcription
was held to a higher standard of accuracy—by requiring a match for timestamps as well as
vocalizations. As a result, the purpose of the null category was to mark places in which only
one rater transcribed a vocalization and to assist in maintaining the temporal integrity of the
transcription (refer to the 24th second in Table 4 for an example).
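A minimal sketch of how such placeholders might be inserted when the two timestamped transcripts are paired second by second (our illustration; the authors’ actual matching rule may have differed in detail):

```python
def align_with_nulls(transcript1, transcript2):
    """Each transcript maps a timestamp (second) to the list of tokens
    transcribed at that second; unmatched tokens are paired with 'null'."""
    aligned = []
    for sec in sorted(set(transcript1) | set(transcript2)):
        a = transcript1.get(sec, [])
        b = transcript2.get(sec, [])
        for i in range(max(len(a), len(b))):
            aligned.append((sec,
                            a[i] if i < len(a) else "null",
                            b[i] if i < len(b) else "null"))
    return aligned

# Seconds :22-:26 of Table 4: ECW heard nothing at :24 and :26
abk = {22: ["NWM"], 24: ["hello"], 25: ["NWM"], 26: ["DB"]}
ecw = {22: ["DW"], 25: ["DB"]}
print(align_with_nulls(abk, ecw))
# [(22, 'NWM', 'DW'), (24, 'hello', 'null'), (25, 'NWM', 'DB'), (26, 'DB', 'null')]
```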
4.3 Calculation of r-equivalent
As with human speech, each individual minute of Cosmo’s speech contained a large number of tokens, very few of which were repeated within the minute-long reliability segments. For every vocalization coded by the observers, there were n − 1 words that the observers agreed that the coded word was not.
Table 3  Fisher’s Exact Tests for words in a minute-long reliability coding segment

        YES/YES  NO/YES  YES/NO  NO/NO  p value   t value     t²    t²+df   t²/(t²+df)=r²     r
a          3       0       0      41    .000076     4.17    17.36   59.36        0.29       0.54
bye        0       0       1      43    1.00          –        –
come       1       0       0      43    0.023        2.06     4.25   46.25        0.09       0.30
cosmo      1       0       0      43    0.023        2.06     4.25   46.25        0.09       0.30
DB         2       1       1      40    0.0093       2.45     5.98   47.98        0.12       0.35
dogs       1       0       1      42    0.045        1.73     2.99   44.99        0.07       0.26
DW         2       4       0      38    0.016        2.22     4.94   46.94        0.11       0.32
for        3       0       0      41    .000076      4.17    17.36   59.36        0.29       0.54
go         3       0       0      41    .000076      4.17    17.36   59.36        0.29       0.54
gonna      3       0       0      41    .000076      4.17    17.36   59.36        0.29       0.54
good       1       0       0      43    0.023        2.06     4.25   46.25        0.09       0.30
hello      0       0       1      43    1.00          –        –
i          1       0       0      43    0.023        2.06     4.25   46.25        0.09       0.30
ID         0       2       0      42    1.00          –        –
love       1       0       0      43    0.023        2.06     4.25   46.25        0.09       0.30
MWH        1       0       2      41    0.068        1.52     2.31   44.31        0.05       0.23
null       0       2       0      42    1.00          –        –
NWM        1       0       3      40    0.091        1.36     1.84   43.84        0.04       0.21
okay       3       0       0      41    .000076      4.17    17.36   59.36        0.29       0.54
on         1       0       0      43    0.023        2.06     4.25   46.25        0.09       0.30
walk       3       0       0      41    .000076      4.17    17.36   59.36        0.29       0.54
we’re      3       0       0      41    .000076      4.17    17.36   59.36        0.29       0.54
you        1       0       0      43    0.023        2.06     4.25   46.25        0.09       0.30

YES and NO comparisons are for ABK and ECW, respectively
Table 4  Sample reliability transcriptions

Timestamp  ABK     ECW
:02        good    good
           bye     ID
           cosmo   cosmo
           i       i
           love    love
           you     you
:08        okay    okay
           dogs    ID
           we’re   we’re
           gonna   gonna
           go      go
           for     for
           a       a
           walk    walk
:11        okay    okay
           dogs    dogs
           we’re   we’re
           gonna   gonna
           go      go
           for     for
           a       a
           walk    walk
:15        come    come
           on      on
:17        MWH     MWH
:20        DB      DB
:22        NWM     DW
:24        hello   *
:25        NWM     DB
:26        DB      *
:27        DW      DW
:35        NWM     NWM
:40        NWM     DW
:41        DB      DB
:42        DW      DW
:45        okay    okay
           we’re   we’re
           gonna   gonna
           go      go
           for     for
           a       a
           walk    walk
:48        MWH     DW
:57        MWH     DW

* null; capital letters denote non-word sounds (see Colbert-White et al. 2011 for full repertoire)
Table 5  2×2 Agreement matrix for the word “for” extracted from the Table 2 matrix

              ABK
ECW        Yes    No
  Yes        3     0
  No         0    41

p = .000076 (one tail), r(sample) = 1.00, r-equivalent = .54
Table 6  2×2 Agreement matrix for the word “on” extracted from the Table 2 matrix

              ABK
ECW        Yes    No
  Yes        1     0
  No         0    43

p = .023 (one tail), r(sample) = 1.00, r-equivalent = .30
Table 7  Calculation of r-equivalent for words that begin phrases

                  ABK
ECW          Yes      No
  Yes         87      16      103
  No          12   2,674    2,686
              99   2,690    2,789

df = 2,787

Calculation        Value          Result
Fisher’s Exact     2.7859E-134    p value
Inverse t-test     53.60928363    t value
t²                 2873.955291    t²
t²+df              5660.955291    t²+df
t²/(t²+df)         0.507680267    r²
sqrt t²/(t²+df)    0.712516854    r-equivalent

The values in the boxes are totals. For example, there were 87 instances over the entire corpus where the two raters agreed on what the word at the beginning of a phrase was, and 2,674 instances over the entire corpus when the two raters agreed on what the word at the beginning of a phrase was not.
For example, in a minute-long video clip during which Cosmo vocalized 44 tokens, ABK and ECW agreed that the word “for” was uttered three times. In this way, there were n − 3, or 41 instances (out of a total of 44 recorded tokens in the minute) in which the observers agreed that the word “for” was not spoken. As a result, any subset of data from within the larger matrix would have created a very “lopsided” 2 × 2 table which would be inappropriate for χ² analysis (see Tables 5, 6 for examples). To circumvent this problem, we elected to calculate r-equivalent for particular pairings. To do so, the probability of a particular pattern of agreements/disagreements between the observers was calculated via Fisher’s Exact Test (Rosenthal 2005; Rosenthal and Rosnow 2008). Fisher’s Exact Test provides accurate p values for low expected-value 2 × 2 contingency tables that would not fit the theoretical χ² distribution (Fisher 1941; Siegel 1956; Snedecor and Cochran 1989). From this p value, an r-equivalent could be calculated. In a sense, in the context of this methodology, the Fisher’s Exact Test was used as a tool or a means to an end, as opposed to simply providing a significance test in and of itself. Tables 7 and 8 illustrate the calculation of r-equivalent for inter-rater reliabilities on words that begin Cosmo’s phrases and words that are in the middle of Cosmo’s phrases (for example, in the phrase “Cosmo’s a good bird,” Cosmo would be categorized as beginning the phrase and a, good, and bird would be categorized as in the phrase).
Table 8  Calculation of r-equivalent for words which are inside phrases

                  ABK
ECW          Yes      No
  Yes         81      10       91
  No          19   2,738    2,757
             100   2,748    2,848

df = 2,846

Calculation        Value          Result
Fisher’s Exact     2.2098E-126    p value
Inverse t-test     50.02896524    t value
t²                 2502.897363    t²
t²+df              5348.897363    t²+df
t²/(t²+df)         0.467927723    r²
sqrt t²/(t²+df)    0.684052427    r-equivalent
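The chain of calculations shown in Tables 3, 7 and 8 (a one-tailed p from Fisher’s Exact Test, the inverse of the t distribution at that p with df = N − 2, and then r-equivalent as the square root of t²/(t² + df)) can also be scripted. The sketch below is ours, written against scipy rather than the authors’ Excel template, and reproduces two of the word-level values from Table 3:

```python
from scipy import stats

def r_equivalent(yes_yes, yes_no, no_yes, no_no):
    """r-equivalent for a 2x2 agreement table: one-tailed Fisher's exact p,
    then t from the inverse t distribution with df = N - 2, then
    r = t / sqrt(t^2 + df)."""
    n = yes_yes + yes_no + no_yes + no_no
    _, p = stats.fisher_exact([[yes_yes, yes_no], [no_yes, no_no]],
                              alternative="greater")
    df = n - 2
    t = stats.t.isf(p, df)             # upper-tail inverse of the t distribution
    return t / (t ** 2 + df) ** 0.5

# Word "for" (Tables 3 and 5): 3 joint "yes", 41 joint "no", no disagreements
print(round(r_equivalent(3, 0, 0, 41), 2))    # 0.54

# Word "DB" (Table 3): 2 agreements, 1 disagreement each way, 40 joint "no"
print(round(r_equivalent(2, 1, 1, 40), 2))    # 0.35
```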
5 Discussion
We presented here a novel technique for the calculation of inter-rater reliability in both overall corpora and smaller subsets thereof, and in cases where the categories to be coded are not specified a priori.
The original data described here consisted of a large corpus for which inter-rater reliability
could be computed; however, any resulting statistic would be an omnibus statistic. As a result,
to get a better idea of the inter-rater agreement on smaller subsets of data, 2 ×2 tables were
extracted from within the corpus for analysis. Due to the nature of the data, these 2 × 2 tables were necessarily unbalanced, and when 2 × 2 tables are not balanced, the direct computation of r from its definition can give highly misleading results (Rosenthal and Rubin 2003). Because this was the case with the data presented here, we used a different statistic, r-equivalent (Rosenthal and Rubin 2003). However, computing an accurate r-equivalent requires an accurate p value, which often cannot be obtained from a χ² in this situation. As a result, a more appropriate p value was obtained via Fisher’s Exact Test (although it should be noted that there are other ways of obtaining accurate p values using such resampling
techniques as bootstrapping or jackknifing).
In addition to the novel set of calculations used to obtain the r value, the situation presented
here was such that the coding categories were not specified in advance and therefore were
“reverse engineered” after observations were complete. There is much to be investigated
with regard to this technique and how it might impact data analysis. For example, is the .1–.2 decrease in r caused by the null category acceptable? Or, would the data be more accurately or usefully represented if the incidences in which a word is transcribed by only one rater (what was to become the null category) were simply removed? Conversely, would it be more practical to avoid such situations altogether and deem measuring accuracy in seconds too lofty a goal, making it acceptable to adjust and align the transcripts if, for example, the raters coded the same words but at times off by a second or two?
Though the situation of transcribing parrot speech is a novel one, it is our hope that the
method is useful for situations in which (1) post-hoc categories must be developed for inter-
rater reliability, (2) a null category is useful, (3) the overall matrix will yield a less helpful
omnibus statistic, and/or (4) subsets of the overall matrix are unbalanced enough to preclude
the use of a χ² test. We hope that the techniques presented here will be of use to researchers
studying speech and language, and we look forward to further discussion on the validity and
implications of our method for data analysis.
References
Brennan, R.L., Light, R.J.: Measuring agreement when two observers classify people into categories not
defined in advance. Br. J. Math. Stat. Psychol. 27, 154–163 (1974)
Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20, 37–46 (1960)
Colbert-White, E.N., Covington, M.A., Fragaszy, D.M.: Social context influences the vocalizations of a home-
raised African grey parrot (Psittacus erithacus erithacus). J. Comp. Psychol. 125, 175–184 (2011). doi:10.
1037/a0022097
Fisher, R.A.: Statistical methods for research workers. Oliver & Boyd, Edinburgh (1941)
Van Geert, P., Van Dijk, M.: Ambiguity in child language: the problem of interobserver reliability in ambiguous
observation data. First Lang. 23, 259–284 (2003)
Hubert, L.: Nominal scale response agreement as a generalized correlation. Br. J. Math. Stat. Psychol. 30,
98–103 (1977)
Kaufman, A.B., Rosenthal, R.: Can you believe my eyes? The importance of interobserver reliability statistics
in observations of animal behaviour. Anim. Behav. 78, 1487–1491 (2009)
Krippendorff, K.: Reliability of binary attribute data. Biometrics 34, 142–144 (1978)
Lindsay, J., O’Connell, D.C.: How do transcribers deal with audio recordings of spoken discourse? J. Psy-
cholinguist. Res. 24, 101–115 (1995)
Montgomery, A.C., Crittenden, K.S.: Improving coding reliability for open-ended questions. Public Opin. Q.
41, 235–243 (1977)
Popping, R.: Traces of agreement: on the DOT-product as a coefficient of agreement. Qual. Quant. 17, 1–18
(1983)
Popping, R.: Traces of agreement: on some agreement indices for open-ended questions. Qual. Quant. 18,
147–158 (1984)
Rosenthal, R.: Conducting judgment studies: some methodological issues. In: Harrigan, J., Rosenthal, R.,
Scherer, K. (eds.) The new handbook of methods in nonverbal behavior research, pp. 199–236. Oxford
University Press, New York (2005)
Rosenthal, R., Rubin, D.B.: A simple, general purpose display of magnitude of experimental effect. J. Educ.
Psychol. 74, 166–169 (1982)
Rosenthal, R., Rubin, D.B.: r-equivalent: a simple effect size indicator. Psychol. Methods 8, 492–496 (2003)
Rosenthal, R., Rosnow, R.: Essentials of behavioral research: methods and data analysis. McGraw-Hill, New
York (2008)
Scott, W.: Reliability of content analysis: the case of nominal scale coding. Public Opin. Q. 17, 321–325
(1955)
Siegel, S.: Nonparametric statistics for the behavioral sciences. McGraw-Hill, New York (1956)
Snedecor, G.W., Cochran, W.G.: Statistical methods. Iowa State University Press, Ames (1989)
Stockman, I.: Listener reliability in assigning utterance boundaries in children’s spontaneous speech. Appl.
Psycholinguist. 31, 363–395 (2010)
Tinsley, H.E.A., Weiss, D.J.: Interrater reliability and agreement of subjective judgments. J. Couns. Psychol.
22, 358–376 (1975)