British Journal of Developmental Psychology (2017)
©2017 The British Psychological Society
www.wileyonlinelibrary.com
Brief report
Reliability and validity of advanced theory-of-mind
measures in middle childhood and adolescence
Elizabeth O. Hayward¹* and Bruce D. Homer²
¹New York University, New York, USA
²The Graduate Center, City University of New York, New York, USA
Although theory-of-mind (ToM) development is well documented for early childhood,
there is increasing research investigating changes in ToM reasoning in middle childhood and
adolescence. However, the psychometric properties of most advanced ToM measures for
use with older children and adolescents have not been firmly established. We report on the
reliability and validity of widely used, conventional measures of advanced ToM with this age
group. Notable issues with both reliability and validity of several of the measures were
evident in the findings. With regard to construct validity, results do not reveal a clear
empirical commonality between tasks, and, after accounting for comprehension,
developmental trends were evident in only one of the tasks investigated.
Statement of contribution
What is already known on this subject?
• Second-order false belief tasks have acceptable internal consistency.
• The Eyes Test has poor internal consistency.
• Validity of advanced theory-of-mind tasks is often based on the ability to distinguish clinical from
typical groups.
What does this study add?
• This study examines internal consistency across six widely used advanced theory-of-mind tasks.
• It investigates validity of tasks based on comprehension of items by typically developing individuals.
• It further assesses construct validity, or commonality between tasks.
Despite consensus regarding measurement of theory of mind (ToM) in preschool, how
best to assess mental state reasoning beyond preschool is a topic of debate (Miller, 2012).
Two widely used measures of advanced ToM, second-order false belief and interpretive
tasks, assess an individual’s ability to reconcile multiple beliefs (Astington, Pelletier, &
Homer, 2002; Carpendale & Chandler, 1996). Second-order false belief tasks are concerned
with the understanding that a particular belief can motivate behaviour. In these tasks,
children are asked to predict a character’s actions based on that character’s false belief about
another character’s belief. Perner and Wimmer (1985) first developed this type of task,
extending the false belief paradigm from beliefs about locations to beliefs about beliefs.
In an interpretive task, a child is shown two interpretations of ambiguous stimuli and
then asked to judge others’ interpretations of those stimuli. Interpretive tasks are
*Correspondence should be addressed to Elizabeth O. Hayward, New York University, New York, NY, USA (email:
elizabeth.hayward@nyu.edu).
DOI:10.1111/bjdp.12186
concerned with the understanding that multiple people can have many different beliefs.
Carpendale and Chandler (1996) hypothesized that children achieve false belief
understanding several years before developing an appreciation of the interpretive nature
of the knowing process. These measures introduce stimuli that provide equal support for
two distinct interpretations.
Advanced ToM has also been assessed using the Strange Stories task, which requires
interpreting non-literal statements, such as ironic jokes, lies, and gaffes, in the context of
social narratives (Happé, 1994). Similarly, the Faux-pas Recognition task assesses whether
children can accurately recognize when a social faux pas has occurred (Baron-Cohen,
O’Riordan, Stone, Jones, & Plaisted, 1999). In both tasks, given a short story, participants
are asked to make an inference about the beliefs of the characters. The Strange Stories and
Faux Pas tasks were both originally developed to illuminate the difference in ToM
reasoning between children with autism and those who are typically developing.
The Reading-the-Mind-in-the-Eyes test seeks to measure ToM ability through accuracy in
reading states of the mind from images of eyes (Baron-Cohen, Wheelwright, Spong, Scahill, &
Lawson, 2001). As a measure of advanced ToM (Baron-Cohen et al., 2015), the Eyes Test is
concerned with children’s ability to infer mental states, typically affective states, given
minimal information about the individual or social context. In this 28-item test, participants
must select the appropriate affective term matching an image of eyes. This task was originally
designed to capture subtle deficits in social cognition in individuals on the autism spectrum.
The internal consistency reliability of second-order false belief tasks, similar to those
designed by Perner and Wimmer (1985), has been assessed and found to be acceptable
(Hughes et al., 2000). There is a scarcity of research establishing internal consistency in other
advanced ToM measures. The internal consistency of the Eyes Test has been found to be poor
(Harkness, Jacobson, Duong, & Sabbagh, 2010; Olderbak et al., 2015; Vellante et al., 2013;
Voracek & Dressler, 2006). Regarding validity, recent work has made a case for the validity of
the Strange Stories in middle childhood (Devine & Hughes, 2016). Although several of these
tasks identify social-cognitive deficits in clinical populations, little research has explored the
validity of most of these measures when used with typically developing groups.
Therefore, this study aimed to evaluate the reliability and validity of widely used
measures of advanced ToM with typically developing children ages 7–13. Although other
measures of advanced ToM have been developed, and in some cases validated (e.g., Bosco,
Gabbatore, Tirassa, & Testa, 2016; Devine & Hughes, 2013; Hayward, Homer, & Sprung,
2016; Hutchins, Prelock, & Bonazinga, 2012; Sivaratnam, Cornish, Gray, Howlin, &
Rinehart, 2012), second-order false belief tasks, interpretive tasks, the Strange Stories, the
Faux Pas test, and the Reading-the-Mind-in-the-Eyes task remain some of the most broadly
used ToM measures. The overall comprehension of these tasks was also examined, as valid
measurement hinges on the assumption that participants comprehend the materials
(Fantuzzo, McDermott, Manz, & Hampton, 1996). Regarding construct validity, as evident
in associations between tasks, we predicted that these tasks would be moderately
intercorrelated. Finally, developmental changes in performance on the advanced ToM tasks
were examined, and we predicted that task performance would reflect age-related trends.
Materials and method
Participants
Children (N = 112) aged 7:5–13:5 years were recruited from an independent school and
summer day camp programme in New York City. There were 64 (57%) females and 48
(43%) males. The population from which the sample was recruited is primarily middle-
and upper-class. Demographic data on the participants’ race and ethnicity were not
collected; school-level data indicate that the student population is 72% White/Caucasian,
10% Black/African American, 6% Asian/Pacific Islander, and 4% Hispanic. Two of the 112
participants were lost due to attrition after a single testing session, resulting in partial data.
Measures
The measures for this study were commonly used advanced ToM tasks. Except where
noted, all tasks were administered and scored as described in the original studies. The
measures were as follows: two second-order false belief tasks (Astington et al., 2002);
two interpretive ambiguous figure tasks, in which the child is asked what a character will
think in response to an ambiguous line drawing (Carpendale & Chandler, 1996); two
interpretive restricted-view tasks, in which the child is asked to guess what a character
will think of a restricted-view picture (Lalonde & Chandler, 2002); the original 24 Strange
Stories vignettes (Happé, 1994); the 10 Faux Pas vignettes (Baron-Cohen et al., 1999);
and the 28-item children’s Reading-the-Mind-in-the-Eyes test (Baron-Cohen et al., 2001).
The Astington et al. (2002) second-order false belief tasks were used in an attempt to
adequately capture age-related variation in second-order reasoning in children over age 7,
while also minimizing information processing demands (Sullivan, Zaitchik, & Tager-
Flusberg, 1994). For all tasks, participants were awarded 1 point for each item answered
correctly, as scored by their original authors. This resulted in one total score for each task,
with the exception of the Strange Stories, for which an additional ‘mentalizing’ score was
included. This score reflects the presence or absence of mental states employed to justify
the utterances of characters in each story, as outlined by Happé (1994).
Comprehension scores were calculated for those tasks that included control
questions. The range for comprehension scores varied by task, as follows: second-order
false belief tasks, 0–9; interpretive restricted-view tasks, 0–2; the Strange Stories, 0–24;
and the Faux Pas task, 0–20. The interpretive ambiguous figure tasks and Eyes Test did not
include comprehension questions.
Procedure
Children were tested individually (ages 7–8) or in a small group (ages 9–13) in a quiet room
in their school or camp by one of three researchers. Each participant received a packet
containing two second-order false belief tasks, two interpretive ambiguous figure tasks,
two interpretive restricted-view tasks, the Strange Stories test, the Faux Pas vignettes, and
the Eyes Test, which was completed over the course of two 30- to 45-minute sessions.
Task order was counterbalanced. Materials were read aloud to all participants.
Participants ages 9–13 were randomly assigned to groups.
Results
Reliability
An α level of .70 or above is recognized as indicating acceptable internal consistency,
while those between .60 and .70 are considered undesirable or minimally acceptable, and
those below .60 are unacceptable (Devellis, 2012). Cronbach’s alpha coefficients for the
tasks were as follows: second-order false belief, α = .53; interpretive ambiguous figure,
α = .77; the interpretive restricted-view, α = .62; the Strange Stories, α = .65; the Strange
Stories mentalizing, α = .73; the Faux Pas, α = .78; and the Eyes, α = .41.
To examine whether internal consistency varied by age, we calculated Cronbach’s
alpha coefficients for three age groups (7–8 years, n = 37; 9–10 years, n = 37; and
11–12 years, n = 38) for each measure. Results are presented in Table 1. Among the
10-year-olds, there was insufficient variance to calculate the Cronbach’s alpha coefficient
for the second-order false belief, as all but one response to one item were correct.
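For readers who wish to reproduce this kind of reliability check, the following is a minimal sketch of how Cronbach's alpha can be computed from item-level scores and then labelled with the Devellis (2012) cut-offs quoted above. It is not the analysis code used in this study: the function names and the example data are hypothetical, and treating a value of exactly .60 as "undesirable or minimally acceptable" is an assumption.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of participant rows (one score per item)."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Sample variance of each item (column) and of the total score (row sum).
    item_variances = [variance(column) for column in zip(*rows)]
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        return None  # no variance in total scores, so alpha is undefined
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def classify_alpha(alpha):
    """Label alpha using the cut-offs cited in the text (Devellis, 2012)."""
    if alpha is None:
        return "not computable"
    if alpha >= 0.70:
        return "acceptable"
    if alpha >= 0.60:  # assumption: exactly .60 falls in the middle band
        return "undesirable or minimally acceptable"
    return "unacceptable"


# Hypothetical two-item task scored 0/1 for four participants.
hypothetical_scores = [[1, 1], [0, 0], [1, 1], [0, 0]]
alpha = cronbach_alpha(hypothetical_scores)
print(alpha, classify_alpha(alpha))
```

Note that this sketch returns no value only when total scores do not vary at all; statistical packages may also refuse to compute alpha when a single item has no variance.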
Task comprehension
Proficiency on the comprehension or memory questions was assessed for those tasks that
included such questions. For the second-order false belief tasks, participants performed
well on comprehension questions (M = 8.64, SD = 0.63), as most participants (71.4%)
responded correctly to all nine questions. On the interpretive restricted-view task,
performance was also strong (M = 1.92, SD = 0.32), with 93% of participants answering
both memory questions correctly.
Notable issues with comprehension were evident for both the Strange Stories and the
Faux Pas tasks. On the Strange Stories task, which included a single comprehension
question per story, the average score for the comprehension questions was 21.84
(SD = 2.4). However, only 12.5% of participants answered all comprehension questions
correctly, suggesting significant issues of comprehension. On one item, the percentage of
participants who correctly answered the comprehension question was as low as 56.8%,
suggesting that many participants were not recognizing the non-literal statement. On the
Faux Pas task, which included two comprehension questions per story, responses to
comprehension questions were variable (M = 18.8, SD = 1.6), with 49.1% of participants
answering all comprehension questions correctly.
To circumvent these comprehension concerns in assessing associations between
tasks, items for which the comprehension questions were answered correctly by <95% of
the sample were excluded. This resulted in an 11-story Strange Stories set, excluding two
each of the Pretend, Joke, Misunderstanding, and Double Bluff stories, and one each of
the Figure of Speech, Persuade, Contrary Emotions, Appearance/Reality, and Forget
stories; the remaining stories showed strong comprehension (M = 10.78, SD = 0.44). An
abbreviated six-vignette Faux Pas task was formed, excluding the Story Competition,
Lunch Lady, Sally-Mary, and Surprise Party stories, again with strong comprehension
(M = 11.64, SD = 0.96).

Table 1. Descriptive statistics and Cronbach’s alpha coefficients by task and age

                              7–8 years          9–10 years         11–12 years
Task                          M (SD)       α     M (SD)       α     M (SD)       α
Second-Order FB (2)           1.81 (0.52)  .72   1.89 (0.31)  a     1.89 (0.39)  .66
Ambiguous Figure (2)          1.84 (1.32)  .68   1.78 (1.33)  .81   1.51 (1.43)  .81
Restricted-view Task (2)      1.05 (0.91)  .77   1.53 (0.70)  .51   1.38 (0.76)  .47
Strange Stories (24)          17.76 (3.34) .73   19.08 (1.92) .25   19.92 (2.63) .65
Strange Stories Mental (24)   18.03 (3.86) .79   19.36 (3.16) .70   18.57 (3.30) .70
Faux Pas (10)                 6.30 (2.59)  .76   6.83 (2.55)  .77   6.37 (2.91)  .82
Eyes Task (28)                19.28 (3.00) .44   19.69 (2.98) .44   20.29 (2.65) .33

a Insufficient variance in responses to calculate Cronbach’s alpha coefficient.
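The item-exclusion rule described above (dropping any item whose comprehension questions were answered correctly by fewer than 95% of the sample) can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually run: the function name, the item labels, and the response data are hypothetical, and each item is assumed to have one 0/1 comprehension outcome per participant.

```python
def items_to_keep(comprehension, threshold_pct=95):
    """Return the items whose comprehension rate is at least threshold_pct.

    comprehension maps an item name to a list of 0/1 values, where 1 means
    the participant answered that item's comprehension question correctly.
    """
    kept = []
    for item, responses in comprehension.items():
        correct = sum(responses)
        # Integer comparison avoids floating-point issues at the boundary.
        if correct * 100 >= threshold_pct * len(responses):
            kept.append(item)
    return kept


# Hypothetical comprehension data for three stories and 20 participants.
hypothetical = {
    "story_a": [1] * 20,             # 100% correct: retained
    "story_b": [1] * 19 + [0],       # 95% correct: retained
    "story_c": [1] * 11 + [0] * 9,   # 55% correct: excluded
}
print(items_to_keep(hypothetical))
```

Task scores for the abbreviated sets would then be summed over the retained items only.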
Task associations
Correlational analyses were conducted to investigate associations between tasks (see
Table 2). Three outliers (scores more than 2 SD below the mean) were excluded, resulting in a
sample of 107 for these analyses. Bivariate correlation analyses revealed a significant
association between age and the interpretive restricted-view task, r(106) = .19, p = .048.
Partial correlation analyses between the six tasks, controlling for age, revealed associations
between the second-order false belief task and the interpretive restricted-view task, the
abbreviated Strange Stories mentalizing score and the Eyes Test, and the abbreviated Strange
Stories total score and the abbreviated Strange Stories mentalizing score. Spearman rho
correlations were also conducted to account for the varied range of the measures (i.e., from
0–2 to 0–28); results confirmed similar associations.
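To make the logic of a partial correlation concrete, the sketch below computes the first-order partial correlation between two task scores while controlling for a third variable such as age, using the standard formula based on the three pairwise Pearson correlations. The function names and the example data are hypothetical and are not drawn from this study's data set.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


def partial_r(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical scores on two tasks and ages in years for four children.
task_a = [2, 1, 2, 5]
task_b = [0, 3, 4, 3]
age = [1, 2, 3, 4]
print(pearson_r(task_a, task_b), partial_r(task_a, task_b, age))
```

In this constructed example the two tasks correlate weakly and positively only because both increase with age; once age is partialled out, the remaining association is negative. This is the kind of confound that controlling for age is meant to remove.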
Developmental change
To further examine whether there were significant developmental changes in performance
on the tasks, a series of one-way ANOVAs were conducted with age group (7–8 years,
n = 37; 9–10 years, n = 37; and 11–12 years, n = 38) as the independent
variable and scores on the advanced ToM tasks as dependent variables. The abbreviated
Strange Stories and abbreviated Faux Pas sets were employed in place of the original tasks.
Age differences were found only for the interpretive restricted-view task,
F(2, 107) = 3.40, p = .037. Planned contrasts (one-tailed) indicated that the youngest age
group (7–8 years; M = 1.05, SD = .91) was weaker in their performance as compared to
both the middle age group (9–10 years; M = 1.53, SD = .70) and the oldest age group
(11–12 years; M = 1.38, SD = .76), which were statistically equivalent.
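As an illustration of the test statistic reported here, the following sketch computes the one-way ANOVA F ratio and its degrees of freedom from raw scores grouped by age band. The function name and the example scores are hypothetical; the sketch stops at the F ratio, since obtaining a p value additionally requires the F distribution (available in standard statistics libraries).

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [score for group in groups for score in group]
    grand_mean = sum(all_scores) / len(all_scores)
    group_means = [sum(group) / len(group) for group in groups]

    # Between-groups sum of squares: group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Within-groups sum of squares: scores around their own group mean.
    ss_within = sum(
        (score - mean) ** 2
        for group, mean in zip(groups, group_means)
        for score in group
    )

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical task scores for three age groups.
youngest, middle, oldest = [1, 2, 3], [2, 3, 4], [5, 6, 7]
print(one_way_anova([youngest, middle, oldest]))
```

With three age groups the between-groups degrees of freedom are always 2, which matches the form F(2, 107) reported above.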
Discussion
These results raise questions about the reliability and validity of several measures of
advanced ToM. The internal consistency for these six tasks ranged widely. The second-
order false belief task was found to have unacceptable internal consistency, although this
is likely due to lack of variance because of a ceiling effect. Two measures, the interpretive
ambiguous figures and the Faux Pas test, demonstrated acceptable internal consistency.
The Strange Stories mentalizing score also demonstrated acceptable internal consistency.
The interpretive restricted-view task and the Strange Stories task had undesirable levels of
internal consistency; in the case of the interpretive restricted-view task, internal
consistency decreased with age across the three groups. The Eyes Test demonstrated
unacceptably low internal consistency across age groups, confirming previous findings
(Olderbak et al., 2015). No clear age-related trends emerged, casting doubt on the notion
that any single measure is particularly more reliable with younger versus older children, or
vice versa.
Comprehension performance on the Strange Stories and the Faux Pas vignettes raised
concerns around the validity of these widely used measures (Fantuzzo et al., 1996). The
current data suggest these tasks in their original form may not be appropriate with this
population. Future research with older children and adolescents should employ only
those items with the highest rates of comprehension.

Table 2. Descriptive statistics, correlations with age, and partial correlations between tasks

Measure                                     M (SD)        Age (r)   2       3       4       5       6       7
1. Second-Order FB (2)                      1.86 (0.42)    .09      .05     .22*    .12     .06    −.002    .12
2. Ambig Figure (2)                         1.71 (1.36)   −.06              .11     .07    −.05     .002    .002
3. Restricted-view Task (2)                 1.32 (0.81)    .19*                    −.02    −.03    −.09    −.07
4. Abbr. Strange Stories (11)               9.06 (1.34)    .18                              .22*    .03     .12
5. Abbr. Strange Stories Mentalizing (11)   9.00 (1.67)   −.04                                      .09     .19*
6. Abbr. Faux Pas (6)                       4.23 (1.77)   −.08                                              .02
7. Eyes Task (28)                           19.68 (2.90)   .17

Note. Columns 2–7 give partial correlations (pr), controlling for age, with the correspondingly numbered measure.
*p < .05

Previous research presents mixed results on associations between these types of tasks
(Brent, Rios, Happé, & Charman, 2004; Mitroff, Sobel, & Gopnik, 2006). When
considering only those tasks with sound comprehension, the current results fail to
provide evidence of validity of a unified advanced ToM construct in children between the
ages of 7 and 13. It is possible that the underlying abilities assessed by these measures form
at best a constellation of loosely related social-cognitive skills.
Developmental trends were found only for the interpretive restricted-view task:
Performance on the other advanced ToM tasks did not improve with age. Older children
evidently approach ceiling on second-order false belief and interpretive ambiguous figure
tasks, limiting the utility of these as measures of advanced ToM with older groups.
Previous research has demonstrated an effect for age on the Strange Stories and Faux Pas
task (Banerjee, 2000; Banerjee & Watling, 2005; Baron-Cohen et al., 1999; O’Hare,
Bremner, Nash, Happé, & Pettigrew, 2009). However, not all research has identified
age-related trends in Strange Stories performance during adolescence (Bosco, Gabbatore, &
Tirassa, 2014). Given the comprehension difficulties documented here, the variability that
was attributed to age in some research may have been related to issues of comprehension.
Consistent with the current results, typically developing children tend to perform well on
the Eyes Test from 7 years of age, limiting variability due to age in older groups (Brent
et al., 2004; Dorris, Espie, Knott, & Salt, 2004). However, Baron-Cohen et al. (2001) did
find an effect for age in performance on the Eyes Test, such that 8- to 10-year-olds
outperformed 6- to 8-year-olds, in contrast to the current findings.
The socioeconomic and cultural homogeneity of the current sample is a limitation with
regard to the generalizability of the findings. However, the minimal diversity in the current
sample does ensure that these findings can be interpreted in the context of previous work
in this field (Miller, 2012). Nonetheless, future work should employ these tasks with more
heterogeneous populations. Despite these limitations, the current study highlights the
issues with both reliability and validity of several advanced ToM tasks. These results
emphasize the need for advanced ToM measures that accurately capture developments in
social cognition beyond early childhood.
Acknowledgements
The authors would like to thank the individuals who participated in this research. They would
also like to express their gratitude to Yolanta Kornak and Seamus Donnelly for their assistance
with data collection.
References
Astington, J. W., Pelletier, J., & Homer, B. (2002). ToM and epistemological development: The
relation between children’s second-order false-belief understanding and their ability to reason
about evidence. New Ideas in Psychology. Special Issue: Folk Epistemology, 20(2–3), 131–144.
https://doi.org/10.1016/s0732-118x(02)00005-3
Banerjee, R. (2000). The development of an understanding of modesty. British Journal of
Developmental Psychology, 18(4), 499–517. https://doi.org/10.1348/026151000165823
Banerjee, R., & Watling, D. (2005). Children’s understanding of faux pas: Associations with peer
relations. Hellenic Journal of Psychology, 2(1), 27–45.
Baron-Cohen, S., Bowen, D. C., Holt, R. J., Allison, C., Auyeung, B., Lombardo, M. V., ... Lai, M.
(2015). The “Reading the Mind in the Eyes” Test: Complete absence of typical sex difference in
~400 men and women with autism. PLoS ONE, 10(8), e0136521.
https://doi.org/10.1371/journal.pone.0136521
Baron-Cohen, S., O’Riordan, M., Stone, V., Jones, R., & Plaisted, K. (1999). Recognition of faux pas by
normally developing children and children with Asperger syndrome or high-functioning autism.
Journal of Autism and Developmental Disorders, 29(5), 407–418.
https://doi.org/10.1023/A:1023035012436
Baron-Cohen, S., Wheelwright, S., Spong, A., Scahill, V., & Lawson, J. (2001). Studies of ToM: Are
intuitive physics and intuitive psychology independent? Journal of Developmental and
Learning Disorders, 5(1), 51–82.
Bosco, F. M., Gabbatore, I., & Tirassa, M. (2014). A broad assessment of theory of mind in
adolescence: The complexity of mindreading. Consciousness and Cognition, 24, 84–97.
https://doi.org/10.1016/j.concog.2014.01.003
Bosco, F. M., Gabbatore, I., Tirassa, M., & Testa, S. (2016). Psychometric properties of the theory of
mind assessment scale in a sample of adolescents and adults. Frontiers in Psychology, 7, 566.
https://doi.org/10.3389/fpsyg.2016.00566
Brent, E., Rios, P., Happé, F., & Charman, T. (2004). Performance of children with autism spectrum
disorder on advanced ToM tasks. Autism: The International Journal of Research and Practice,
8(3), 283–299. https://doi.org/10.1177/1362361304045217
Carpendale, J. I., & Chandler, M. J. (1996). On the distinction between false belief understanding and
subscribing to an interpretive ToM. Child Development, 67, 1686–1706.
https://doi.org/10.2307/1131725
Devellis, R. F. (2012). Scale development (3rd ed.). Thousand Oaks, CA: Sage Publications.
Devine, R. T., & Hughes, C. (2013). Silent films and strange stories: Theory of mind, gender, and
social experiences in middle childhood. Child Development, 84, 989–1003.
https://doi.org/10.1111/cdev.12017
Devine, R. T., & Hughes, C. (2016). Measuring theory of mind across middle childhood: Reliability
and validity of the Silent Films and Strange Stories tasks. Journal of Experimental Child
Psychology, 149, 23–40. https://doi.org/10.1016/j.jecp.2015.07.011
Dorris, L., Espie, C. A. E., Knott, F., & Salt, J. (2004). Mind-reading difficulties in the siblings of people
with Asperger’s syndrome: Evidence for a genetic influence in the abnormal development of a
specific cognitive domain. Journal of Child Psychology and Psychiatry, 45, 412–418.
https://doi.org/10.1111/j.1469-7610.2004.00232.x
Fantuzzo, J. W., McDermott, P. A., Manz, P. H., & Hampton, V. R. (1996). The pictorial scale of
perceived competence and social acceptance: Does it work with low-income urban children?
Child Development, 67, 1071–1084. https://doi.org/10.2307/1129772
Happé, F. G. (1994). An advanced test of ToM: Understanding of story characters’ thoughts and
feelings by able Autistic, Mentally Handicapped, and normal children and adults. Journal of
Autism and Developmental Disorders, 24(2), 129–154. https://doi.org/10.1007/BF02172093
Harkness, K. L., Jacobson, J. A., Duong, D., & Sabbagh, M. A. (2010). Mental state decoding in past
major depression: Effect of sad versus happy mood induction. Cognition & Emotion, 24(3),
497–513. https://doi.org/10.1080/02699930902750249
Hayward, E. O., Homer, B. D., & Sprung, M. (2016). Developmental trends in flexibility and
automaticity of social cognition. Child Development. Advance online publication.
https://doi.org/10.1111/cdev.12705
Hughes, C., Adlam, A., Happé, F., Jackson, J., Taylor, A., & Caspi, A. (2000). Good test–retest
reliability for standard and advanced false-belief tasks across a wide range of abilities. Journal of
Child Psychology and Psychiatry, 41(4), 483–490. https://doi.org/10.1017/s0021963099005533
Hutchins, T. L., Prelock, P. A., & Bonazinga, L. (2012). Psychometric evaluation of the theory of mind
inventory (ToMI): A study of typically developing children and children with autism spectrum
disorder. Journal of Autism and Developmental Disorders, 42, 327–341.
https://doi.org/10.1007/s10803-011-1244-7
Lalonde, C. E., & Chandler, M. J. (2002). Children’s understanding of interpretation. New Ideas in
Psychology. Special Issue: Folk Epistemology, 20, 163–198.
https://doi.org/10.1016/S0732-118X(02)00007-7
Miller, S. A. (2012). Theory of mind: Beyond the preschool years. New York, NY: Psychology Press.
Mitroff, S. R., Sobel, D. M., & Gopnik, A. (2006). Reversing how to think about ambiguous figure
reversals: Spontaneous alternating by uninformed observers. Perception, 35, 709–715.
https://doi.org/10.1167/6.6.52
O’Hare, A. E., Bremner, L., Nash, M., Happé, F., & Pettigrew, L. M. (2009). A clinical assessment tool
for advanced ToM performance in 5 to 12 year olds. Journal of Autism and Developmental
Disorders, 39, 916–928. https://doi.org/10.1007/s10803-009-0699-2
Olderbak, S., Wilhelm, O., Olaru, G., Geiger, M., Brenneman, M. W., & Roberts, R. D. (2015). A
psychometric analysis of the reading the mind in the Eyes Test: Toward a brief form for research
and applied settings. Frontiers in Psychology, 6, 1503.
https://doi.org/10.3389/fpsyg.2015.01503
Perner, J., & Wimmer, H. (1985). “John thinks that Mary thinks that.”: Attribution of second-order
beliefs by 5- to 10-year-old children. Journal of Experimental Child Psychology, 39, 437–471.
https://doi.org/10.1016/0022-0965(85)90051-7
Sivaratnam, C. S., Cornish, K., Gray, K. M., Howlin, P., & Rinehart, N. J. (2012). Brief report:
Assessment of the social-emotional profile in children with autism spectrum disorders using a
novel comic strip task. Journal of Autism and Developmental Disorders, 42, 2505–2512.
https://doi.org/10.1007/s10803-012-1498-8
Sullivan, K., Zaitchik, D., & Tager-Flusberg, H. (1994). Preschoolers can attribute second-order
beliefs. Developmental Psychology, 30(3), 395–402.
https://doi.org/10.1037/0012-1649.30.3.395
Vellante, M., Baron-Cohen, S., Melis, M., Marrone, M., Petretto, D. R., Masala, C., & Preti, A. (2013).
The “Reading the Mind in the Eyes” test: Systematic review of psychometric properties and a
validation study in Italy. Cognitive Neuropsychiatry, 18, 326–354.
https://doi.org/10.1080/13546805.2012.721728
Voracek, M., & Dressler, S. G. (2006). High (feminized) digit ratio (2D:4D) in Danish men: A question
of measurement method? Human Reproduction, 21, 1329–1331.
https://doi.org/10.1093/humrep/dei464
Received 16 August 2016; revised version received 13 March 2017
Reliability and validity of advanced theory of mind 9
... On the one hand, measurements that should be related are, in fact, not always related. For instance, Hayward and Homer (2017) found that Second-Order False Belief and Strange Stories tasks were unrelated among 7-13-year-olds (n = 107; r = 0.06). Also, measurements that should be unrelated in the Quesque and Rossetti analysis sometimes actually are related. ...
... Inconsistencies in the literature may also stem from different age ranges found between samples. There was a narrower age range in Osterhaus et al. (2016) than in Hayward and Homer (2017). In general, the narrower the age range, the less likelihood there is of conflating developmental and individual differences (see Devine, 2021). ...
... The internal consistency of the ToM measures was not strong, overall, in this study. That result is consistent with findings in samples of children and adolescents (e.g., Hayward and Homer, 2017;Osterhaus et al., 2016). The weak correlations could reflect, to some extent at least, the psychometric limitations of the instruments themselves. ...
Article
Full-text available
There are conflicting proposals about the underlying structure of the theory of mind (ToM) construct. The lack of clarity impedes attempts to understand relationships between ToM and other cognitive abilities. This study investigated the nature of the ToM construct and its relation to cognitive variables by administering a battery of ToM measurements along with measurements of executive function and general vocabulary to 207 (Mage = 19.26) adult participants. Associations between ToM tasks were statistically significant after controlling for covariates, but, for the most part, very weak in magnitude. The strongest relationship was between the Strange Stories and Higher-Order False Belief measurements. Previous theoretical analysis proposes those instruments are conceptually linked by a perspective taking requirement that entails representing another’s mental state. Results from a factor analysis suggested an underlying ToM structure—a protagonist perspective factor. The Strange Stories, Higher-Order False Belief, and Frith-Happé Animation tasks loaded onto the factor. Its defining feature is the ascription of mental states to predict and explain protagonists’ actions that take place within a narrative structure. It is related more strongly to vocabulary than executive function and it provides grounds for future research on the role of narrative processing in ToM reasoning.
... ToM tasks with satisfactory reliability are crucial for accurately investigating differences between individuals, between groups, and over time in clinical and research contexts (Davidson et al., 2018; Osterhaus & Bosacki, 2022). Reviews and large-scale research have examined the psychometric properties of ToM tasks in children and adolescents (Ahmadi et al., 2015; Beaudoin et al., 2020; Fu et al., 2023; Hayward & Homer, 2017; Poll et al., 2023), adults (Gourlay et al., 2020; Klein et al., 2022 (SCOPE); Yeung et al., 2024), individuals with SZ (Davidson et al., 2018; Pinkham et al., 2014 (SCOPE); Yeh et al., 2021), ASD (Morrison et al., 2019 (SCOPE)), and other neuropsychiatric populations (Eddy, 2019). Results from the large-scale SCOPE project have recommended the use of the Hinting Task for SZ and ASD, as well as The Awareness of Social Inference Test-Part three for ASD, but no ToM tasks were recommended for the nonclinical population based on their psychometric properties. ...
... It is possible that these ToM tasks were primarily designed for ASD and SZ populations, who are likely to have prominent social cognition deficits. Thus, the tasks are sensitive to "ToM impairments" rather than capturing the general variation in "ToM ability" which, similar to neurocognitive functions, should be diverse in the general population (Conway et al., 2019; Fu et al., 2023; Gernsbacher & Yergeau, 2019; Hayward & Homer, 2017; Holt et al., 2022; Marocchini, 2023; Yeung et al., 2024). Individual differences in ToM ability might reflect how easily and fluently adults attribute mental states to others (Hughes & Devine, 2015), or how individuals build internal representations of others' mental states based on a multidimensional "Mind-space" framework (Conway et al., 2019). ...
Article
Although theory of mind (ToM) is an important area of study for different disciplines, psychometric evaluations of ToM tasks have yielded inconsistent results across studies and populations, raising concerns about the accuracy, consistency, and generalizability of these tasks. This systematic review and meta-analysis examined the psychometric reliability of 27 distinct ToM tasks across 90 studies involving 2771 individuals with schizophrenia (SZ), 690 individuals with autism spectrum disorder (ASD), and 15,599 nonclinical participants (NC). Findings revealed that while all ToM tasks exhibited satisfactory internal consistency in ASD and SZ, about half of them were not satisfactory in NC, including the commonly used Reading the Mind in the Eyes Test and Hinting Task. Apart from this, the Reading the Mind in the Eyes Test showed acceptable reliability across populations, whereas the Hinting Task had poor test–retest reliability. Notably, only the Faux Pas Test and the Movie for the Assessment of Social Cognition had satisfactory reliability across populations, albeit based on a limited number of studies. However, only ten studies examined the psychometric properties of ToM tasks in ASD adults, warranting additional evaluations. The study offered practical implications for selecting ToM tasks in research and clinical settings, and underscored the importance of robust psychometric reliability in ToM tasks across populations.
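Internal consistency, the property this review evaluates, is most often summarized with Cronbach's alpha. The abstract does not say which coefficient each included study reported, so the following is only a minimal sketch of the standard alpha formula, assuming a complete respondents-by-items score matrix. The function name `cronbach_alpha` and the example data are hypothetical, not taken from the review.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items score matrix.

    `scores` is a list of rows, one per respondent; each row holds that
    respondent's score on every item of the task.
    """
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    if any(len(row) != n_items for row in scores):
        raise ValueError("every respondent needs a score for every item")

    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in scores]) for i in range(n_items)]
    # Sample variance of the respondents' total scores.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")

    # alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: four respondents, three dichotomously scored items.
example = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 0, 0]]
print(round(cronbach_alpha(example), 3))
```

Alpha rises when items covary strongly relative to their individual variances. This is one reason a task with heterogeneous items can show low internal consistency in a nonclinical sample where scores vary little.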
... A field of research in the literature on ToM in adolescence is concerned with the validity of measurement tools and measures to assess ToM in adolescence (Hayward and Homer, 2017), such as the Animated Triangle Task (Andersen et al., 2022), the Theory of Mind Assessment Scale (Bosco et al., 2014), EmpaToM-Youth (Breil et al., 2021), and the automated ToM measurement through machine learning and deep learning systems (Devine et al., 2023). ...
... The low index for RMET, a widely used measure in the ToM domain, is in line with previous studies (Olderbak et al., 2015; Hayward and Homer, 2017). ...
Article
Introduction: Several developmental changes occur in adolescence, particularly in the metarepresentational domain, which allows and promotes adaptive sociality. We explored the possible relationships between theory of mind (ToM) and definitional competence, both metarepresentational abilities, beyond age and gender effects. Methods: We recruited 75 adolescents (age range 14–19 years, M = 15.7, SD = 1.36). ToM was measured with the “Reading the Mind in the Eyes Test” (RMET), and definitional competence was assessed with a new instrument, the “Co.De. Scale”. We checked whether results differed between mental-state and non-mental-state items of the scale, and between emotional and non-emotional words. Results: T-tests showed that older adolescents (third grade of high school) performed better than younger ones (first grade of high school) in both tasks, except that in the male group there were no school-grade differences in the ToM task. Regression analyses showed that RMET performance predicted the score for definitions of non-emotional mental states and, marginally, for definitions of ToM words. However, RMET was not a predictor of general performance on the definitional task or of emotion definitions. Discussion: Connections with adolescents’ global development and possible educational implications are discussed.
... The sum of all scores indicated the child's overall performance on the TOM test [29]. Test-retest measurement was employed [30] to assess test reliability for children with HI [31]. The test reliability was 0.8 for the first question and 0.756 for the third question. ...
... A variety of tasks have been designed to measure different facets of mentalizing (Happé, 1994; Premack and Woodruff, 1978; Wimmer and Perner, 1983). Unfortunately, these measures exhibit poor convergent validity (performance in one task does not necessarily correlate with any other) and limited predictive validity, with task performance failing to consistently predict socioemotional functioning (Gernsbacher and Yergeau, 2019; Hayward and Homer, 2017). This limits the extent to which performance on a single task can be taken as evidence of ToM more generally, and underscores the need for running varied, tightly controlled experiments, each measuring distinct aspects of mentalizing. ...
Article
We address a growing debate about the extent to which large language models (LLMs) produce behavior consistent with Theory of Mind (ToM) in humans. We present EPITOME: a battery of six experiments that tap diverse ToM capacities, including belief attribution, emotional inference, and pragmatic reasoning. We elicit a performance baseline from human participants for each task. We use the dataset to ask whether distributional linguistic information learned by LLMs is sufficient to explain ToM in humans. We compare performance of five LLMs to a baseline of responses from human comprehenders. Results are mixed. LLMs display considerable sensitivity to mental states and match human performance in several tasks. Yet, they commit systematic errors in others, especially those requiring pragmatic reasoning on the basis of mental state information. Such uneven performance indicates that human-level ToM may require resources beyond distributional information.
Article
Adolescence is a developmental period characterized by significant changes and intensified social interactions. The role of parents decreases and the importance of peer groups increases. Peers, especially friends, may deliver instrumental aid and emotional support; they may also promote a sense of security and be a significant source of affection and intimacy. Additionally, peer relations provide a testing ground for exercising many competencies necessary in complex social situations, such as social problem-solving, conflict resolution, and negotiation. The intensified contact with peers may also enhance adolescents’ social understanding skills. Therefore, practicing social understanding skills within a peer group may enhance one’s social functioning in adolescence. For these practical and educational reasons, we aimed to confirm the effectiveness of conversation-based training in these skills and identify what factors potentially support or hinder its effectiveness. Social understanding, the ability to understand oneself and others in various social situations, develops in childhood and adolescence. As this ability impacts satisfactory social functioning in adolescence and develops in a social context, a training process was proposed with the aim of enhancing the development of this ability based on the social-constructivist approach to social understanding. The efficacy of the training to enhance the understanding of one’s own and others’ mental states was verified using a sample of 65 Polish adolescents (mean age: 14.6 years). They participated in nine one-hour sessions and were divided into an experimental group (social understanding, n = 26) and two control groups: attention/perception (n = 17) and film/text literacy (n = 22). Although no direct effect of the theory of mind training was found, the results provided important observations for further work on adolescent social understanding training programs.
Article
Age-related changes in flexibility and automaticity of reasoning about social situations were investigated. Children (N = 101; age range = 7;8-17;7) were presented with the flexibility and automaticity of social cognition (FASC), a new measure of social cognition in which cartoon vignettes of social situations are presented and participants explain what is happening and why. Scenarios vary on whether the scenario is socially ambiguous and whether or not language is used. Flexibility is determined by the number of unique, plausible explanations, and automaticity is indicated by speed of response. Overall, both flexibility and automaticity increased significantly with age. Language and social ambiguity influenced performance. Future work should investigate differences in FASC in older populations and clinical groups.
Article
This research aimed to evaluate the psychometric properties of the Theory of Mind Assessment Scale (Th.o.m.a.s.). Th.o.m.a.s. is a semi-structured interview meant to evaluate a person's Theory of Mind (ToM). It is composed of several questions organized in four scales, each focusing on one of the areas of knowledge in which this faculty may manifest itself: Scale A (I-Me) investigates first-order first-person ToM; Scale B (Other-Self) investigates third-person ToM from an allocentric perspective; Scale C (I-Other) again investigates third-person ToM, but from an egocentric perspective; and Scale D (Other-Me) investigates second-order ToM. The psychometric properties of Th.o.m.a.s. were evaluated in a sample of 156 healthy persons: 80 preadolescents and adolescents (aged 11–17 years, 42 females) and 76 adults (aged 20–67 years, 35 females). Th.o.m.a.s. scores show good inter-rater agreement and internal consistency; the scores increase with age. Evidence of criterion validity was found, as Scale B scores were correlated with those of an independent instrument for the evaluation of ToM, the Strange Stories task. Confirmatory factor analysis (CFA) showed good fit of the four-factor theoretical model to the data, although the four factors were highly correlated. For each of the four scales, Rasch analyses showed that, with few exceptions, items fitted the partial credit model and their functioning was invariant for gender and age. The results of this study, along with those of previous research with clinical samples, show that Th.o.m.a.s. is a promising instrument to assess ToM in different populations.
Article
The Reading the Mind in the Eyes Test is a popular measure of individual differences in Theory of Mind that is often applied in the assessment of particular clinical populations (primarily, individuals on the autism spectrum). However, little is known about the test's psychometric properties, including factor structure, internal consistency, and convergent validity evidence. We present a psychometric analysis of the test followed by an evaluation of other empirically proposed and statistically identified structures. We identified, and cross-validated in a second sample, an adequate short-form solution that is homogeneous with adequate internal consistency, and is moderately related to Cognitive Empathy, Emotion Perception, and strongly related to Vocabulary. We recommend the use of this short-form solution in normal adults as a more precise measure over the original version. Future revisions of the test should seek to reduce the test's reliance on one's vocabulary and evaluate the short-form structure in clinical populations.
Article
The "Reading the Mind in the Eyes" test (Eyes test) is an advanced test of theory of mind. A typical sex difference has been reported (i.e., a female advantage). Individuals with autism show more difficulty than do typically developing individuals, yet it remains unclear how this is modulated by sex, as females with autism have been under-represented. Here, in a large, non-male-biased sample, we test for the effects of sex, diagnosis, and their interaction. The Eyes test (revised version) was administered online to 395 adults with autism (178 males, 217 females) and 320 control adults (152 males, 168 females). Two-way ANOVA showed a significant sex-by-diagnosis interaction in total correct score (F(1,711) = 5.090, p = 0.024, ηp² = 0.007) arising from a significant sex difference between control males and females (p < 0.001, Cohen's d = 0.47), and an absence of a sex difference between males and females with autism (p = 0.907, d = 0.01); significant case-control differences were observed across sexes, with effect sizes of d = 0.35 in males and d = 0.69 in females. Group-difference patterns fit with the extreme-male-brain (EMB) theory predictions. Eyes test-Empathy Quotient and Eyes test-Autism Spectrum Quotient correlations were significant only in females with autism (r = 0.35, r = -0.32, respectively), but not in the other three groups. Support vector machine (SVM) classification based on response pattern across all 36 items classified autism diagnosis with a relatively higher accuracy for females (72.2%) than males (65.8%). Nevertheless, an SVM model trained within one sex generalized equally well when applied to the other sex. Performance on the Eyes test is a sex-independent phenotypic characteristic of adults with autism, reflecting sex-common social difficulties, and provides support for the EMB theory predictions for both males and females. Performance of females with autism differed from same-sex controls more than did that of males with autism. Females with autism also showed stronger coherence between self-reported dispositional traits and Eyes test performance than all other groups.
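The partial eta squared reported for the interaction can be recovered from the F statistic and its degrees of freedom using the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to the values given in the abstract. The function name `partial_eta_squared` is hypothetical, and the identity comes from general ANOVA practice rather than from the paper itself.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    if df_effect <= 0 or df_error <= 0:
        raise ValueError("degrees of freedom must be positive")
    effect_term = f_value * df_effect
    return effect_term / (effect_term + df_error)


# Values reported in the abstract: F(1, 711) = 5.090 for the sex-by-diagnosis interaction.
print(round(partial_eta_squared(5.090, 1, 711), 3))
```

The error degrees of freedom also agree with the reported sample: 395 + 320 = 715 participants in four sex-by-diagnosis cells gives 715 − 4 = 711.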
Article
Individuals with mild depression show an enhanced ability to read or “decode” others' mental states. The goal of the present study was to investigate whether this pattern of performance is related specifically to the pathology of depression or whether it is simply a feature of the transient dysphoric state. Forty-one undergraduates with a previous episode of major depression and 52 undergraduates with no depression history participated in a mental state decoding task following a sad versus happy mood induction. Previously depressed participants were significantly more accurate in their mental state judgements than were the never-depressed participants, suggesting that enhanced mental state decoding may be a specific feature of depression in remission. Furthermore, previously depressed participants whose positive mood increased in response to the happy mood induction showed a poorer level of performance on the task, similar to that observed in the never-depressed group. Thus, a happy mood may have induced a somewhat less accurate, but perhaps more adaptive, approach to processing social information. These findings were robust after controlling for current level of depression and anxiety symptoms, intensity of response to the mood induction, response times, and performance on a control task.
Book
This book presents complex concepts in a way that helps students to understand the logic underlying the creation, use, and evaluation of measurement instruments and to develop a more intuitive feel for how scales work. Robert DeVellis demystifies measurement by relating it to familiar experiences and by emphasizing a conceptual rather than a strictly mathematical understanding. Students' attention is drawn to important concepts that are foundational for subsequent topics, with opportunities provided to test understanding through chapter summaries and exercises.
Article
This is the first book to provide a comprehensive review of the burgeoning literature on theory of mind (TOM) after the preschool years and the first to integrate this literature with other approaches to the study of social understanding. By highlighting the relationship between early and later developments, the book provides readers with a greater understanding of what we know and what we still need to know about higher-order TOM. Although the focus is on development in typical populations, development in individuals with autism and in older adults is also explored to give readers a deeper understanding of possible problems in development. Examining the later developments of TOM gives readers a greater understanding of:
• Developments that occur after the age of 5.
• Individual differences in rate of development and atypical development and the effects of those differences.
• The differences in rate of mastery which become more marked, and therefore more informative, with increased age.
• What it means to have a "good theory of mind."
• The differences between first- and second-order theory of mind development in preschoolers, older children, adolescents, and adults.
• The range of beliefs available to children at various ages, providing a fuller picture of what is meant by "understanding of belief."
After the introduction, the literature on first-order developments during the preschool period is summarized to serve as a backdrop for understanding more advanced developments. Chapter 3 is devoted to the second-order false belief task. Chapters 4 and 5 introduce a variety of other measures for understanding higher-level forms of TOM thereby providing readers with greater insight into other cognitive and social developmental outcomes. Chapter 6 discusses the relation between children's TOM abilities and other aspects of their development. Chapters 7 and 8 place the work in a historical context. First, the research on the development of social and mental worlds that predated the emergence of TOM is examined. Chapter 8 then provides a comparative treatment of the two literatures and how they complement one another. Ideal as a supplement in graduate or advanced undergraduate courses in theory of mind, cognitive development, or social development taught in psychology and education. Veteran researchers will also appreciate this book's unique synthesis of this critical research.
Article
Recent years have seen a growth of research on the development of children's ability to reason about others' mental states (or "theory of mind") beyond the narrow confines of the preschool period. The overall aim of this study was to investigate the psychometric properties of a task battery composed of items from Happé's Strange Stories task and Devine and Hughes' Silent Film task. A sample of 460 ethnically and socially diverse children (211 boys) between 7 and 13 years of age completed the task battery at two time points separated by 1 month. The Strange Stories and Silent Film tasks were strongly correlated even when verbal ability and narrative comprehension were taken into account, and all items loaded onto a single theory-of-mind latent factor. The theory-of-mind latent factor provided reliable estimates of performance across a wide range of theory-of-mind ability and showed no evidence of differential item functioning across gender, ethnicity, or socioeconomic status. The theory-of-mind latent factor also exhibited strong 1-month test-retest reliability, and this stability did not vary as a function of child characteristics. Taken together, these findings provide evidence for the validity and reliability of the Strange Stories and Silent Film task battery as a measure of individual differences in theory of mind suitable for use across middle childhood. We consider the methodological and conceptual implications of these findings for research on theory of mind beyond the preschool years. Copyright © 2015 Elsevier Inc. All rights reserved.