Enhancing ASR-Based Educational Applications: Peer Evaluation of Non-native Child Speech

Simone Wills1, Cristian Tejedor-García1, Catia Cucchiarini1, Helmer Strik1
1Radboud University Nijmegen, The Netherlands
simone.wills@ru.nl, cristian.tejedorgarcia@ru.nl, catia.cucchiarini@ru.nl, helmer.strik@ru.nl
Abstract
With recent advancements in automatic speech recognition (ASR), ASR-based educational applications have become increasingly viable. This paper presents a preliminary investigation into whether peer evaluations of the speech produced by primary school-aged children during the use of these applications are reliable and valid. Twenty-one Dutch primary school children assessed non-native read speech in terms of intelligibility, accuracy, and reading performance. The children's judgements were compared to those made by adult Dutch speakers, as well as to the performance of the Whisper ASR system. The children proved to be reliable raters, with agreement levels on par with the adult group, and the findings indicate that primary school-aged children can provide peer evaluations of speech suitable for enhancing the feedback provided by ASR-based language learning applications.
Index Terms: speech recognition, peer evaluation, child speech, Dutch ASR, language-learning games, non-native speech
1. Introduction
ASR technology is ideally suited for educational applications, as it can be applied directly to the development of two fundamental skills targeted in child education: speaking and reading. Until recently, though, the use of ASR in educational applications for children has been significantly limited by the challenging speech within this domain, namely child speech [1] and, as a result of growing global migration, multilingual and non-native speech [2, 3]. These challenges are further compounded in low-resource language contexts.
While these challenges have not been fully resolved, state-of-the-art ASR systems such as Whisper [4] and wav2vec 2.0 [5] are getting closer to achieving a reasonable standard of performance [6]. As a result, there has been an increased focus on how ASR technology can be applied to educational applications [7]. Of particular interest is how ASR-based applications can best support and encourage the development of language skills in children. In light of this, the present paper offers a preliminary investigation into the suitability of including peer evaluations of speech in ASR-based language learning applications aimed at primary school-aged children.
Peer evaluation involves the evaluation, critique, or provision of feedback on the quality, value and/or accuracy of a peer's work. This can take a quantitative or a qualitative form (e.g. ratings or commentary) [8, 9, 10]. The benefits of peer evaluation are well documented [11], ranging from increasing student engagement and motivation to improving educational performance by facilitating reflective learning [12, 13, 14]. Additionally, peer feedback is argued to promote learning responsibility and critical thinking [11, 15].
The majority of these findings, though, emanate from studies carried out in higher education [11], with limited literature on the reliability and validity of peer evaluations in primary education. As highlighted by [16], providing effective feedback or assessment is a cognitively complex task which requires the evaluator to understand both the goals of the task and the criteria for success, and to judge the produced work in relation to these. This is a relevant consideration not only when implementing peer evaluation with primary school children, but also for the task of assessing speech itself. Assessing speech, and the different characteristics thereof, is an arguably complex task: given the richness of the speech signal, the evaluator has to contend with overlapping, and often interconnected, streams of information while attempting to isolate the target characteristic being evaluated.
The few studies which have looked at peer evaluation in
the context of language learning report positive effects. [13]
found that peer evaluation had a positive impact on Iranian EFL
learners’ reading comprehension, motivation, and vocabulary
learning. A study by [17] demonstrated a statistically significant
impact of peer evaluation on the development of speaking skills.
Since ASR-based language learning applications generate recordings of peers' speech samples, they create an opportunity for peer evaluation tasks to be carried out. If peer evaluation, particularly amongst primary school-aged children, proves to be reliable and informative, then the benefits of this form of feedback and assessment can be used to enhance ASR-based language learning applications. We conducted an investigation in which Dutch primary school children evaluated non-native Dutch read speech based on intelligibility, reading accuracy and reading performance. The children's judgements were compared to those of adult Dutch speakers and to the output of Whisper. Our main research questions were:
• RQ1. Do primary school-aged children produce reliable evaluations of speech?
• RQ2. How do the children's assessments compare to those of the adults?
• RQ3. How do the children's assessments correspond to evaluative information derived from ASR output?
The paper is structured as follows: in Section 2 the material, participants and study design are described, followed by the experimental results in Section 3. In Section 4 the findings are discussed in relation to the research questions. Section 5 concludes the paper with an overview of the work presented, the implications for future work, and the limitations of the study.
9th Workshop on Speech and Language Technology in Education (SLaTE), 18-20 August 2023, Dublin, Ireland. DOI: 10.21437/SLaTE.2023-4
2. Methodology
In order to collect peer evaluations of speech, a listening
test was implemented in the form of an online questionnaire
wherein children were asked to evaluate recordings of other
children reading. Responses were also collected from a small
group of adults to serve as a comparison point.
2.1. Participants
2.1.1. Children
Twenty-one children, ten male and eleven female, enrolled in a Dutch primary school in the Netherlands, completed the questionnaire. The children, aged 10 to 11 years, were native Dutch speakers, with five of the children having one additional home language. Participation was anonymous, and consent was obtained from parents through consent forms approved by the Ethics Committee of the Faculty of Arts of Radboud University.
2.1.2. Adults
Four native-Dutch adults, three female and one male, also completed the questionnaire. They were recruited from colleagues within the university department. All four adults had experience or expertise in language and linguistics. Consent was collected during the survey and responses were anonymised.
2.2. Questionnaire Design
A Qualtrics survey (https://www.qualtrics.com/) was used to collect participant responses. It comprised three tasks in which the children were presented with short recordings of read speech and asked, respectively, to rate intelligibility, rate reading accuracy, and identify words that were problematic for the speakers.
The survey was presented to the children as a fun listening exercise. We framed it in this way because the children were completing an unrelated reading task in the same session as the survey, and we wanted to reduce the chance of the children becoming self-conscious about their own reading skills.
2.2.1. Speech Data
The audio recordings used were sourced from the JASMIN corpus [18], a Dutch-Flemish corpus of read and extemporaneous speech collected from children, non-native speakers, and elderly people. Due to time constraints in the session in which the responses were collected, the questionnaire was designed to be relatively short. Eight sentences were selected, each produced by a different non-native speaker of Dutch, aged 11 to 17 years (Group 3 in the corpus). Second language (L2) learners were chosen as a reflection of the increasing diversity and multilingualism in society and schools. Furthermore, feedback on intelligibility is particularly beneficial for L2 learners.
The sentences were extracted from recordings of children reading texts intended for L2 Dutch learners. The sentences were selected based on age-appropriateness for the listeners, intelligibility, and the number and type of speech and/or reading errors, to produce a sample of sentences ranging in difficulty for each task. During selection, the reader's age, sex and reading level were also taken into account to try to create a balanced sample. The sentences ranged from six to twelve words, except for one of twenty-one words (S5).
2.2.2. Task 1: Intelligibility
In the first task, the children were presented with the eight audio clips and asked to rate, on a scale from 0 to 10, how intelligible the speaker in each recording was, where 0 represents that nothing is intelligible and 10 that everything is intelligible. In all tasks the children were able to replay the recordings.
2.2.3. Task 2: Reading Accuracy
The second task was rating reading accuracy. The children were presented with seven of the eight audio recordings, in a different order from Task 1. The text prompt being read in the recording was provided on screen with the audio clip. The children were asked to compare what they heard in the recording to the reading prompt and rate, on a scale from 0 to 10, how accurate the reader was, where 0 indicates that none of the words in the audio are the same as in the text, and 10 indicates that all of the words in the audio are the same as in the text.
2.2.4. Task 3: Identifying Problematic Words
The final task was expected to be more complex for the children, so only four sentences were presented. In this task the children were given the audio recording and the reading prompt for each sentence. Each word in the reading prompt could be highlighted on screen if the child clicked on it. The children were informed that the speakers they heard were practising their Dutch, and that they could help the speakers by identifying which words the speakers needed to practise. The children were asked to highlight the words in the reading prompt which the speaker read incorrectly or which sounded strange, in other words 'problematic' words.
2.3. Procedure
The children completed the questionnaire independently in their classroom at school, as part of a demonstration and testing session for an ASR-based language learning game. The children were divided into two groups, and while one group tested the language game, the other completed the questionnaire on tablets provided by the testers. The two groups then swapped over. The children did not use headphones when completing the questionnaire. A unique user ID was assigned to each child and entered into the questionnaire by the tester before the child began.
2.4. ASR
To incorporate ASR-based measures into the study, we used the state-of-the-art ASR system Whisper [4] to obtain transcriptions of the audio recordings, and the whisper-timestamped Python module extension (https://github.com/linto-ai/whisper-timestamped) to obtain word confidence scores. Whisper is a transformer-based model designed for sequence-to-sequence tasks. The "Whisper-large v2" model was trained in a fully supervised manner on approximately 680,000 hours of labelled audio data sourced from the Internet. Whisper has demonstrated exceptional performance in terms of Word Error Rate (WER) on many benchmark datasets for ASR, including TED-LIUM, LibriSpeech and Common Voice.
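The paper does not give implementation details, but the following minimal sketch illustrates how a recording could be transcribed with whisper-timestamped and how the two ASR-based measures reported later (WER and mean word confidence) could be derived. The model size string, audio path, example prompt, and the use of the jiwer package for WER are assumptions for illustration only.

# Sketch: transcribe one recording and derive WER and mean word confidence.
import whisper_timestamped as whisper
import jiwer  # one common WER implementation; the paper does not name its tool

model = whisper.load_model("large-v2")                    # "Whisper-large v2"
audio = whisper.load_audio("recordings/S1.wav")           # hypothetical path
result = whisper.transcribe(model, audio, language="nl")  # Dutch speech

# whisper-timestamped attaches a confidence in [0, 1] to every word;
# Table 1 reports the per-sentence mean as a percentage.
words = [w for segment in result["segments"] for w in segment["words"]]
mean_confidence = 100 * sum(w["confidence"] for w in words) / len(words)

# WER of the ASR transcript, here measured against the reading prompt
# (an assumed reference; punctuation handling is omitted for brevity).
prompt = "de kinderen spelen buiten in de tuin"  # made-up example sentence
wer = 100 * jiwer.wer(prompt, result["text"].lower().strip())

print(f"WER: {wer:.2f}  mean word confidence: {mean_confidence:.2f}")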
Table 1: Human-based mean scores for intelligibility [0,10] and accuracy [0,10], and ASR-based WER [0,100] and mean word confidence score [0,100]

Sentence ID   Intelligibility       Accuracy              WER     Word Confidence (Ave.)
              Child     Adult       Child     Adult
S1            8.91      8.50        4.86      3.0         0       52.80
S2            8.05      6.75        9.0       10.0        0       83.82
S3            5.57      3.25        4.81      5.25        50.0    71.27
S4            8.71      7.0         7.24      7.50        9.09    81.06
S5            7.57      6.5         8.19      7.25        4.55    80.76
S6            8.81      9.0         7.81      7.75        0       71.12
S7            7.19      8.25        7.05      8.0         10.0    74.35
S8            5.33      4.75        -         -           0       80.38
3. Results
This section presents our analysis of the children's judgements in terms of reliability and validity, compared to the adult participants and the ASR system. All values have been rounded to two decimal places.
Table 1 presents the children's mean scores for intelligibility and reading accuracy per sentence, in comparison to the mean scores of the adults. Additionally, two ASR-based measurements are provided as counterpoints to the human ratings: the WER and the mean word confidence score for each sentence. While word confidence scores are typically reported as values between 0 and 1, they have been included in the table as percentages.
3.1. Reliability
3.1.1. ICC2k Scores
To assess the reliability of the children's judgements the intraclass correlation coefficient (ICC) [19] was calculated. This measurement compares the variability of ratings for a single item to the total variation seen across all ratings of all items. The ICC value ranges from 0 to 1, with 1 indicating perfect reliability among raters. There are six different variations of the ICC described by [20]. The ICC model used in this paper is the two-way random effects model, measuring absolute agreement, estimated on the average of multiple raters/measurements (ICC2k). In brief, both raters and items are considered a source of random effects and all the items are rated by all the raters.
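For reference, in the notation of Shrout and Fleiss [20], with $n$ rated items and $k$ raters, this average-measures, absolute-agreement coefficient can be written as

\[ \mathrm{ICC2k} = \frac{\mathit{BMS} - \mathit{EMS}}{\mathit{BMS} + (\mathit{JMS} - \mathit{EMS})/n} \]

where $\mathit{BMS}$ is the between-items mean square, $\mathit{JMS}$ the between-raters mean square, and $\mathit{EMS}$ the residual mean square from a two-way ANOVA.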
The Python package Pingouin (https://pingouin-stats.org/build/html/index.html) was used to calculate the ICC2k scores.
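As an illustration, a minimal sketch of this computation with Pingouin is given below; the long-format DataFrame, its column names, and the toy values are assumptions for demonstration, not the study data.

# Sketch: ICC2k with Pingouin from long-format ratings
# (one row per rater-sentence pair; toy data only).
import pandas as pd
import pingouin as pg

ratings = pd.DataFrame({
    "rater":    ["c01"] * 4 + ["c02"] * 4 + ["c03"] * 4,
    "sentence": ["S1", "S2", "S3", "S4"] * 3,
    "score":    [9, 8, 6, 9,  8, 7, 5, 8,  9, 9, 6, 8],
})

icc = pg.intraclass_corr(data=ratings, targets="sentence",
                         raters="rater", ratings="score")
# Pingouin returns all six Shrout & Fleiss variants; keep ICC2k
# (two-way random effects, absolute agreement, average of k raters).
print(icc.loc[icc["Type"] == "ICC2k", ["Type", "ICC", "CI95%"]])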
The ICC2k scores for both the children and the adults are
provided in Table 2. The scores were calculated for both the
intelligibility and the reading accuracy ratings. Both groups
have high ICC scores, with the children receiving an average
ICC2k score across both tasks of 0.87, while the adult group
has a lower average value (0.81).
3.2. Agreement
3.2.1. Intra-group Agreement
Fleiss' Kappa has been used to measure the level of agreement between the participants within the two groups in their judgements made in Task 3.

Table 2: ICC2k scores for the children and the adults for their ratings of intelligibility and reading accuracy [0,1]

Task               Child   Adult
Intelligibility    0.86    0.81
Reading Accuracy   0.88    0.82
In calculating Fleiss' Kappa, Task 3 is treated as a categorisation task in which the participants have to categorise each word in the provided sentences as 'problematic' (for the speaker) or 'non-problematic'. When a participant highlights a word, it is interpreted as categorising the word as 'problematic'; non-highlighted words are categorised as 'non-problematic' by default. The kappa value is reported for the children and the adult group, per sentence, in Table 3. The interpretation of the level of agreement is based on the guidelines provided by [21]. Table 3 also includes the percentage of words in each sentence which were categorised as 'problematic'. The agreement between participants varies across sentences. The children achieve slight to moderate levels of agreement, whereas agreement ranges from fair to almost perfect for the adults. There is, though, moderate or higher agreement within both groups for the same two sentences, S1 and S4.
Table 3: Fleiss' Kappa [0,1] of the first four sentences for adults and children

Group   Sentence ID   Problematic Words (%)   Fleiss' Kappa   Agreement Level
Adult   S1            54.55                   0.91            Almost Perfect
Adult   S2            16.67                   0.27            Fair
Adult   S3            66.67                   0.20            Fair
Adult   S4            18.18                   0.77            Substantial
Child   S1            63.64                   0.49            Moderate
Child   S2            50.00                   0.16            Slight
Child   S3            83.33                   0.21            Fair
Child   S4            45.46                   0.54            Moderate
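A minimal sketch of how such a per-sentence kappa value could be computed is shown below; it uses the statsmodels implementation of Fleiss' Kappa, and the ratings matrix is an invented example, not study data.

# Sketch: Fleiss' Kappa for one sentence in Task 3, treating every word
# as an item that each rater categorises as problematic (1) or not (0).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = words in the sentence, columns = raters (toy: 6 words, 4 raters).
word_labels = np.array([
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
])

# aggregate_raters converts the words x raters labels into words x
# categories counts, the input format fleiss_kappa expects.
counts, _ = aggregate_raters(word_labels)
print(f"Fleiss' Kappa: {fleiss_kappa(counts, method='fleiss'):.2f}")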
3.2.2. Inter-group Agreement
To investigate the relationship between the children's and the adults' judgements, as well as that between the human judgements and the ASR measures, we calculated Spearman's correlation coefficient. The interpretation of these scores is based on [22].
The Child-Adult correlation (Table 4, second column) for the intelligibility and reading accuracy tasks was calculated on the mean z-scores for each sentence. For Task 3, identifying problematic words, the correlation was calculated on the percentage of the group which categorised the word as problematic, for every word in all sentences.
The Child-ASR and Adult-ASR correlations (Table 4, third and fourth columns, respectively) for the intelligibility and reading accuracy tasks were calculated by comparing the mean z-scores for each sentence to the mean word confidence for each sentence obtained from the ASR. For Task 3, the percentage of the group which categorised a word as problematic was compared to the ASR confidence score for that word, for every word in all sentences.
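A minimal sketch of these computations is given below; the per-sentence ASR confidences are the Table 1 values, while the human ratings are invented placeholders.

# Sketch: z-score normalisation per rater, then Spearman's correlation
# between group-level mean z-scores and the ASR word confidence.
import numpy as np
from scipy.stats import spearmanr, zscore

# Rows = raters, columns = the eight sentences (toy ratings, two raters shown).
child_ratings = np.array([[9, 8, 6, 9, 8, 9, 7, 5],
                          [8, 7, 5, 8, 7, 9, 8, 6]])
adult_ratings = np.array([[9, 7, 3, 7, 6, 9, 8, 5],
                          [8, 6, 4, 7, 7, 9, 9, 4]])

# Normalise each rater's scores to account for rater strictness [23],
# then average over raters to obtain one value per sentence.
child_mean_z = zscore(child_ratings, axis=1).mean(axis=0)
adult_mean_z = zscore(adult_ratings, axis=1).mean(axis=0)

rho, _ = spearmanr(child_mean_z, adult_mean_z)  # Child-Adult
print(f"Child-Adult rho: {rho:.2f}")

# Child-ASR: compare against the per-sentence mean word confidence (Table 1).
asr_confidence = [52.80, 83.82, 71.27, 81.06, 80.76, 71.12, 74.35, 80.38]
rho_asr, _ = spearmanr(child_mean_z, asr_confidence)
print(f"Child-ASR rho: {rho_asr:.2f}")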
Z-score normalisation was used to account for the difference in strictness between raters [23]. A strong correlation exists between the children's and the adults' judgements in Tasks 1 and 3, with a correlation above 0.8 for both. There is less agreement regarding reading accuracy, with a moderate correlation of 0.68; the correlation between the child group and the ASR measure for this task is similarly moderate (-0.67). The only other notable correlation between the human judgements and the ASR scores is for Task 3, with a moderate correlation for both the Child-ASR and Adult-ASR relationships.
Table 4: Spearman's correlation between the children's, adults' and ASR measurements for each task [-1,1]

Task                Child-Adult   Child-ASR   Adult-ASR
Intelligibility     0.81          0.04        -0.32
Reading Accuracy    0.68          -0.67       -0.16
Problematic Words   0.82          -0.45       -0.52
4. Discussion
In this paper we undertook a preliminary investigation into the reliability and validity of primary school-aged children's ratings of non-native speech intelligibility and reading accuracy, and of their ability to identify words with which readers had difficulty. This was carried out to simulate the kind of peer evaluation of speech which could be collected by an ASR-based language learning application. Our first two research questions, RQ1 and RQ2, pertain to whether the children produce reliable evaluations of the speech and how their judgements compare to those of adults. A number of our findings reported in Section 3 provide evidence that, despite the complexity of assessing speech, the judgements made by the children can be considered reliable.
Looking at Table 1, we can see that similar scores are given by the children and the adults for intelligibility and accuracy. It is noted that for S1, in which the reader had a strong accent, the average word confidence is also lower than that of the other sentences. This low average word confidence score is reflected in the low accuracy scores assigned by the adults and children.
The children also demonstrate high inter-rater reliability when performing Tasks 1 and 2, comparable to that of the adult participants. The strong correlation between the judgements provided by the children and the adults across the three tasks, as indicated in Table 4, further validates the children's judgements and their reliability.
With regard to Task 3, more variation is observed across the various evaluation metrics. While the children's judgements agree to a large extent with the adults' on intelligibility and accuracy, this does not seem to be the case for identifying problematic words. Task 3 shows mixed results, evident in the variation in both the percentage of problematic words per sentence and the Fleiss' Kappa reported in Table 3, between the sentences and between the two participant groups. This is interesting and requires further research. However, it can be noted that Task 3 is a different type of task from Tasks 1 and 2, and arguably a more complex one, with a greater number of choices to be made per sentence resulting in a greater potential for variation.
Turning to RQ3, the relationship between the human judgements and the ASR output is also an area for future research and could benefit from a more in-depth exploration. In terms of WER, most sentences have a low WER, except for S3, for which the WER is much higher (50%). This high WER is reflected in lower scores from both children and adults, although for S1, which was assigned similarly low accuracy scores, the WER is 0. The correlations between the human and ASR measurements in Table 4 also show mixed results. The Child-ASR and Adult-ASR relationships show a similar correlation for identifying problematic words, with moderate correlations of -0.45 and -0.52, respectively. For intelligibility, both relationships present low scores, with a weak correlation for the adults and a very weak one (0.04) for the children. Greater differences are shown for reading accuracy, where Child-ASR has a moderate correlation (-0.67), while Adult-ASR has a very weak correlation of -0.16. A possible reason for the low correlation for accuracy is that factors which influence human ratings, such as word insertions, deletions, or substitutions by the reader, would not be expected to influence the ASR results. Similarly, factors such as a strong accent, hesitations and filled pauses, which affect the ASR WER, would not necessarily affect the accuracy of the text being read (the human judgements).
5. Conclusion
For the evaluations of intelligibility and reading accuracy on a scale, the children produce reliable scores, with good intra-group agreement and good inter-group agreement with the adult group. For the task of identifying problematic words based on a speaker's reading, there is lower agreement between the children and adults, with larger variation in the scores. The children and adults show similar results when compared to the performance measurements based on Whisper speech recognition for intelligibility and for identifying problematic words: the correlation is low when scoring intelligibility, and moderate for the identification of problematic words. However, divergent trends are seen for reading accuracy, with the children's judgements correlating moderately with the ASR measurement while the correlation with the adult judgements is very weak. The main limitation of the study lies in the small sample size of participants and test material. Nevertheless, the strong reliability and evidence of agreement between children and adults indicate that peer evaluation of speech by primary school-aged children is a promising area for future research, particularly within the context of ASR-based language learning applications.
6. Acknowledgements
This study forms part of the ST.CART project funded by the European Regional Development Fund (ERDF). Special thanks to all participants, the children's parents, teachers and schools. Thanks are also owed to Giel Hekkert and 8D Games BV.
7. References
[1] M. Russell and S. D'Arcy, "Challenges for computer recognition of children's speech," in Workshop on Speech and Language Technology in Education, 2007.
[2] S. Park and J. Culnan, "A comparison between native and non-native speech for automatic speech recognition," The Journal of the Acoustical Society of America, vol. 145, no. 3, pp. 1827–1827, 2019.
[3] M. Matassoni, R. Gretter, D. Falavigna, and D. Giuliani, "Non-native children speech recognition through transfer learning," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 6229–6233.
[4] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," arXiv preprint arXiv:2212.04356, 2022.
[5] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.
[6] J. C. Vásquez-Correa and A. Álvarez Muniain, "Novel speech recognition systems applied to forensics within child exploitation: Wav2vec2.0 vs. Whisper," Sensors, vol. 23, no. 4, p. 1843, 2023.
[7] M. Eskenazi, "An overview of spoken language technology for education," Speech Communication, vol. 51, no. 10, pp. 832–844, 2009.
[8] N. Falchikov, "Peer feedback marking: Developing peer assessment," Innovations in Education and Training International, vol. 32, no. 2, pp. 175–187, 1995.
[9] K. Topping and S. Ehly, Peer-Assisted Learning. Routledge, 1998.
[10] L. Harris, "Employing formative assessment in the classroom," Improving Schools, vol. 10, no. 3, pp. 249–260, 2007.
[11] K. Topping, "Peer assessment: Learning by judging and discussing the work of other learners," Interdisciplinary Education and Psychology, vol. 1, no. 1, pp. 1–17, 2017.
[12] S. Esfandiari and K. Tavassoli, "The comparative effect of self-assessment vs. peer-assessment on young EFL learners' performance on selective and productive reading tasks," Iranian Journal of Applied Linguistics (IJAL), vol. 22, no. 2, pp. 1–35, 2019.
[13] M. Ritonga, K. Tazik, A. Omar, and E. Saberi Dehkordi, "Assessment and language improvement: the effect of peer assessment (PA) on reading comprehension, reading motivation, and vocabulary learning among EFL learners," Language Testing in Asia, vol. 12, no. 1, p. 36, 2022.
[14] M. Tunagür, "The effect of peer assessment application on writing anxiety and writing motivation of 6th grade students," Shanlax International Journal of Education, vol. 10, no. 1, pp. 96–105, 2021.
[15] D. Sluijsmans, F. Dochy, and G. Moerkerke, "Creating a learning environment by using self-, peer- and co-assessment," Learning Environments Research, vol. 1, pp. 293–319, 1998.
[16] K. J. Topping, "Peer assessment," Theory Into Practice, vol. 48, no. 1, pp. 20–27, 2009.
[17] M. Homayouni, "Peer assessment in group-oriented classroom contexts: on the effectiveness of peer assessment coupled with scaffolding and group work on speaking skills and vocabulary learning," Language Testing in Asia, vol. 12, no. 1, p. 61, 2022.
[18] C. Cucchiarini, J. Driesen, H. V. Hamme, and E. Sanders, "Recording speech of children, non-natives and elderly people for HLT applications: the JASMIN-CGN corpus," in Proceedings of LREC 2008, Marrakech, Morocco, 2008.
[19] J. J. Bartko, "The intraclass correlation coefficient as a measure of reliability," Psychological Reports, vol. 19, no. 1, pp. 3–11, 1966.
[20] P. E. Shrout and J. L. Fleiss, "Intraclass correlations: uses in assessing rater reliability," Psychological Bulletin, vol. 86, no. 2, p. 420, 1979.
[21] J. R. Landis and G. G. Koch, "An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers," Biometrics, pp. 363–374, 1977.
[22] L. Cohen, P. Jarvis, and J. Fowler, Practical Statistics for Field Biology. John Wiley & Sons, 2013.
[23] C. Cucchiarini, H. Strik, and L. Boves, "Quantitative assessment of second language learners' fluency by means of automatic speech recognition technology," The Journal of the Acoustical Society of America, vol. 107, no. 2, pp. 989–999, 2000.