Enhancing ASR-Based Educational Applications: Peer Evaluation of Non-native Child Speech

Simone Wills, Cristian Tejedor-García, Catia Cucchiarini, Helmer Strik
Radboud University Nijmegen, The Netherlands
simone.wills@ru.nl, cristian.tejedorgarcia@ru.nl, catia.cucchiarini@ru.nl, helmer.strik@ru.nl
Abstract
With recent advancements in automatic speech recognition (ASR), ASR-based educational applications have become increasingly viable. This paper presents a preliminary investigation into whether peer evaluations of the speech produced by primary school-aged children during the use of these applications are reliable and valid. Twenty-one Dutch primary school children assessed non-native read speech in terms of intelligibility, accuracy, and reading performance. The children's judgements were compared to those made by adult Dutch speakers, as well as to the performance of the Whisper ASR system. The children proved to be reliable raters, with agreement levels on par with the adult group. The findings indicate that primary school-aged children can provide peer evaluations of speech suitable for enhancing the feedback provided by ASR-based language learning applications.
Index Terms: speech recognition, peer evaluation, child speech, Dutch ASR, language-learning games, non-native speech
1. Introduction
ASR technology is ideally suited for educational applications, as it can be applied directly to the development of two fundamental skills targeted in child education: speaking and reading. Until recently, though, the use of ASR in educational applications for children has been significantly limited by the challenging speech within this domain, namely child speech [1] and, as a result of growing global migration, multilingual and non-native speech [2, 3]. These challenges are further compounded in low-resource language contexts.
While these challenges have not been fully resolved, state-of-the-art ASR systems such as Whisper [4] and wav2vec 2.0 [5] are bringing us closer to a reasonable standard of performance [6]. As a result, there has been an increased focus on how ASR technology can be applied to educational applications [7]. Of particular interest is how ASR-based applications can best support and encourage the development of language skills in children. In light of this, this paper presents a preliminary investigation into the suitability of including peer evaluations of speech in ASR-based language learning applications aimed at primary school-aged children.
Peer evaluation involves the evaluation, critique, or provision of feedback on the quality, value and/or accuracy of a peer's work. This can take a quantitative or qualitative form (e.g. ratings or commentary) [8, 9, 10]. The benefits of peer evaluation are well documented [11], ranging from increasing student engagement and motivation to improving educational performance by facilitating reflective learning [12, 13, 14]. Additionally, peer feedback is argued to promote learning responsibility and critical thinking [11, 15].
The majority of these findings, though, emanate from studies carried out in higher education [11], with limited literature on the reliability and validity of peer evaluations in primary education. As highlighted by [16], providing effective feedback or assessment is a cognitively complex task which requires the evaluator to understand both the goals of the task and the criteria for success, and to judge the produced work in relation to these. This is a relevant consideration not only when implementing peer evaluation with primary school children, but also for the task of assessing speech itself. Assessing speech, and its different characteristics, is an arguably complex task. Given the richness of the speech signal, the evaluator has to contend with overlapping, and often interconnected, streams of information while attempting to isolate the target characteristic being evaluated.
The few studies which have looked at peer evaluation in the context of language learning report positive effects. [13] found that peer evaluation had a positive impact on Iranian EFL learners' reading comprehension, motivation, and vocabulary learning. A study by [17] demonstrated a statistically significant impact of peer evaluation on the development of speaking skills.
Because ASR-based language learning applications generate recordings of peers' speech samples, they create an opportunity for peer evaluation tasks to be carried out. If peer evaluation, particularly amongst primary school-aged children, proves to be reliable and informative, then the benefits of this form of feedback and assessment can be used to enhance ASR-based language learning applications. We conducted an investigation in which Dutch primary school children evaluated non-native Dutch read speech based on intelligibility, reading accuracy and reading performance. The children's judgements were compared to those of adult Dutch speakers and to the output of Whisper. Our main research questions were:
RQ1. Do primary school-aged children produce reliable evaluations of speech?
RQ2. How do the children's assessments compare to those of the adults?
RQ3. How do the children's assessments correspond to evaluative information derived from ASR output?
The paper is structured as follows: in Section 2 the material, participants and study design are described, followed by the experimental results in Section 3. In Section 4 the findings are discussed in relation to the research questions. Section 5 concludes the paper with an overview of the work presented, the implications for future work, and the limitations of the study.
2. Methodology
In order to collect peer evaluations of speech, a listening test was implemented in the form of an online questionnaire wherein children were asked to evaluate recordings of other children reading. Responses were also collected from a small group of adults to serve as a comparison point.
2.1. Participants
2.1.1. Children
Twenty-one children, ten male and eleven female, enrolled in a Dutch primary school in the Netherlands, completed the questionnaire. The children, aged 10 to 11 years, are native Dutch speakers, with five of the children having one additional home language. Participation was anonymous, and consent was obtained from parents through consent forms approved by the Ethics Committee of the Faculty of Arts of Radboud University.
2.1.2. Adults
Four native Dutch adults, three female and one male, also completed the questionnaire. They were sourced from colleagues within the university department. All four adults had experience or expertise in language and linguistics. Consent was collected during the survey and responses were anonymised.
2.2. Questionnaire Design
A Qualtrics survey (https://www.qualtrics.com/) was used to collect participant responses. It comprised three tasks in which the children were presented with short recordings of read speech and asked, respectively, to assess the intelligibility, to rate the reading accuracy, and to identify problematic words for the speakers.
The survey was presented to the children as a fun listening exercise. We framed it this way because the children were completing an unrelated reading task in the same session as the survey, and we wanted to reduce the chance of the children becoming self-conscious about their own reading skills.
2.2.1. Speech Data
The audio recordings used were sourced from the JASMIN corpus [18], a Dutch-Flemish corpus of read and extemporaneous speech collected from children, non-native speakers, and elderly people. Due to the time constraints of the session in which the responses were collected, the questionnaire was designed to be relatively short. Eight sentences were selected, each produced by a different non-native speaker of Dutch aged between 11 and 17 years (Group 3 in the corpus). Second language (L2) learners were chosen as a reflection of the increasing diversity and multilingualism in society and schools. Furthermore, feedback on intelligibility is particularly beneficial for L2 learners.
The sentences were extracted from recordings of children reading texts intended for L2 Dutch learners. The sentences were selected based on age-appropriateness for the listeners, intelligibility, and the number and type of speech and/or reading errors, so as to produce a sample of sentences ranging in difficulty for each task. During selection, the reader's age, sex and reading level were also taken into account to try to create a balanced sample. The sentences ranged from six to twelve words, except for one of twenty-one words (S5).
2.2.2. Task 1: Intelligibility
In the first task, the children were presented with the eight audio clips and asked to rate, on a scale from 0 to 10, how intelligible the speaker in each recording was, with 0 representing that nothing was intelligible and 10 that everything was intelligible. In all tasks the children were able to replay the recordings.
2.2.3. Task 2: Reading Accuracy
The second task was rating reading accuracy. The children were presented with seven of the eight audio recordings, in a different order from task 1. The text prompt being read in the recording was provided on screen alongside the audio clip. The children were asked to compare what they heard in the recording to the reading prompt and rate, on a scale from 0 to 10, how accurate the reader was, with 0 indicating that none of the words in the audio matched the text and 10 indicating that all of the words in the audio matched the text.
2.2.4. Task 3: Identifying Problematic Words
The final task was expected to be more complex for the children, so only four sentences were presented. In this task the children were given the audio recording and the reading prompt for each sentence. Each word in the reading prompt could be highlighted on screen by clicking on it. The children were informed that the speakers they heard were practising their Dutch, and that they could help the speakers by identifying which words the speakers needed to practise. The children were asked to highlight the words in the reading prompt which the speaker read incorrectly or which sounded strange, in other words 'problematic' words.
2.3. Procedure
The children completed the questionnaire independently in their classroom at school, as part of a demonstration and testing of an ASR-based language learning game. The children were divided into two groups and, while one group tested the language game, the other completed the questionnaire on tablets provided by the testers. The two groups then swapped over. The children did not use headphones when completing the questionnaire. A unique user ID was assigned to each child and entered into the questionnaire by the tester before the child began.
2.4. ASR
To incorporate ASR-based measures into the study, we used the state-of-the-art ASR system Whisper [4] to obtain transcriptions of the audio recordings, and the whisper-timestamped Python module extension (https://github.com/linto-ai/whisper-timestamped) to obtain word confidence scores. Whisper is a transformer-based model designed for sequence-to-sequence tasks. The "Whisper-large v2" model was trained in a fully supervised manner on approximately 680,000 hours of labelled audio data sourced from the Internet. Whisper has demonstrated exceptional performance in terms of Word Error Rate (WER) on many benchmark datasets for ASR, including TED-LIUM, LibriSpeech and Common Voice.
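As a minimal sketch of this setup, the snippet below shows how a transcription and word-level confidence scores can be obtained with whisper-timestamped; the audio path is a placeholder, and the code assumes the module's documented interface rather than reproducing the exact pipeline used here.

```python
import whisper_timestamped as whisper

# Load the model used in the study ("Whisper-large v2").
model = whisper.load_model("large-v2")

# Transcribe one recording; the path is a placeholder and the language
# is fixed to Dutch, since all stimuli are Dutch read speech.
audio = whisper.load_audio("recordings/S1.wav")
result = whisper.transcribe(model, audio, language="nl")

# whisper-timestamped attaches a confidence in [0, 1] to every word;
# Table 1 reports the per-sentence mean of these values as a percentage.
words = [w for segment in result["segments"] for w in segment["words"]]
transcript = " ".join(w["text"] for w in words)
mean_confidence = 100 * sum(w["confidence"] for w in words) / len(words)

print(transcript)
print(f"Mean word confidence: {mean_confidence:.2f}")
```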
Table 1: Human-based mean scores for intelligibility [0,10] and accuracy [0,10], and ASR-based WER [0,100] and mean word confidence score [0,100]

                          Human                              ASR
Sentence ID   Intelligibility      Accuracy           WER     Word Confidence (Ave.)
              Child    Adult       Child    Adult
S1            8.91     8.50        4.86     3.00       0.00    52.80
S2            8.05     6.75        9.00    10.00       0.00    83.82
S3            5.57     3.25        4.81     5.25      50.00    71.27
S4            8.71     7.00        7.24     7.50       9.09    81.06
S5            7.57     6.50        8.19     7.25       4.55    80.76
S6            8.81     9.00        7.81     7.75       0.00    71.12
S7            7.19     8.25        7.05     8.00      10.00    74.35
S8            5.33     4.75        -        -          0.00    80.38
3. Results
This section presents our analysis of the children's judgements in terms of reliability and validity, as compared to the adult participants and the ASR system. All values have been rounded to two decimal places.
Table 1 presents the children's mean scores for intelligibility and reading accuracy per sentence, in comparison to the mean scores of the adults. Additionally, two ASR-based measurements are provided as counterpoints to the human ratings: the WER and the mean word confidence score for each sentence. While word confidence scores are typically reported as values between 0 and 1, they have been included in the table as percentages.
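The sketch below illustrates how the two ASR-based measures map onto the reported scales. The jiwer package and the use of the reading prompt as the WER reference are assumptions made for illustration (the paper does not name the tool used), and the sentence and confidence values are invented.

```python
import jiwer

# Hypothetical example: a Dutch reading prompt and an ASR transcript.
prompt = "de jongen leest een boek over dieren"
hypothesis = "de jongen lees een boek over dieren"

# WER scaled to [0, 100], as reported in Table 1.
wer_percent = 100 * jiwer.wer(prompt, hypothesis)

# Word confidences from the ASR lie in [0, 1]; Table 1 reports their
# per-sentence mean as a percentage (illustrative values here).
confidences = [0.93, 0.88, 0.41, 0.95, 0.90, 0.87, 0.79]
mean_confidence_percent = 100 * sum(confidences) / len(confidences)

print(f"WER: {wer_percent:.2f}")
print(f"Mean word confidence: {mean_confidence_percent:.2f}")
```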
3.1. Reliability
3.1.1. ICC2k Scores
To assess the reliability of the children's judgements, the intraclass correlation coefficient (ICC) [19] was calculated. This measurement compares the variability of ratings for a single item to the total variation seen across all ratings of all items. The ICC value ranges from 0 to 1, with 1 indicating perfect reliability among raters. There are six different variations of the ICC described by [20]. The ICC model used in this paper is the two-way random effects model, measuring absolute agreement estimated on the average of multiple raters/measurements (ICC2k). In brief, both raters and items are considered a source of random effects, and all the items are rated by all the raters. The Python package Pingouin (https://pingouin-stats.org/) was used to calculate the ICC2k scores.
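A minimal sketch of this computation with Pingouin, on invented ratings in long format, might look as follows; the column names and values are illustrative only.

```python
import pandas as pd
import pingouin as pg

# Ratings in long format: one row per (rater, sentence) pair.
df = pd.DataFrame({
    "rater":    ["c01", "c01", "c01", "c02", "c02", "c02", "c03", "c03", "c03"],
    "sentence": ["S1",  "S2",  "S3",  "S1",  "S2",  "S3",  "S1",  "S2",  "S3"],
    "score":    [9,     8,     5,     8,     7,     6,     9,     8,     5],
})

icc = pg.intraclass_corr(data=df, targets="sentence",
                         raters="rater", ratings="score")

# Select the two-way random effects, absolute agreement, average-of-k
# raters variant (ICC2k) from the six variants Pingouin returns.
print(icc.loc[icc["Type"] == "ICC2k", ["Type", "ICC", "CI95%"]])
```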
The ICC2k scores for both the children and the adults are provided in Table 2. The scores were calculated for both the intelligibility and the reading accuracy ratings. Both groups have high ICC scores, with the children achieving an average ICC2k score of 0.87 across both tasks, while the adult group has a lower average value (0.81).

Table 2: ICC2k scores for the children and the adults for their ratings of intelligibility and reading accuracy [0,1]

Task                Child    Adult
Intelligibility     0.86     0.81
Reading Accuracy    0.88     0.82

3.2. Agreement
3.2.1. Intra-group Agreement
Fleiss' Kappa was used to measure the level of agreement between the participants within each of the two groups for the judgements made in task 3. In calculating Fleiss' kappa, task 3 is treated as a categorisation task in which the participants have to categorise each word in the provided sentences as either 'problematic' (for the speaker) or 'non-problematic'. When a participant highlights a word, it is interpreted as categorising the word as 'problematic'; non-highlighted words are categorised as 'non-problematic' by default. The kappa values are reported per sentence for the children and the adult group in Table 3. The interpretation of the level of agreement is based on the guidelines provided by [21]. Table 3 also includes the percentage of words in each sentence which were categorised as 'problematic'. The agreement between participants varies across sentences. The children achieve slight to moderate levels of agreement, whereas this ranges from fair to almost perfect agreement for the adults. There is, though, moderate or higher agreement within both groups for the same two sentences, S1 and S4.
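A sketch of this kappa computation is given below; statsmodels is an assumption made for illustration (the paper does not name the tool used), and the judgement matrix is invented.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Binary judgements for one sentence: rows are words, columns are raters;
# 1 = highlighted ('problematic'), 0 = not highlighted (invented data).
ratings = np.array([
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])

# Convert rater labels to per-word category counts, then compute kappa.
table, _ = aggregate_raters(ratings)
print(f"Fleiss' kappa: {fleiss_kappa(table, method='fleiss'):.2f}")
```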
Table 3: Fleiss' Kappa [0,1] of the first four sentences for adults and children

Group   Sentence ID   Problematic Words (%)   Fleiss' Kappa   Agreement Level
Adult   S1            54.55                   0.91            Almost Perfect
        S2            16.67                   0.27            Fair
        S3            66.67                   0.20            Fair
        S4            18.18                   0.77            Substantial
Child   S1            63.64                   0.49            Moderate
        S2            50.00                   0.16            Slight
        S3            83.33                   0.21            Fair
        S4            45.46                   0.54            Moderate
3.2.2. Inter-group Agreement
To investigate the relationship between the children's and adults' judgements, as well as that between the human judgements and the ASR measures, we calculated Spearman's correlation coefficient. The interpretation of these scores is based on [22].
The Child-Adult correlations (Table 4, second column) for the intelligibility and reading accuracy tasks were calculated on the mean z-scores for each sentence. For task 3, identifying problematic words, the correlation was calculated on the percentage of the group which categorised the word as problematic, for every word in all sentences.
The Child-ASR and Adult-ASR correlations (Table 4, third and fourth columns, respectively) for the intelligibility and reading accuracy tasks were calculated by comparing the mean z-scores for each sentence to the mean word confidence for each sentence obtained from the ASR. For task 3, the percentage of the group which categorised a word as problematic was compared to the ASR confidence score for that word, for every word in all sentences.
Z-score normalisation was used to account for differences in strictness between raters [23]. A strong correlation exists between the children's and adults' judgements in tasks 1 and 3, with a correlation above 0.8 for both. There is less agreement regarding reading accuracy, with a moderate correlation of 0.68. A correlation of similar magnitude (-0.67) holds between the child group and the ASR measure. The only other notable correlation between the human judgements and the ASR scores is for task 3, with a moderate correlation for both the Child-ASR and Adult-ASR relationships.
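A minimal sketch of how the inter-group correlations in Table 4 can be computed is given below; the array shapes mirror the participant groups (21 children, 4 adults, 8 sentences), but the ratings are random values used purely for illustration.

```python
import numpy as np
from scipy.stats import spearmanr, zscore

rng = np.random.default_rng(0)

# Invented ratings: rows are raters, columns are the eight sentences.
child_ratings = rng.uniform(4, 10, size=(21, 8))
adult_ratings = rng.uniform(3, 10, size=(4, 8))

# Z-score each rater's scores to remove differences in strictness [23],
# then average over raters to get one mean z-score per sentence.
child_means = zscore(child_ratings, axis=1).mean(axis=0)
adult_means = zscore(adult_ratings, axis=1).mean(axis=0)

# Spearman's rho between the two groups' per-sentence mean z-scores.
rho, pval = spearmanr(child_means, adult_means)
print(f"Spearman's rho: {rho:.2f} (p = {pval:.3f})")
```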
Table 4: Spearman's correlation between the children's, adults' and ASR measurements for each task [-1,1]

Task                Child-Adult   Child-ASR   Adult-ASR
Intelligibility     0.81           0.04       -0.32
Reading Accuracy    0.68          -0.67       -0.16
Problematic Words   0.82          -0.45       -0.52
4. Discussion
In this paper we undertook a preliminary investigation into the reliability and validity of primary school-aged children's ratings of non-native speech intelligibility and reading accuracy, and their ability to identify words with which readers had difficulty. This was carried out to simulate the task of peer evaluation of speech which could be collected by an ASR-based language learning application. Our first two research questions, RQ1 and RQ2, pertain to whether the children produce reliable evaluations of the speech and how their judgements compare to those of adults. A number of the findings reported in Section 3 provide evidence that, despite the complexity of the task of assessing speech, the judgements made by the children can be considered reliable. Looking at Table 1, we can see that similar scores are given by the children and the adults for intelligibility and accuracy. It is noted that for S1, in which the reader had a strong accent, the average word confidence is also lower than that of the other sentences. This low average word confidence score is reflected in the low accuracy scores assigned by the adults and children.
The children also demonstrate high inter-rater reliability when performing tasks 1 and 2, comparable to that of the adult participants. The strong correlation between the judgements provided by the children and the adults across the three tasks, as indicated in Table 4, further validates the children's judgements and their reliability.
With regard to task 3, more variation is observed across the various evaluation metrics. While the children's judgements agree to a large extent with the adults' on intelligibility and accuracy, this does not seem to be the case for identifying problematic words. Task 3 shows mixed results, evident in the variation in both the percentage of problematic words per sentence and the Fleiss' Kappa reported in Table 3, between the sentences and between the two participant groups. This is interesting and requires further research. It can be noted, though, that task 3 is a different type of task from tasks 1 and 2, and arguably a more complex one, with a greater number of choices to be made per sentence and therefore a greater potential for variation.
Turning to RQ3, the relationship between the human judgements and the ASR output is also an area for future research and could benefit from a more in-depth exploration. For most sentences the WER is low, except for S3, for which the WER is much higher (50%). This high WER is reflected in lower scores from both children and adults. For S1, however, which was assigned similarly low accuracy scores, the WER is 0. The correlations between the human and ASR measurements also show mixed results in Table 4. The Child-ASR and Adult-ASR relationships show a similar correlation for identifying problematic words, with intermediate correlations of -0.45 and -0.52, respectively. For intelligibility, both relationships show low scores, although the correlation is weak for the adults and very weak (0.04) for the children. Greater differences appear for reading accuracy, where Child-ASR has a moderate correlation (-0.67) while Adult-ASR has a very weak correlation of -0.16. A possible reason for the low correlation for accuracy is that factors which influence human ratings, such as word insertions, deletions, or substitutions by the reader, would not be expected to influence the ASR results. Similarly, factors such as a strong accent, hesitations and filled pauses, which affect the ASR WER, would not necessarily affect the accuracy of the text being read (the human judgements).
5. Conclusion
For the evaluations of intelligibility and reading accuracy on a scale, the children produce reliable scores, with good intra-group agreement and good inter-group agreement with the adult group. For the task of identifying problematic words based on a speaker's reading, there is lower agreement between the children and adults, with larger variation in the scores. The children and adults show similar results when compared to the performance measurements based on Whisper speech recognition for intelligibility and for identifying problematic words: the correlation is low when scoring intelligibility and intermediate for the identification of problematic words. However, divergent trends are seen in reading accuracy, with the children's judgements correlating moderately with the ASR measurement while the correlation with the adult judgements is very weak. The main limitation of the study lies in the small sample size of participants and test material. Nevertheless, the strong reliability and evidence of agreement between children and adults indicate that peer evaluation of speech by primary school-aged children is a promising area for future research, particularly within the context of ASR-based language learning applications.
6. Acknowledgements
This study forms part of the ST.CART project funded by the European Regional Development Fund (ERDF). Special thanks to all participants, the children's parents, teachers and schools. Thanks are also owed to Giel Hekkert and 8D Games BV.
7. References
[1] M. Russell and S. D'Arcy, "Challenges for computer recognition of children's speech," in Workshop on Speech and Language Technology in Education, 2007.
[2] S. Park and J. Culnan, "A comparison between native and non-native speech for automatic speech recognition," The Journal of the Acoustical Society of America, vol. 145, no. 3, pp. 1827–1827, 2019.
[3] M. Matassoni, R. Gretter, D. Falavigna, and D. Giuliani, "Non-native children speech recognition through transfer learning," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 6229–6233.
[4] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," arXiv preprint arXiv:2212.04356, 2022.
[5] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.
[6] J. C. Vásquez-Correa and A. Álvarez Muniain, "Novel speech recognition systems applied to forensics within child exploitation: Wav2vec2.0 vs. Whisper," Sensors, vol. 23, no. 4, p. 1843, 2023.
[7] M. Eskenazi, "An overview of spoken language technology for education," Speech Communication, vol. 51, no. 10, pp. 832–844, 2009.
[8] N. Falchikov, "Peer feedback marking: Developing peer assessment," Innovations in Education and Training International, vol. 32, no. 2, pp. 175–187, 1995.
[9] K. Topping and S. Ehly, Peer-Assisted Learning. Routledge, 1998.
[10] L. Harris, "Employing formative assessment in the classroom," Improving Schools, vol. 10, no. 3, pp. 249–260, 2007.
[11] K. Topping, "Peer assessment: Learning by judging and discussing the work of other learners," Interdisciplinary Education and Psychology, vol. 1, no. 1, pp. 1–17, 2017.
[12] S. Esfandiari and K. Tavassoli, "The comparative effect of self-assessment vs. peer-assessment on young EFL learners' performance on selective and productive reading tasks," Iranian Journal of Applied Linguistics (IJAL), vol. 22, no. 2, pp. 1–35, 2019.
[13] M. Ritonga, K. Tazik, A. Omar, and E. Saberi Dehkordi, "Assessment and language improvement: The effect of peer assessment (PA) on reading comprehension, reading motivation, and vocabulary learning among EFL learners," Language Testing in Asia, vol. 12, no. 1, p. 36, 2022.
[14] M. Tunagür, "The effect of peer assessment application on writing anxiety and writing motivation of 6th grade students," Shanlax International Journal of Education, vol. 10, no. 1, pp. 96–105, 2021.
[15] D. Sluijsmans, F. Dochy, and G. Moerkerke, "Creating a learning environment by using self-, peer- and co-assessment," Learning Environments Research, vol. 1, pp. 293–319, 1998.
[16] K. J. Topping, "Peer assessment," Theory Into Practice, vol. 48, no. 1, pp. 20–27, 2009.
[17] M. Homayouni, "Peer assessment in group-oriented classroom contexts: On the effectiveness of peer assessment coupled with scaffolding and group work on speaking skills and vocabulary learning," Language Testing in Asia, vol. 12, no. 1, p. 61, 2022.
[18] C. Cucchiarini, J. Driesen, H. V. Hamme, and E. Sanders, "Recording speech of children, non-natives and elderly people for HLT applications: The JASMIN-CGN corpus," in Proceedings of LREC 2008, Marrakech, Morocco, 2008.
[19] J. J. Bartko, "The intraclass correlation coefficient as a measure of reliability," Psychological Reports, vol. 19, no. 1, pp. 3–11, 1966.
[20] P. E. Shrout and J. L. Fleiss, "Intraclass correlations: Uses in assessing rater reliability," Psychological Bulletin, vol. 86, no. 2, p. 420, 1979.
[21] J. R. Landis and G. G. Koch, "An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers," Biometrics, pp. 363–374, 1977.
[22] L. Cohen, P. Jarvis, and J. Fowler, Practical Statistics for Field Biology. John Wiley & Sons, 2013.
[23] C. Cucchiarini, H. Strik, and L. Boves, "Quantitative assessment of second language learners' fluency by means of automatic speech recognition technology," The Journal of the Acoustical Society of America, vol. 107, no. 2, pp. 989–999, 2000.