Running head: THE RELEVANCY OF THE VOICE EFFECT FOR LEARNING
Text to Speech Software and Learning: Investigating the Relevancy of the Voice Effect
Scotty D. Craig1 *
1 Arizona State University, Human Systems Engineering, 7271 E. Sonoran Arroyo Mall
Mesa, AZ 85212 USA, (scotty.craig@asu.edu)
Noah L. Schroeder2
2 Wright State University, College of Education and Human Services, Leadership Studies in
Education and Organizations, 442 Allyn Hall, 3640 Colonel Glenn Hwy, Dayton, OH 45435
USA, (noah.schroeder@wright.edu)
Craig, S. D., & Schroeder, N. L. (2019). Text-to-speech software and learning: Investigating the relevancy
of the voice effect. Journal of Educational Computing Research, 57(6), 1534-1548.
* Corresponding author:
Scotty D. Craig
7271 E Sonoran Arroyo Mall
Santa Catalina Hall, Ste. 150
Mesa, AZ 85212
Phone: 480-727-1006
Fax: 480-727-1538
Email: Scotty.Craig@asu.edu
Text to Speech Software and Learning: Investigating the Relevancy of the Voice Effect
Abstract:
Technology advances quickly in today's society. This is particularly true in regard to
instructional multimedia. One increasingly important aspect of instructional multimedia design is
determining the type of voice that will provide the narration; however, research in the area is
dated and limited in scope. Using a randomized pretest-posttest design, we examined the efficacy
of learning from an instructional animation where narration was provided by an older text-
to-speech engine, a modern text-to-speech engine, or a recorded human voice. In most respects,
those who learned from the modern text-to-speech engine were not statistically different in regard to their perceptions, learning outcomes, or cognitive efficiency measures compared with
those who learned from the recorded human voice. Our results imply that software technologies
may have reached a point where they can credibly and effectively deliver the narration for
multimedia learning environments.
Keywords: Voice Effect, synthesized voice, narration, multimedia learning
Text to Speech Software and Learning: Investigating the Relevancy of the Voice Effect
Teachers and instructional designers are increasingly designing multimedia-based
instruction for online learning environments. Regardless of the form of online multimedia,
whether it is websites, videos, or even smartphone applications, there is a plethora of evidence-
based best practices to help guide design decisions depending on a variety of contextual factors,
such as the population being taught and the qualities of the learning system. Much of this work
has stemmed from the cognitive theory of multimedia learning (Mayer, 2014a), which describes
a cognitive model of how people learn with multimedia, and cognitive load theory (Paas &
Sweller, 2014; Sweller, 2010), which describes how people learn novel information regardless of
teaching modality. Research investigating best practices for designing instruction based on these
theories has led to a number of instructional design principles that can be applicable in a variety
of situations.
These evidence-based instructional design principles can potentially become the cornerstone of many instructors' design decisions as they create instructional materials. However, not
all of the design principles have been thoroughly researched in a wide variety of situations, and
while the cognitive impacts of many of these principles are well established, the impacts on the
learners’ perceptions have not been thoroughly researched in many cases. An example of this is
the voice effect, or voice principle (Mayer, 2014b). Mayer’s (2014b) review describes the voice
principle as stating that narration should be provided by “a standard-accented human voice rather
than … a machine voice” (p. 358) based on the research literature (median d = .74).
Accordingly, a designer would likely avoid using text-to-speech generated voices when
designing their instruction. For the purposes of this paper, the terms text-to-speech voice, machine voice, and computer-generated voice are used interchangeably since they refer to the same genre of technologies.
Critically examining the evidence for the voice effect, one will note a few areas where the
scope of the existing literature base is limited (Mayer, 2014b). Notably, the few published
studies around the principle are largely dated, with two of the three studies having been published in the early to mid-2000s. The studies found consistent results, showing that the use of a human voice led to improved learning scores compared with machine-generated voices, and the
human voices were perceived as being significantly more favorable compared to the machine-
generated voices (Atkinson, Mayer, & Merrill, 2005; Mayer, Sabko, & Mautone, 2003).
These results are easily explained through cognitive load theory (Kalyuga, 2011; Paas &
Sweller, 2014; Sweller, 2010) and social agency theory (Atkinson et al., 2005; Mayer et al.,
2003). Cognitive load theorists may suggest that the recorded human voice was superior for
learning because the machine voice caused extraneous cognitive load (Mayer et al., 2003).
Essentially, this explanation hinges on the notion that the computer-generated voice required
additional mental effort to comprehend, mental effort that could have been used for learning if
the voice was not difficult to understand, which thereby impeded the learning process.
An alternative explanation for the voice effect is social agency theory. Social agency
theorists suggest that social cues in a learning environment can encourage the learner to try to
learn the material (Atkinson et al., 2005; Mayer et al., 2003), although recent research has shown
that unlikeable social cues can actually impede learning (Domagk, 2010). Based on this
evidence, we question whether the machine voices used in early studies of the voice effect were disliked by the participants. If that were the case, Domagk's findings would suggest that the
participants’ dislike of the voice could explain why the machine-voice groups did not perform
well on learning outcome tests.
As technology has advanced, there has been a dearth of research around the voice effect
and the use of text-to-speech software compared to recorded human voices. However, Mayer
and DaPra (2012) revisited the voice effect by comparing modern text-to-speech software with a recorded human voice. Interestingly, their results showed no significant
differences in learning outcomes between groups. A second study by Craig and Schroeder (2017)
examined the influence of a virtual human that communicated through either a modern computer-generated voice or a recorded human voice, and largely found no significant
differences between the two. However, the modern computer-generated voice provided benefits
on the transfer test compared to the other conditions. Based on these recent findings, we question, as Craig and Schroeder did, whether the voice effect may simply be an artifact of the available
technologies. As noted, most of the studies around the voice effect occurred between 2000 and
2010. The text-to-speech technology available to researchers has drastically changed over this
timespan, and computer-generated voices are now commonplace. High-quality text-to-speech
generators can be purchased through the internet for reasonable fees, and in some cases may
even be available free of charge.
Hypotheses and Predictions
Due to the increasing availability of accessible and affordable high-quality text-to-speech technologies,
recent findings which indicate that the voice effect may no longer exist (Craig & Schroeder,
2017; Mayer & DaPra, 2012), and the dearth of recent literature around the voice effect in
general, it is necessary to explore whether the voice effect still holds across multiple content
domains, learner characteristics, and multimedia formats. The purpose of this study is to begin
this exploration through the investigation of three different types of voices, including a text-to-
speech generated voice that was used in a prior study in the area (Atkinson, Mayer, & Merrill,
2005), a modern text-to-speech generated voice, and a recorded human voice. While recent
studies have investigated different types of voices with the aid of a virtual human (Craig &
Schroeder, 2017), in this study we examine the effect in a multimedia animation without a virtual
human presence, as presumably their presence could influence the impact of the voice effect. We
posit that the voice effect is a byproduct of technological limitations rather than a binding,
persistent limitation an instructional designer should be concerned with.
The current study investigated the impact of the voice used for narration within a
multimedia learning video on the learner’s experience in terms of perceptions, learning, and
perceived cognitive load. Based on the literature, three separate hypotheses were tested. The
voice effect (Atkinson et al., 2005; Mayer, 2014b; Mayer et al., 2003) would claim that
decreased quality and monotone intonation increase extraneous processing (Kalyuga, 2011; Paas
& Sweller, 2014; Sweller, 2010), meaning that those learning from computerized voices will
perform worse than those learning from a human voice. Thus, narration by a human would result in more favorable perceptions, better learning, and better mental effort measures than any synthesized voice. However, the quality of synthesized voices has increased, and it is possible that only the quality of the voice matters (Remez, Rubin, Pisoni, & Carrell, 1981). In this case, if the mental effort required to understand the voice were reduced, then the learning differences predicted by the voice effect would disappear. Therefore, it could be predicted that learning from a high-quality synthesized voice would be equal to learning from a human voice and that both would outperform a lower quality synthesized voice. This is not to say that participants cannot tell the
difference between the human and synthesized voice, just that it will not significantly impact
learning. The third possible hypothesis (null) was that the voice providing the narration does not
matter. In this case, it would be predicted that all three conditions would be statistically equal for
all measures.
Methods
Participants and Design
This study implemented a randomized pretest-posttest design. Participants (n = 150) were
randomly assigned to view a presentation narrated by a classic text-to-speech engine (n = 47, the same voice software as used in Atkinson et al., 2005), a modern text-to-speech engine (n = 53, the same voice software as Craig & Schroeder, 2017), or a recorded human voice (n = 50).
Participants were users of Amazon.com's Mechanical Turk (MTurk) who voluntarily took part in the study. Study participation was limited to MTurkers who had participated in at least
50 tasks and maintained a 95% or above Human Intelligence Task (HIT) approval rating to
encourage collection of reliable data (see Paolacci & Chandler, 2014). Participants were limited
by location to within the United States, because a standard American English voice was
implemented for the human voice condition. Each participant received $1.00 US as compensation. In total, 55 male and 95 female MTurkers participated in the
study, and the modal age range was 26-34.
Materials
The learning materials, adapted from Moreno and Mayer’s (1999) materials on lightning
formation, were identical between the three conditions with the only difference being the voice
that provided the narration. The material presented visual images of the formation of lightning along with 19 statements that describe the process. The text of the narration can
be found in the Appendix of Moreno and Mayer (1999). The images were a recreation of the
images originally implemented in the Moreno and Mayer article. They are available upon request
to the corresponding author.
Three different voices were used to present the material. The voices mirrored those used
by Craig and Schroeder (2017). The classic text-to-speech software condition used "Mary", the Microsoft speech engine voice that was used in Atkinson et al.'s (2005) study. While
understandable to the listener, this voice had a digital quality with clipped or choppy production
and no inflection. A video clip with the voice can be viewed at the following link:
https://youtu.be/rZl7N_xPYFw. The modern text-to-speech software used was Neospeech
(neospeech.com), and the specific voice used was “Kate”. This voice engine, while still
computer-generated without inflection or prosody, does not have the synthesized tone and has a
smoother voice presentation. A video clip with the voice can be viewed at the following link:
https://youtu.be/PSJY1wbnM4I. Finally, the human voice was recorded by a female with an
American accent. The human voice was recorded at a speed similar to the computerized voice engines using an HD microphone at 705 kbps. A video clip with the voice can be viewed at the
following link: https://youtu.be/9BilX7wzHSI.
Agent Persona Instrument (API). The API consists of 25 items. The instrument is
scored using a 5-point Likert scale (1 = strongly disagree, 3 = neutral, 5 = strongly agree) and is
designed to measure participants’ perceptions of virtual humans and other pedagogical agents
(Ryu & Baylor, 2005). However, the instrument may also hold relevance for evaluating a
software agent’s persona even in the absence of a visual representation (i.e., the agent is not
visually present). First, it should be noted that not all software agents are visually represented
(Moreno, 2005), and the questions on the API do not specifically refer to the agent’s visual
representation (see Ryu & Baylor, 2005). In addition, the API's subscales measure participants'
perceptions of how well the learning was facilitated (10 questions; Cronbach’s α = .95), how
credible it was (five questions; Cronbach’s α = .90), its human-likeness (five questions;
Cronbach’s α = .93), and how engaging it was (five questions; Cronbach’s α = .90), which are all
salient constructs for this study. Finally, the instrument has been used in prior work in the area
(Craig & Schroeder, 2017), thus allowing for comparisons to previous research of the voice
effect. Hence, the instrument was used to evaluate participants’ perceptions of the presentation
from different types of voice, and responses for the questions corresponding to the four subscales
of the API were summed to give a composite score.
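As an illustration of this scoring procedure (not the authors' code), the short Python sketch below sums item responses into the four subscale composites; the item-to-subscale assignment shown is a placeholder rather than the actual mapping from Ryu and Baylor (2005).

# Illustrative sketch only; the item-to-subscale mapping below is a placeholder,
# not the published assignment from Ryu and Baylor (2005).
from typing import Dict, List

SUBSCALES: Dict[str, List[int]] = {
    "facilitates_learning": list(range(1, 11)),  # 10 items (placeholder indices)
    "credibility": list(range(11, 16)),          # 5 items
    "human_like": list(range(16, 21)),           # 5 items
    "engaging": list(range(21, 26)),             # 5 items
}

def score_api(responses: Dict[int, int]) -> Dict[str, int]:
    # responses maps item number (1-25) to a 1-5 Likert rating
    return {name: sum(responses[i] for i in items) for name, items in SUBSCALES.items()}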
Learning Measures. A pretest measure included a demographics survey (age, gender) and a general meteorological knowledge test that included a checklist of seven items and a self-report rating of meteorological knowledge. This pretest measure can be found in Moreno and Mayer (1999). Three different learning outcome measures were used: a multiple-choice test, a
retention test, and a transfer test. The multiple-choice test consisted of six questions (Craig,
Gholson, & Driscoll, 2002). The retention test consisted of one free recall question and the
transfer test consisted of four open-ended questions (Mayer & Moreno, 1998). Two coders
scored each answer. The two raters had an inter-rater reliability (Cohen's kappa) of κ = .68 for retention and κ = .62 for transfer questions, which is considered substantial agreement (Cohen,
1960; McHugh, 2012). Disagreements were reconciled with consultation of a third rater.
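For reference, Cohen's (1960) kappa corrects the coders' observed proportion of agreement for the agreement expected by chance; in standard notation (ours, not the article's),

\[ \kappa = \frac{p_o - p_e}{1 - p_e}, \]

where p_o is the observed agreement between the two coders and p_e is the agreement expected by chance.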
Mental Effort Scale. The mental effort scale (Paas, 1992) was answered by participants
twice during the experiment. The scale is a single self-report item frequently used to measure perceived mental effort. Participants completed the item for the first time immediately after the learning phase of the study ("In studying the preceding video I invested"). The participants also answered the question after completion of the testing phase ("In solving or studying the preceding problems I invested"). Participants responded to both items using Paas's (1992) 9-point scale ranging from 1 (very, very low mental effort) to 9 (very, very high mental effort).
Scores on these items were used to calculate both training and testing efficiency scores.
The formulas for calculating training and instructional efficiency can be found in Paas,
Tuovinen, Tabbers, and van Gerven (2003). In short, training efficiency refers to the mental
effort invested into the learning phase in relation to learning performance, while instructional
efficiency refers to the mental effort invested into the testing phase in relation to learning
performance (Paas et al., 2003).
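As a sketch of how these scores are computed, the two-dimensional efficiency measure described by Paas et al. (2003) combines standardized performance with standardized mental effort (the symbols below are ours):

\[ E = \frac{Z_{\text{performance}} - Z_{\text{effort}}}{\sqrt{2}}, \]

where, for training efficiency, Z_effort is the z-score of the mental effort rating collected after the learning phase, and, for instructional efficiency, it is the z-score of the rating collected after the testing phase; Z_performance is the z-score of the relevant learning measure.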
Procedure
Participants were provided a link to the experimental website (Qualtrics.com) for data
collection. Participants received an online informed consent followed by a demographics and
pretest survey. They were then randomized into one of the conditions and watched the two-
minute video. This was followed by the mental effort question. Then the learning assessments
were given in the following order: the multiple-choice test (not timed), the retention question (five minutes), and the four transfer questions presented one at a time for three minutes each. The learning assessments were followed by another mental effort question, and the API questionnaire was completed as the final assessment.
Results
Initial Tests of Variance
Levene's test of homogeneity of variances was performed for each dependent measure. These tests indicated that variances were equal among conditions for all measures; therefore, ANOVAs were used to assess potential differences.
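As an illustration only (not the authors' analysis code), the sequence of checks described above can be sketched with standard statistical tooling; the data frame, the column names, and the helper function below are hypothetical.

# Illustrative sketch only (not the authors' analysis code).
# Assumes a hypothetical long-format pandas DataFrame with a "condition" column
# (classic / modern / human) and one column per dependent measure.
import pandas as pd
from scipy import stats

def compare_conditions(df: pd.DataFrame, dv: str) -> None:
    # Split scores on the dependent variable by voice condition
    groups = [g[dv].dropna().to_numpy() for _, g in df.groupby("condition")]
    # Levene's test of homogeneity of variances across conditions
    lev = stats.levene(*groups)
    # One-way ANOVA, used because the variances were found to be homogeneous
    aov = stats.f_oneway(*groups)
    print(f"{dv}: Levene W = {lev.statistic:.2f} (p = {lev.pvalue:.3f}), "
          f"F = {aov.statistic:.2f} (p = {aov.pvalue:.3f})")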
Perceptions
Participants' ratings of their perceptions indicated differences among the three voice conditions (Table 1). There was a significant difference in participants' ratings on the human-like subscale among the voices, F(2,87) = 14.08, p < .001; ηp2 = .25. As would be expected, the human
voice received significantly higher ratings than the classic voice engine (Md = 4.42; p = .001;
Cohen d = 1.41) and the modern voice engine (Md = 5.95; p = .001; Cohen d = 0.95). There were
no significant differences between the computerized voices (Md = 1.53; p = .19; Cohen d = 0.34).
However, the means were in the expected direction with the modern voice engine being rated
higher.
The same significant pattern was seen for ratings on the subscale that measured
engagement, F(2,87) = 3.56, p = .03; ηp2 = .08. The human voice received significantly higher
ratings than the classic voice engine (Md = 2.69; p = .03; Cohen d = 0.63) and the modern voice
engine (Md = 2.68; p = .02; Cohen d = 0.61). There were no differences between the
computerized voices (Md = .01; p = .99; Cohen d = 0.00).
There were no significant differences found in the participants' perceptions of how well
the voice facilitated learning (F(2,87) = 2.25, p = .11; ηp2 = .05) or its credibility (F(2,87) = 1.85,
p = .16; ηp2 = .04).
Table 1.
Means and standard deviations for participants' ratings on the subscales of the Agent Persona Instrument, separated by condition.

Voice Condition   N    Facilitates Learning   Credibility    Human-Like     Engaging
                       M(SD)                  M(SD)          M(SD)          M(SD)
Classic           28   30.14 (9.87)           17.29 (4.12)   9.29 (3.96)
Modern            32   29.91 (9.20)           17.28 (4.12)   10.81 (4.90)
Human             30   34.47 (9.15)           19.13 (4.66)   15.23 (4.42)
Learning Measures
Separate ANOVAs were conducted on the participants' four learning measures (pretest, retention test, multiple-choice test, and transfer test) to determine differences between conditions.
All means and standard deviations can be found in Table 2. The ANOVAs performed on the
pretest (F(2,147) = 0.31, p = .73; ηp2 = .00), the retention test (F(2,147) = 0.56, p = .57; ηp2 =
.00), and the transfer test (F(2,147) = 0.86, p = .98; ηp2 = .08) did not indicate any significant
differences among conditions. However, a significant difference was found among conditions for
the multiple choice test, F(2,147) = 6.34, p = .002; ηp2 = .08. LSD post hoc tests indicated that
participants learning from the classic voice engine were outperformed by participants learning
from the modern voice engine (Md = .15; p = .01; Cohen d = .59) and the human voice (Md = .18;
p = .001; Cohen d = .65). However, there were no significant learning differences observed
between participants receiving the presentation with the modern voice engine and the human
voice (Md = .02; p = .64; Cohen d = .10).
Table 2.
Means and standard deviations for participants' scores on multiple choice (proportion correct), retention, and transfer tests, separated by condition.

Voice Condition   N    Pretest       Multiple choice   Retention     Transfer
                       M(SD)         M(SD)             M(SD)         M(SD)
Classic           47   4.70 (2.12)   .48 (.28)         1.79 (2.24)   1.15 (1.49)
Modern            53   4.36 (2.12)   .64 (.25)         2.21 (2.45)   1.06 (1.23)
Human             50   4.58 (2.42)   .66 (.28)         2.26 (2.51)   1.16 (1.45)
Cognitive Efficiency Measures
A series of ANOVAs were performed on the calculated training and instructional
efficiency scores to determine if there were cognitive efficiency differences across conditions.
There were significant differences in training efficiency for the multiple-choice measure, F(2, 147) = 4.07, p = .02; ηp2 = .05. LSD post hoc tests indicated that participants learning from the classic voice engine had significantly lower efficiency than participants learning from the modern voice engine (Md = .36; p = .01; Cohen d = .58). Participants in the human voice condition had better efficiency than those in the classic voice condition, but not significantly so (Md = .24; p = .07; Cohen d = .35). There were no significant efficiency differences between participants receiving the presentation with the modern voice engine and the human voice (Md = .01; p = .31; Cohen d = .21). The participants' training efficiency scores were not
significant among conditions for retention, F(2, 147) = 1.05, p = .35; ηp2 = .01, or transfer tests,
F(2, 147) = 0.50, p = .61; ηp2 = .007. Additionally, the ANOVAs performed on each of the
instructional efficiency measures were not significant (Multiple choice: F(2, 88) = 1.16, p = .32;
ηp2 =.03, Retention: F(2, 88) = 1.10, p = .34; ηp2 =.02, Transfer: F(2, 88) = 1.90, p = .16; ηp2 =
.04). Descriptive statistics are provided in Table 3.
Table 3.
Means and standard deviations for participants' training and instructional efficiencies for each learning measure, separated by condition.

                       Training Efficiency                                  Instructional Efficiency
Voice Condition   N    Multiple choice   Retention     Transfer        N    Multiple choice   Retention     Transfer
                       M(SD)             M(SD)         M(SD)                M(SD)             M(SD)         M(SD)
Classic           47   -.19 (0.68)       -.06 (0.64)   .02 (0.69)      28   .01 (0.68)        .15 (0.58)    .42 (0.76)
Modern            53   .17 (0.58)        .12 (0.70)    .07 (0.69)      33   .20 (0.56)        .22 (0.67)    .25 (0.69)
Human             50   .04 (0.67)        -.05 (0.72)   -.06 (0.70)     30   -.01 (0.58)       -.01 (0.65)   .08 (0.57)
Discussion
The analyses of the API subscales showed that human voices were rated significantly
higher than the computerized voices in relation to perceived human-likeness and engagement,
while no differences were found in regard to the facilitation of learning or credibility. These
results are somewhat contradictory to those of Craig and Schroeder (2017). While Craig and
Schroeder also found that the human voice outperformed the synthesized voices in regard to
perceived human-likeness and engagement, they found that the classic voice was rated worse
than the modern or human voices in regard to facilitation of learning and credibility. The
findings could differ because in this study there was not a visually represented virtual human
within the learning environment. Taken together, the results of the two studies extend social agency theory by suggesting that the presence of a virtual human within a learning environment may change how certain aspects of a social interaction are perceived by the learner, and they add valuable data to the under-researched area of how learners perceive different sources of audio narration in different contexts.
In terms of learning performance, participants receiving information with a human voice outperformed those who learned from the older voice engine on the multiple-choice measure, a result that replicates previous findings (Atkinson et al., 2005; Mayer et al., 2003), but only for this one learning
measure. However, we note that participants in the modern voice engine condition also
outperformed those in the older voice engine condition, and those in the human voice condition
did not significantly outperform those in the modern speech engine on any of the learning
measures. These results contradict the predictions of the voice effect (Mayer, 2014b), support the
findings of recent studies (Craig & Schroeder, 2017; Mayer & DaPra, 2012), and support earlier
results showing that the natural variations in human voices may not be needed for
comprehension (Remez et al., 1981). Furthermore, while previous evidence had been found in
two studies when virtual humans deliver the narration (Craig & Schroeder, 2017; Mayer &
DaPra, 2012), this study provides consistent evidence in the context of multimedia animations
without the physical presence of a virtual human. If these results replicate across additional
contexts, it would indicate that modern text-to-speech voice engines, which often are still
missing the inflection and cadence of standard human voices, may have reached a sufficient level
of clarity to provide equivalent learning to human voices in multimedia environments.
Consistent with our findings in regard to learning outcomes, the cognitive efficiency
measures do not provide support for the voice effect. Participants in the human voice conditions
had similar efficiency scores to both voice engine conditions for all cognitive efficiency
measures except one. In the one measure where there was a difference, training efficiency for the
multiple-choice measure, the human voice condition had significantly worse efficiency scores
than the modern voice engine. This provides evidence that the modern voice engine does not
appear to require significantly more mental effort in relation to the learning outcomes achieved.
The current study also provides a replication of a previous finding favoring a high-quality synthetic voice, which suggests that high-quality synthetic voices may provide similar, or in some cases improved, learning outcome scores compared with recorded human voices. Using the same voice engines and human voice as the current study, Craig and Schroeder (2017) found that a high-quality synthetic voice paired with a virtual agent performed better on learning transfer measures
than a virtual human using a human voice. While the current effect was observed for the recall-based multiple-choice questions and not for the retention and transfer measures, it does provide further evidence that a synthetic voice can improve learning in some cases. While both studies showed equivalent learning between modern voice engines and human voices, and improved learning compared with the older voice engine, the two studies differed in the type of learning impacted. While this difference deserves additional research, it is
possible the difference is due to the presence of a virtual human. A previous study found that a
likeable virtual human had a greater influence on transfer outcomes than retention outcomes, and
a second experiment showed that a virtual human’s appearance influenced transfer but not
retention outcomes compared to a no-agent condition (Domagk, 2010).
Previous research lends additional support toward explaining this synthetic voice effect.
Human voice patterns have a naturally changing sequence of linguistic elements (Liberman,
Cooper, Shankweiler, & Studdert-Kennedy, 1967), which makes them very different from the
flat monotone intonation of many computer-generated voices. However, this difference is not
required for listeners' identification of utterances (Remez et al., 1981). The current findings seem to be in agreement with Liberman et al.'s and Remez et al.'s results. The data from the
current study and the previous Craig and Schroeder study (2017) would seem to indicate that
learners were sensitive to the human voice and felt the voice was more engaging. However, as
evidenced by the scores on the credibility and facilitated learning scales, the learners did not feel
that the higher quality modern voice impacted overall quality in terms of perceived ability to
convey information. The lack of learning differences provides further evidence of this.
The evidence presented above does not mean that synthetic voices should not continue to
be improved. Addressing the differences between synthetic voices and human voices could allow for personalization and support further improvements in learning and in perceptions of the learning environment. Dialect is an excellent example of this potential. While the
intonation of most modern voice engines is often still monotone, the dialect of the voice presenting information can have an impact on academic performance when it reflects the student's own dialect (Finkelstein, 2015; Finkelstein, Yarzebinski, Vaughn, Ogan, & Cassell, 2013). In Finkelstein et al.'s study, the science performance of third-grade students was measured after
interacting with a “distant peer” technology that employed different dialect use patterns. They
found that all native speakers of African American Vernacular English (AAVE) demonstrated
the strongest science performance when the technology used AAVE features consistently
throughout the interaction. Accordingly, the literature indicates that continued investigation into
different types of synthesized voices and their influence on learning and perceptions will
continue to be important into the future.
A limitation of the current study is that the API analysis had missing data, with a total of
only 90 participants instead of the 150 for the full study. This missing data appeared to be
random in relation to condition. The most likely cause was that the API was presented at the end
of the study and could easily be skipped, unlike the learning measures that had a timer.
Additionally, the API with 25 questions was a fairly long test. This finding could point toward
the need for additional research on the measure to determine if a shorter test could be
constructed. Future studies using the instrument should consider instrument placement within the
experimental procedures for optimal data collection.
In addition, the current study used Moreno and Mayer's (1999) narrative on lightning formation. This resulted in a presentation that was a little over two minutes in length. While these
short duration videos are common within the multimedia learning research area (Mayer, 2009), it
is not known if the current results would hold for longer or more complicated material. However,
the current findings are promising since e-learning environments such as MOOCs tend to use short video clips, or mini-lectures, instead of longer videos (Scagnoli, McKinney, & Moore-Reynen, 2015), with many online lectures being only minutes long (Crook & Schofield, 2017). In online learning courses, when longer videos were provided, students only engaged with them for about six minutes regardless of the length of the video (Guo, Kim, & Rubin, 2014).
Participants in the current study were recruited using Amazon MTurk. The question here
is whether the potential for participants' reduced attentional effort, stemming from the lack of experimental control, is a detriment to the study. Previous work has shown limited differences
between MTurk populations and comparison populations on performance (Casler, Bickel, &
Hackett, 2013; Mason & Suri, 2012), so this population does not pose a likely threat to the
current findings. Within the Amazon MTurk population, the risk of decreased attention can be
lessened by selecting reliable participants who have a 95% or above Human Intelligence Task (HIT) approval rating (Paolacci & Chandler, 2014). Further, Hauser and Schwarz (2016) found that Amazon MTurk participants were more attentive to instructions than subject pool participants. Thus, the
potential risk from the lack of experimental control in the MTurk population is not viewed as a
plausible threat to the current findings. Because participants had control of their own learning
environment, it seems plausible that the environment could be similar to what an online learner
may experience. Hence, the limited control over the MTurkers' learning environment may not be
a detriment, but rather a benefit as it could more closely replicate an online learning situation
than a laboratory-based study may be able to accomplish.
Conclusion
The voice effect suggests that using recorded human voices to provide narration in
multimedia learning environments will provide better learning outcomes than using computer-
generated voices (Mayer, 2014b). However, for those who design learning content, finding a
content expert confident enough to discuss the topics and willing to record numerous narratives
for inclusion in learning environments can be challenging, highlighting the benefits of text-to-speech tools if they are consistently as effective as recorded human voices.
Using a randomized pretest-posttest design, it was found that, by and large, there were
minimal differences in the ways that participants perceived and learned from a modern
computer-generated voice compared to a recorded human voice. As expected, those learning
with the recorded human voice perceived it to be more engaging and human-like than the
machine generated voices, but there were no significant differences in the participants’
perceptions of how well the voices facilitated learning or how credible they were. Our analysis of
the learning and cognitive efficiency measures showed mixed support for the voice effect, with
most results not providing significant support for the effect. However, it should be noted that the
current study was in a non-interactive multimedia environment. It is possible that these results
might not replicate in interactive environments where there could be higher expectations of
responsiveness to the dynamic interaction. Additionally, this study was performed with one
example of a modern text-to-speech voice engine and human voice. Hence, additional studies are
required to determine the generalizability of these findings. However, if this pattern replicates in
different contexts, it will show that voice engines have reached an acceptable level of
performance for use within learning technologies. This finding could result in the creation of
more dynamic and less expensive learning technologies, as the existing research would suggest that any narration should be provided by a recorded human voice.
References
Atkinson, R. K., Mayer, R. E., & Merrill, M. M. (2005). Fostering social agency in multimedia learning: Examining the impact of an animated agent's voice. Contemporary Educational Psychology, 30, 117-139.
Casler, K., Bickel, L., & Hackett, E. (2013). Separate but equal? A comparison of participants and data gathered via Amazon's MTurk, social media, and face-to-face behavioral testing. Computers in Human Behavior, 29, 2156-2160.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological
Measurement, 20, 37-46
Craig, S. D. & Schroeder, N. L. (2017). Reconsidering the voice effect when learning from a
virtual human. Computers & Education, 114, 193-205.
Craig, S. D., Gholson, B., & Driscoll, D. (2002). Animated pedagogical agents in multimedia
educational environments: Effects of agent properties, picture features, and redundancy.
Journal of Educational Psychology, 94, 428–434.
Crook, C. & Schofield, L. (2017). The video lecture. The Internet and Higher Education, 34, 56-
64.
Domagk, S. (2010). Do pedagogical agents facilitate learning motivation and learning outcomes?
Journal of Media Psychology, 22(2), 84-97.
Finkelstein, S. (2015, June). Educational Technologies to Support Linguistically Diverse
Students, and the Challenges of Classroom Integration. In International Conference on
Artificial Intelligence in Education (pp. 836-839). Springer.
Finkelstein, S., Yarzebinski, E., Vaughn, C., Ogan, A., & Cassell, J. (2013, July). The effects of
culturally congruent educational technologies on student achievement. In International
Conference on Artificial Intelligence in Education (pp. 493-502). Springer Berlin
Heidelberg.
Guo, P. J., Kim, J., & Rubin, R. (2014). How video production affects student engagement: An
empirical study of MOOC videos. Proceedings of the First ACM Conference on Learning
@ Scale conference (pp. 41-50). New York, NY, USA: ACM.
Hauser, D. J., & Schwarz, N. (2016). Attentive Turkers: MTurk participants perform better on
online attention checks than do subject pool participants. Behavior Research Methods,
48(1), 400-407.
Kalyuga, S. (2011). Cognitive load theory: How many types of load does it really need?
Educational Psychology Review, 23(1), 1-19.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception
of the speech code. Psychological review, 74(6), 431-461.
Mayer, R. E. (2009). Multimedia Learning (2nd ed.). New York: Cambridge University Press.
Mayer, R. E. (2014a). The Cambridge handbook of multimedia learning (2nd ed.). New York:
Cambridge University Press.
Mayer, R. E. (2014b). Principles based on social cues in multimedia learning: Personalization,
voice, image, and embodiment principles. In R. E. Mayer (Ed.), The Cambridge handbook of multimedia learning (2nd ed., pp. 345-368). New York, NY: Cambridge
University Press.
Mayer, R. E., & DaPra, C. S. (2012). An embodiment effect in computer-based learning with
animated pedagogical agents. Journal of Experimental Psychology: Applied, 18(3), 239-
253.
Mayer, R. E., Sabko, K., & Mautone, P. D. (2003). Social cues in multimedia learning: Role of
speaker’s voice. Journal of Educational Psychology, 95(2), 419-425.
McHugh, M. L. (2012). Interrater reliability: The kappa statistic. Biochemia Medica, 22, 276-282.
Mason, W., & Suri, S. (2012). Conducting behavioral research on Amazon’s Mechanical Turk.
Behavior Research Methods, 44(1), 1-23.
Moreno, R. (2005). Multimedia learning with animated pedagogical agents. In R. E. Mayer
(Ed.). The Cambridge Handbook of Multimedia Learning (pp. 507-523). New York, NY:
Cambridge University Press.
Moreno, R., & Mayer, R. E. (1999). Cognitive principles of multimedia learning: The role of
modality and contiguity. Journal of Educational Psychology, 91(2), 358-368.
Paas, F. (1992). Training strategies for attaining transfer of problem-solving skill in statistics: A
cognitive-load approach. Journal of Educational Psychology, 84, 429-434.
Paas, F., & Sweller, J. (2014). Implications of cognitive load theory for multimedia learning. In
R. E. Mayer (Ed.), The Cambridge handbook of multimedia learning (2nd ed., pp. 27-42). New York, NY: Cambridge University Press.
Paas, F., Tuovinen, J. E., Tabbers, H., & Van Gerven, P. W. M. (2003). Cognitive load
measurement as a means to advance cognitive load theory. Educational Psychologist,
38(1), 63-71.
Paolacci, G., & Chandler, J. (2014). Inside the Turk: Understanding Mechanical Turk as a
Participant Pool. Current Directions in Psychological Science, 23, 184–188.
Remez, R. E., Rubin, P. E., Pisoni, D. B., & Carrell, T. D. (1981). Speech perception without
traditional speech cues. Science, 212(4497), 947-949.
Ryu, J., & Baylor, A. L. (2005). The psychometric structure of pedagogical agent persona.
Technology Instruction Cognition and Learning, 2(4), 291-314.
Scagnoli, N. I., McKinney, A., & Moore-Reynen, J. (2015). Video lectures in eLearning. In F.
M. Nafukho & B. J. Irby (Eds.), Handbook of Research on Innovative Technology
Integration in Higher Education (pp.115-134). Hershey, PA: IGI Global.
Sweller, J. (2010). Element interactivity and intrinsic, extraneous, and germane cognitive load.
Educational Psychology Review, 22(2), 123-138.
... Users with print disabilities primarily need to interact with text in ways that mitigate their disability, making reading more manageable and comfortable. Based on research in Assistive Technology (AT), particularly focusing on Magnification (Stearns et al. 2018;Evans & Blenkhorn 2008), Text-to-Speech (TTS) (Craig & Schroeder 2019;O'Malley 1995), Screen reader (SR) (Oh et al. 2021;Guo et al. 2016), and the utilization of AR (Mambu et al. 2019) the following priority list and feature set have been developed: ...
... Answering the primary research question, "How can Universally Designed Augmented Reality enhance accessible and usable reading experiences for users of varying skills and abilities?" The findings suggest that AReader significantly advances accessibility by incorporating assistive features and strengths of text-to-speech (Craig & Schroeder 2019;O'Malley 1995) and screen reader technologies (Oh et al. 2021;Guo et al. 2016) in one universal AR solution. All test participants, regardless of their skills, abilities were able to use the prototype and consume the paperbased textual content. ...
... In contrast, the Text-to-AI-Video Narration appealed more to younger users and those with dyslexia, finding the natural-sounding, human-like interpretations highly engaging. Notably, users with dyslexia expressed a preference for the human sounding (Craig & Schroeder 2019) and human-like AI (Tang et. al. 2008) narration over visual text guidance. ...
Chapter
Full-text available
This paper introduces AReader, an innovative approach that combines Universal Design (UD) and Augmented Reality (AR) to enhance the universal accessibility of paper-based textual content. It targets a broad audience, including individuals with print and situational disabilities, emphasizing that reading challenges can affect anyone. The paper details the practical implementation of UD in AR, focusing on AR's context-aware, multimodal capabilities, and highlights a novel Visual Screen Reader feature. This feature is presented in two concepts: Text-to-Audio Narration with Visual Reading Guide and Text-to-AI-Video Narration. The paper also outlines a set of features and hands-on UD practices adaptable to a wide variety of AR applications. User testing has demonstrated that the prototype effectively enhances reading accessibility and usability across various user groups. The findings suggest that integrating UD principles with AR technologies can foster inclusive solutions, benefiting not only individuals with special needs but also contributing to broader societal inclusion.
... In this study, we need to reexamine this voice effect. Moreover, despite researchers studying the impact of machine voices on credibility, social aspects, learning motivation and cognitive load (Craig & Schroeder, 2019;Edwards et al., 2018;Kim et al., 2022;Liew et al., 2020;Schroeder et al., 2021), such studies are still in their early stages. The effects of machine voices on emotions, the uncanny valley and parasocial interaction are still unknown, and the present study assesses these dimensions of experience extensively. ...
Article
Full-text available
Generative AI (GAI) and AI‐generated content (AIGC) have been increasingly involved in our work and daily life, providing a new learning experience for students. This study examines whether AI‐generated instructional videos (AIIV) can facilitate learning as effectively as traditional recorded videos (RV). We propose an instructional video generation pipeline that includes customized GPT (Generative Pre‐trained Transformer), text‐to‐speech and lip synthesis techniques to generate videos from slides and a clip or a photo of a human instructor. Seventy‐six students were randomly assigned to learn English words using either AIIV or RV, with performance assessed by retention, transfer and subjective measures from cognitive, emotional, motivational and social perspectives. The findings indicate that the AIIV group performed as well as the RV group in facilitating learning, with AIIV showing higher retention but no significant differences in transfer. RV was found to offer a stronger sense of social presence. Although other subjective measures were similar between the two groups, AIIV was perceived as slightly less favourable. However, the AIIV was still found to be moderately to highly attractive, addressing concerns related to the uncanny valley effect. This research demonstrates that AIGC can be an effective tool for learning, offering valuable implications for the use of GAI in educational settings. Practitioner notes What is already known about this topic Instructional videos, especially those featuring a teacher's presence, have been widely used in second language learning to facilitate learning. Producing instructional videos is costly and burdensome. Generative AI has great potential for generating educational content. What this paper adds An AI‐generated instructional video (including generated lecture text, voice and appearance) demonstrated greater improvement in students' retention performance in English word learning than a traditional recorded video. Students perceived no significant differences between the AI‐generated instructional video and recorded video in satisfaction, motivation, trust, cognitive load, emotions and parasocial interaction dimensions, although the AI‐generated instructional video group reported slightly lower values. Despite AI‐generated instructional video eliciting a significantly lower value of social presence than recorded video, it led to a reduction in cognitive load and better performance. Implications for practice and/or policy We recommend using the AI‐generated instructional video in both physical and online classes for its positive effects on both learning achievement and learning experience. The findings indicate the equivalence principle in AI‐generated content, highlighting that the appearance, voice and lecture text generated by current AI technology have reached a certain level of quality.
... This phenomenon is said to be the voice effect. With the recent adoption of virtual humans, TTS voices might be as effective as human voices (Craig and Schroeder 2019;Chiou et al. 2020). With the increasing improvements in the field and the development of virtual humans (VHs), it is critical to understand how these technologies impact different sectors or the attitudes of the people. ...
Article
Full-text available
This article examines public perceptions of virtual humans across various contexts, including social media, business environments, and personal interactions. Using an experimental approach with 371 participants in the United Kingdom, this research explores how the disclosure of virtual human technology influences trust, performance perception, usage likelihood, and overall acceptance. Participants interacted with virtual humans in simulations, initially unaware of their virtual nature, and then completed surveys to capture their perceptions before and after disclosure. The results indicate that trust and acceptance are higher in social media contexts, whereas business and general settings reveal significant negative shifts post-disclosure. Trust emerged as a critical factor influencing overall acceptance, with social media interactions maintaining higher levels of trust and performance perceptions than business environments and general interactions. A qualitative analysis of open-ended responses and follow-up interviews highlights concerns about transparency, security, and the lack of human touch. Participants expressed fears about data exploitation and the ethical implications of virtual human technology, particularly in business and personal settings. This study underscores the importance of ethical guidelines and transparent protocols to enhance the adoption of virtual humans in diverse sectors. These findings offer valuable insights for developers, marketers, and policymakers to optimise virtual human integration while addressing societal apprehensions, ultimately contributing to more effective and ethical deployment of virtual human technologies.
... It was emphasized that while learners may perceive differences in the voices of the presenters, these distinctions are insufficient to impact their trust in the information presented. Their subsequent research demonstrated that, despite participants reporting a preference for human voice, no significant differences were evident in learning outcomes or cognitive efficiency measures (Craig & Schroeder, 2019). Similarly, Davis et al. (2019) demonstrated that even in non-native contexts, modern computer-generated speech can be as effective as human voice in knowledge retention. ...
Article
Full-text available
The use of video and paper-based materials is commonly widespread in foreign language learning (FLL). It is well established that the level of acceptance of these materials influences learning outcomes, but there is lack of evidence regarding the use and related impact of videos generated by artificial intelligence (AI) on these aspects. This paper used linear mixed models and path analysis to investigate the influence of student acceptance of AI-generated short videos on learning outcomes compared to paper-based materials. Student acceptance was assessed based on perceived ease of use (PEU), perceived usefulness (PU), attitude (A), intentions (I), and concentration (C). The results indicate that both AI-generated short videos and paper-based materials can significantly enhance learning outcomes. AI-generated short videos are more likely to be accepted by students with lower pre-test scores and may lead to more significant learning outcomes when PEU, PU, A, I and C are at higher levels. On the other hand, paper-based materials are more likely to be accepted by students with higher pre-test scores and may lead to more significant learning outcomes when PEU, PU, A, I and C are at lower levels. These findings offer empirical evidence supporting the use of AI-generated short videos in FLL and provide suggestions for selecting appropriate learning materials in different FLL contexts.
... Synthetic speech has shown to be on par with human speech in terms of learning and word recognition (Craig & Schroeder, 2017;Schwab et al., 1985). For instance, Craig and Schroeder (2018) revealed no significant differences in learning outcomes or cognitive efficiency between the modern TTS voice and human voice groups. Additionally, Davis et al. (2019) showed no significant differences between modern TTS speech and human speech in the retention of non-native speakers. ...
Article
Full-text available
Background The integration of Text‐to‐Speech (TTS) and virtual reality (VR) technologies in K‐12 education is an emerging trend. However, little is known about how students perceive these technologies and whether these technologies effectively facilitate learning. Objectives This study aims to investigate the perception and effectiveness of TTS voices and VR agents in a K‐12 classroom setting, with a focus on information recall. Methods Using a recent TTS architecture, we developed four different synthetic voices based on 5, 10, 15 and 20 h of training materials. Two experiments were conducted involving students in a K‐12 setting. The first experiment examined students' evaluations of TTS voices with varying hours of training material and the impact on information recall. The second experiment assessed the effect of pairing TTS voices with a VR agent on students' perception and recall performance. Results and Conclusions Human voices received superior quality ratings over TTS voices within the classroom context. The integration of a VR agent was found to enhance the perception of TTS voices, aligning with existing literature on the positive impact of virtual agents on speech synthesis. However, this incorporation did not translate to improved recall, suggesting that the student focus may have been compromised by the VR agent's novelty and its design limitations.
Article
The rapid development of artificial intelligence technology has significantly improved the quality of computer-synthesized voices in modern text-to-speech (TTS) engines. Various appealing attributes can be added to these synthesized voices to support their widespread use in instructional videos. However, whether such synthesized voices can replace high-quality human-recorded voices remains uncertain. We conducted an eye-tracking experiment to examine the learning outcomes of instructional videos. We compared differences in learning performance, attentional engagement, and persona perceptions between a human-recorded voice and two computer-synthesized voices (formal and cute) generated by a modern TTS engine. Thirty university students participated in this study, with their eye movements recorded and analyzed as they watched instructional videos featuring different forms of narration. Overall, no statistically significant differences were found in persona perceptions between participants who learned from the human-recorded voice and those who learned from the two synthesized voices. However, the human-recorded voice significantly improved learning performance and attentional engagement. Our results indicate that while the quality of software-generated voices has reached a relatively high level of perception, it does not positively influence learning performance and attention. Therefore, we recommend that instructional video designers prioritize human-recorded voices over software-synthesized voices.
Research
Full-text available
Reading and writing skills, especially in the U.S., fall short of community expectations. To facilitate significant improvement in reading and writing, teachers need tools that leverage digital technology to increase student engagement while practicing skills in both domains. Pixton is a comic creation platform where learners can tap into their creativity and build stories customized to their interests. Pixton can be used as a writing tool to improve English learning, add fun to grammar and vocabulary lessons, and help learners understand how to organize ideas into coherent writing. This report presents ESSA Evidence for the research base of Pixton comic maker for Level 4 or IV, Demonstrates a Rationale, including a logic model and literature review that connects academic research studies to features in the product that support learning outcomes. In addition, Pixton has had multiple peer-reviewed studies demonstrating its effectiveness in a variety of settings. Those studies bring Pixton's ESSA Evidence to Moderate, Level 2.
Article
The goal of this study was to determine whether learning processes and outcomes are affected by the emotional tone of the computer-generated voice in a narrated slideshow. In a between-subjects design, participants viewed a narrated slideshow on lightning formation involving a computer-generated female (Experiment 1) or male voice (Experiment 2) that displayed happy, content, angry, or sad emotion. On subsequent surveys and post-tests, students could recognize positive or negative emotions conveyed by the female voice but not the male voice. The emotional tone of the instructor’s voices had minimal impact on students’ ratings of felt emotion during learning, ratings of social connection with the instructor, or scores on retention and transfer tests of learning outcome. The findings highlight the limitations of computer-generated voices to convey emotions that trigger affective, social, and cognitive processes when an onscreen instructor is absent, thereby suggesting a boundary condition for the cognitive-affective model of e-learning.
Article
Full-text available
The current paper investigates an essential design component of virtual humans, the voice they communicate with, by examining the impact of varied voice types. A standard voice effect has held that human voices should be paired with virtual humans. The current study revisits this effect. In a randomized trial, virtual humans used one of three voice types (classic and modern text-to-speech engines, as well as human voice) to present information to a sample of participants from an online population. The impact of each voice type on learning, cognitive load, and perceptions of the virtual human were examined. The study found that the modern voice engine produced significantly more learning on transfer outcomes, had greater training efficiency, and was rated at the same level as an agent with a human voice for facilitating learning and credibility while outperforming the older speech engine. These results call into question previous results using older voice engines and the claims of the voice effect.
Conference Paper
Dialectal differences are one explanation for the systematically lower test scores of children of color compared with their Euro-American peers. In this work, we explore the relationship between academic performance and dialect differences exhibited in a learning environment by assessing 3rd grade students’ science performance after they interacted with a “distant peer” technology that employed one of three dialect use patterns. We found that our participants, all native speakers of African American Vernacular English (AAVE), demonstrated the strongest science performance when the technology used AAVE features consistently throughout the interaction. These results call for a re-examination of the cultural assumptions underlying the design of educational technologies, with a specific emphasis on the way in which we present information to culturally underrepresented groups.
Article
Mechanical Turk (MTurk), an online labor market created by Amazon, has recently become popular among social scientists as a source of survey and experimental data. The workers who populate this market have been assessed on dimensions that are universally relevant to understanding whether, why, and when they should be recruited as research participants. We discuss the characteristics of MTurk as a participant pool for psychology and other social sciences, highlighting the traits of the MTurk samples, why people become MTurk workers and research participants, and how data quality on MTurk compares to that from other pools and depends on controllable and uncontrollable factors.
Article
Vocabulary for describing the structures, roles, and relationships characteristic of traditional, or ‘offline’, education has been seamlessly applied to the design of ‘online’ education. One example is the lecture, delivered as a video recording. The purpose of this research is to consider the concept of the ‘lecture’ as realized in both offline and online contexts. We explore how media differences entail different student experiences and how these differences relate to the design decisions associated with each. We first identify five features of traditional lecturing that have been invoked to understand its impact. We then describe a taxonomy of online lecture design derived from digital artefacts published within web-based courses. Analysis of this taxonomy reveals six design features that configure the experience of lectures differently in the two presentational formats: classroom and video. Awareness of these differences is important for the practitioner who is now increasingly involved in developing network-based resources for learning.
Book
For hundreds of years verbal messages such as lectures and printed lessons have been the primary means of explaining ideas to learners. Although verbal learning offers a powerful tool, this book explores ways of going beyond the purely verbal. Recent advances in graphics technology have prompted new efforts to understand the potential of multimedia and multimedia learning as a means of promoting human understanding. In Multimedia Learning, Second Edition, Richard E. Mayer asks whether people learn more deeply when ideas are expressed in words and pictures rather than in words alone. He reviews twelve principles of instructional design that are based on experimental research studies and grounded in a theory of how people learn from words and pictures. The result is what Mayer calls the cognitive theory of multimedia learning, a theory introduced in the first edition of Multimedia Learning and further developed in The Cambridge Handbook of Multimedia Learning.
Conference Paper
Though one of the main benefits of educational technologies is their ability to provide personalized instruction, many systems are still built with a one-size-fits-all approach to culture. In our work, we have demonstrated that there may be learning benefits when technologies use the same non-standard dialects as their students, but that educators are likely to be initially resistant to technologies that bring non-standard dialect practices into the classroom. Based on what we have uncovered about teachers’ needs and expectations regarding this type of classroom technology, our future work will investigate how systems designed to align with these needs may be able to support both students and teachers in addressing this complex educational problem and promote a positive classroom culture.
Chapter
Video presentations, also referred to as mini-lectures, micro-lectures, or simply video lectures, are becoming more prominent among the strategies used in hybrid or fully online teaching. Whether they are interested in imitating the Khan Academy style of presenting content or are responding to other pedagogical or administrative needs, more instructors than ever are considering creating short video lectures for their courses. This chapter examines the use of video lectures in online and hybrid courses, describes their design and application in graduate and undergraduate courses, and analyzes primary and secondary data to expose the strengths, weaknesses, opportunities, and challenges experienced in developing and implementing this technique. The use of short video lectures is a regular practice in MOOCs and has the potential to become a successful practice more broadly, especially with the expansion of new approaches such as the flipped classroom.
Chapter
Social cues may prime social responses in learners that lead to deeper cognitive processing during learning and hence better test performance. The personalization principle is that people learn more deeply when the words in a multimedia presentation are in conversational style rather than formal style. This principle was supported in 14 out of 17 experimental tests, yielding a median effect size of d = 0.79. Some important boundary conditions are that the personalization principle may not apply to high-achieving students or long lessons. The voice principle is that people learn more deeply when the words in a multimedia message are spoken in a human voice rather than in a machine voice. This principle was supported in 5 out of 6 experimental comparisons, with a median effect size of d = 0.74. A possible boundary condition is that the voice principle may not apply when there are negative social cues such as low embodiment. The image principle is that people do not necessarily learn more deeply from a multimedia presentation when the speaker’s image is on the screen rather than not on the screen. This principle is based on 14 experimental tests in which half produced negative or negligible effects, yielding a median effect size of d = 0.20. The embodiment principle is that people learn more deeply when on-screen agents display humanlike gesturing, movement, eye contact, and facial expressions. In 11 out of 11 experimental comparisons, people performed better on transfer tests when they learned from a high-embodied agent than from a low-embodied agent, yielding a median effect size of d = 0.36. A possible boundary condition is that the embodiment principle may not apply when there are negative social cues such as a machine voice.
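For readers interpreting the median effect sizes quoted above, these are Cohen’s d values, the standardized difference between two condition means. As a reminder of how this statistic is conventionally computed (the group labels below are generic and not drawn from any of the studies summarized here):
\[ d = \frac{M_{1} - M_{2}}{SD_{\mathrm{pooled}}}, \qquad SD_{\mathrm{pooled}} = \sqrt{\frac{(n_{1} - 1)\,SD_{1}^{2} + (n_{2} - 1)\,SD_{2}^{2}}{n_{1} + n_{2} - 2}} \]
By Cohen’s widely used benchmarks, d values of roughly 0.2, 0.5, and 0.8 correspond to small, medium, and large effects, so the median effects reported for the personalization (d = 0.79) and voice (d = 0.74) principles are medium to large, while the d = 0.20 reported for the image principle is small.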