Chapter

AI-Based Visualization of Voice Characteristics in Lecture Videos’ Captions


Abstract

More and more educational institutions are making lecture videos available online. Since 100+ empirical studies document that captioning a video improves comprehension of, attention to, and memory for the video [1], it makes sense to provide those lecture videos with captions. However, studies also suggest that the words themselves contribute only 7% to making messages clear in human communication, while how we say those words (tone, intonation, and verbal pace) contributes 38% [2]. Consequently, in this paper, we address the question of whether an AI-based visualization of voice characteristics in captions helps students further improve the watching and learning experience in lecture videos. For the AI-based visualization of the speaker’s voice characteristics in the captions, we use the WaveFont technology [3–5], which processes the voice signal and intuitively displays loudness, speed and pauses in the subtitle font. Our survey of 48 students showed that in all surveyed categories—visualization of voice characteristics, understanding the content, following the content, linguistic understanding, and identifying important words—a significant majority of participants preferred the WaveFont captions for watching lecture videos.
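
To make the underlying signal processing concrete, the following minimal sketch shows how loudness and pauses could be extracted from a lecture audio track and aligned with caption timings. It is not the authors' WaveFont implementation; the librosa-based feature extraction, the silence threshold, and the 16 kHz sampling rate are assumptions for illustration only.

    import librosa
    import numpy as np

    def voice_characteristics(path, frame_length=2048, hop_length=512):
        # Load the lecture audio (resampled to 16 kHz, a common speech rate).
        y, sr = librosa.load(path, sr=16000)

        # Loudness proxy: root-mean-square energy per analysis frame.
        rms = librosa.feature.rms(y=y, frame_length=frame_length,
                                  hop_length=hop_length)[0]

        # Pause proxy: frames whose energy falls below a hypothetical
        # threshold relative to the median energy of the recording.
        is_pause = rms < 0.1 * np.median(rms)

        # Timestamps per frame, e.g. for alignment with caption word timings;
        # speaking speed could then be estimated as words per unit time.
        times = librosa.frames_to_time(np.arange(len(rms)), sr=sr,
                                       hop_length=hop_length)
        return times, rms, is_pause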


... Generally, educational videos are proposed to be effective to the extent that they acknowledge human cognitive processes (Cognitive Theory of Multimedia Learning, see Mayer, 2014a). For example, humans have distinct information-processing channels for visual and auditory content (Mayer, 2014a), which has been studied with regard to the influence of captions and subtitles (Schlippe et al., 2023; Tarchi et al., 2021) or human vs. computer-generated voices (Craig & Schroeder, 2019). Other research has investigated human vs. virtual instructors (e.g., Davis, 2018), finding that emotional states displayed by human instructors were easier for learners to detect than emotional states displayed by virtual instructors and positively affected students' emotional and motivational states (Horovitz & Mayer, 2021). ...
Article
Full-text available
Background: Research on on-screen instructor videos in education has highlighted the role of embodied social cues for students' interest and motivation. As essential components of nonverbal communication, variations in instructors' body postures may enhance teaching and stimulate learning by affecting students' perception and attitudes.
Aims: We investigate how an instructor's posture influences students' perceptions of and attitudes towards an instructor in a video, as well as their interest and motivation regarding the topic.
Sample: University students participated online in a pilot (N = 194), a complementary (audio track-comparison; N = 53), and a preregistered (N = 434) experiment.
Methods: Participants were randomly assigned to watch one of four videos in which the instructor's posture was varied regarding verticality (upright vs. slumped) and horizontality (open vs. closed). We assessed students' perceptions of the instructor's enthusiasm, agency, and communion, liking and respect for the instructor, and situational interest and motivation.
Results: While perceived enthusiasm, agency, communion, and students' liking were affected by both the vertical and the horizontal dimension, students' respect was only influenced by the horizontal dimension. Regarding situational interest and motivation, we found indirect-only mediation effects of both posture dimensions, mediated through perceived enthusiasm. Further mediation analyses indicated that the vertical dimension affected respect indirectly and the horizontal dimension affected liking, both mediated through perceptions of agency and communion.
Conclusions: Our study demonstrates that instructors' body postures, as embodied social cues in educational videos, affect students' perceptions of and attitudes towards the instructor, which in turn shape students' interest and motivation.
Conference Paper
Full-text available
Usually employers, job seekers and educational institutions use AI in isolation from one another. However, skills are the common ground between these three parties and can be analyzed with the help of AI: (1) Employers want to automatically check which of their required skills are covered by applicants' CVs and know which courses their employees can take to acquire missing skills. (2) Job seekers want to know which skills from job postings are missing in their CV, and which study programs they can take to acquire missing skills. (3) In addition, educational institutions want to make sure that skills required in job postings are covered in their curricula, and they want to recommend study programs. Consequently, we investigated several natural language processing techniques to extract, vectorize, cluster and compare skills, thereby connecting and supporting employers, job seekers and educational institutions. Our application Skill Scanner uses our best algorithms and outputs statistics and recommendations for all groups. The results of our survey demonstrate that the majority finds that, with the help of Skill Scanner, processes related to skills are carried out more effectively, faster, more fairly, more explainably, and with more support. 89% of all participants are not averse to applying our recommendation system for their tasks. 67% of job seekers would certainly use it.
Conference Paper
Full-text available
We describe our work on sentiment analysis for Hausa, where we investigated monolingual and cross-lingual approaches to classify student comments in course evaluations. Furthermore, we propose a novel stemming algorithm to improve accuracy. For studies in this area, we collected a corpus of more than 40,000 comments: the Hausa-English Sentiment Analysis Corpus For Educational Environments (HESAC). Our results demonstrate that the monolingual approaches for Hausa sentiment analysis slightly outperform the cross-lingual systems. Using our stemming algorithm in the pre-processing even improved the best model, resulting in 97.4% accuracy on HESAC.
Chapter
Full-text available
Massive open online courses and other online study opportunities are providing easier access to education for more and more people around the world. However, one big challenge is still the language barrier: Most courses are available in English, but only 16% of the world’s population speaks English [1]. The language challenge is especially evident in written exams, which are usually not provided in the student’s native language. To overcome these inequities, we analyze AI-driven cross-lingual automatic short answer grading. Our system is based on a Multilingual Bidirectional Encoder Representations from Transformers model [2] and is able to fairly score free-text answers in 26 languages in a fully automatic way, with the potential to be extended to 104 languages. Augmenting training data with machine-translated task-specific data for fine-tuning even improves performance. Our results are a first step toward allowing more international students to participate fairly in education. Keywords: Cross-lingual automatic short answer grading, artificial intelligence in education, natural language processing, deep learning.
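
As an illustration of the general approach (not the authors' exact model or training setup), a cross-lingual grader can be sketched as a regression head on top of multilingual BERT, here assuming the Hugging Face transformers library; the checkpoint name and sentence-pair encoding are assumptions:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Multilingual BERT covers 104 languages; a single regression output
    # predicts the score. The model must be fine-tuned on graded answer
    # pairs before its predictions are meaningful.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=1, problem_type="regression")

    def predict_score(reference_answer: str, student_answer: str) -> float:
        # Encode reference answer and student answer as a sentence pair.
        inputs = tokenizer(reference_answer, student_answer,
                           truncation=True, return_tensors="pt")
        with torch.no_grad():
            return model(**inputs).logits.item()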
Chapter
Full-text available
We investigate and compare state-of-the-art deep learning techniques for Automatic Short Answer Grading. Our experiments demonstrate that systems based on Bidirectional Encoder Representations from Transformers (BERT) [1] performed best for English and German. Our system achieves a Pearson correlation coefficient of 0.73 and a Mean Absolute Error of 0.4 points on the Short Answer Grading data set of the University of North Texas [2]. On our German data set we report a Pearson correlation coefficient of 0.78 and a Mean Absolute Error of 1.2 points. Our approach has the potential to greatly simplify the work of proofreaders and to be used in learning systems that prepare students for exams: 31% of the student answers are graded correctly, and in 40% of cases the system deviates on average by only 1 point out of 6, 8 and 10 points. Keywords: Automatic short answer grading, artificial intelligence in education, natural language processing, deep learning.
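
The reported evaluation metrics can be computed from system scores and human reference grades with standard tools; the grades below are made-up placeholders, not the paper's data:

    from scipy.stats import pearsonr
    from sklearn.metrics import mean_absolute_error

    human  = [4.0, 5.0, 2.5, 0.0, 3.0]   # hypothetical human grades
    system = [3.5, 5.0, 2.0, 1.0, 3.0]   # hypothetical system predictions

    r, _ = pearsonr(human, system)            # cf. the reported r of 0.73
    mae = mean_absolute_error(human, system)  # cf. the reported MAE of 0.4
    print(f"Pearson r = {r:.2f}, MAE = {mae:.2f} points")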
Chapter
Full-text available
Our previous analysis of 26 languages, which represent over 2.9 billion speakers and 8 language families, demonstrated that cross-lingual automatic short answer grading allows students to write exam answers in their native language and graders to rely on the scores of the system [1]. With deviations lower than 14% (0.72 out of 5 points) on the short answer grading data set of the University of North Texas [2], our natural language processing models perform better than the human grader variability (0.75 points, 15%). In this paper we describe our latest analysis of the integration and application of a multilingual model in interactive training programs to optimally prepare students for exams. We present a multilingual interactive conversational artificial intelligence tutoring system for exam preparation. Our approach leverages and combines learning analytics, crowdsourcing and gamification to automatically evaluate and adapt the system as well as to motivate students and increase their learning experience. In order to achieve an optimal learning effect and enhance the user experience, we also tackle the challenge of explainability with the help of keyword extraction and highlighting techniques. Our system is based on Telegram, since it can be easily integrated into massive open online courses and other online study systems and already has more than 400 million users worldwide [3].
Article
Full-text available
Recent developments have seen a significant increase in the number of educational videos being made, mostly for use as a resource at a range of educational levels and in different specializations. Currently, many universities either provide videos as supplementary resources or offer entire courses as online learning materials. The qualitative study this paper presents was conducted to answer the following questions: “How have educational videos (lectures/tutorials) published on YouTube affected university students’ studies at both postgraduate and undergraduate levels?” and “Would it be better to upload these types of videos onto a university website?” The aim was to explore the experiences of students from two universities (one a high-ranking university and one from a developing country) regarding online educational videos and to assess the extent to which these kinds of videos influence their studies. The data collection method used was individual interviews with students from the two universities to gather their perspectives, opinions, and aspirations regarding such videos. The results section analyzes and discusses the students’ varied opinions. Based on the research findings, several recommendations are made towards a usable design for adding videos to university websites. Finally, the paper discusses how this study’s findings contribute during the COVID-19 pandemic.
Conference Paper
Full-text available
We present the concept of an intelligent tutoring system which combines web search for learning purposes and state-of-the-art natural language processing techniques. Our concept is described for the case of teaching information literacy, but has the potential to be applied to other courses or to independent acquisition of knowledge through web search. The concept supports both students and teachers. Furthermore, the approach integrates issues like AI explainability, privacy of student information, assessment of the quality of retrieved information, and automatic grading of student performance.
Conference Paper
Full-text available
Diversifying the fonts of video captions based on voice characteristics, namely loudness, speed and pauses, can affect how the viewer receives the content. This study evaluates a new method, WaveFont, which visualizes voice characteristics in captions in an intuitive way. The study was specifically designed to test Arabic captions, aiming to add a new experience for Arabic viewers. The results indicate that our visualization is comprehensible and acceptable and provides significant added value, for hearing-impaired and non-hearing-impaired participants alike: Significantly more participants stated that WaveFont, rather than standard captions, improves their watching experience.
Article
Full-text available
This study analyzed four widely used videoconferencing systems: Zoom, Skype, Microsoft Teams, and WhatsApp. Using experiential e-learning as the framework for analysis, this study examined the systems' general characteristics, learning-related features, and usability. We conducted an analytical evaluation and analyzed system features in regard to their impact on the quality of the online educational experience. The results of this analysis provide guidance for selecting effective videoconferencing systems to support learning. They also offer insights on ways to explore teaching approaches and pedagogies for distance education. This paper offers a set of recommendations as well as suggestions for videoconferencing system improvements.
Article
Full-text available
The purpose of this study was to assess the impact of Artificial Intelligence (AI) on education. Premised on a narrative and framework for assessing AI identified from a preliminary analysis, the scope of the study was limited to the application and effects of AI in administration, instruction, and learning. A qualitative research approach, leveraging literature review as the research design, was used and effectively facilitated the realization of the study's purpose. Artificial intelligence is a field of study, and its resulting innovations and developments have culminated in computers, machines, and other artifacts having human-like intelligence characterized by cognitive abilities, learning, adaptability, and decision-making capabilities. The study ascertained that AI has been extensively adopted and used in education, particularly by educational institutions, in different forms. AI initially took the form of computers and computer-related technologies, transitioning to web-based and online intelligent education systems, and ultimately, with the use of embedded computer systems together with other technologies, to humanoid robots and web-based chatbots that perform instructors’ duties and functions independently or alongside instructors. Using these platforms, instructors have been able to perform different administrative functions, such as reviewing and grading students’ assignments, more effectively and efficiently, and to achieve higher quality in their teaching activities. On the other hand, because the systems leverage machine learning and adaptability, curricula and content have been customized and personalized in line with students’ needs, which has fostered uptake and retention, thereby improving the learners’ experience and the overall quality of learning.
Article
Full-text available
Type is not expressive enough. Even the youngest speakers are able to express a full range of emotions with their voice, while young readers read aloud monotonically as if to convey robotic boredom. We augmented type to convey expression similarly to our voices. Specifically, we wanted to convey in text words that are spoken louder, words that are drawn out and spoken longer, and words that are spoken at a higher pitch. We then asked children to read sentences with these new kinds of type to see if children would read them with greater expression. We found that children would ignore the augmentation if they weren’t explicitly told about it. But when children were told about the augmentation, they were able to read aloud with greater vocal inflection. This innovation holds great promise for helping both children and adults to read aloud with greater expression and fluency.
Article
Full-text available
A growing body of research explores emoji, which are visual symbols in computer-mediated communication (CMC). In the 20 years since the first set of emoji was released, research on them has been increasing, albeit in a variety of directions. In this review article, we provide a systematic review of the extant body of work on emoji, covering how they have developed, how they are used differently, what functions they serve, and what research has been conducted on them in different domains. Furthermore, we summarize directions for future research on this topic.
Article
Full-text available
The use of videos as learning objects has increased together with an increased variation in the designs of these educational videos. However, to create effective learning objects it is important to have detailed information about how users perceive and interact with the different parts of the multimedia design. In this paper we study, using eye-tracker technology, how fast and for how long viewers focus on captions and written text in a video. An educational video on thermodynamics was created where captions were used to highlight important concepts. Screen recordings of written text from a tablet were used to illustrate mathematical notations and calculations. The results show that there is a significant delay of about 2 seconds before viewers focus on graphical objects that appear, both for captions and for written text. For captions, the viewers focus on the element for 2-3 seconds, whereas for written text blocks, it is strongly dependent on the amount and quality of the presented information. These temporal aspects of the viewers’ attention will be important for the proper design of educational videos to achieve appropriate synchronization between graphical objects and narration and thereby supporting learning.
Article
Full-text available
Despite the ubiquitous use of videos in online learning and a vast literature on designing online learning, questions remain largely unanswered as to what pedagogical strategies and production guidelines should be used to design and develop instructional videos for effective student learning. In this study, the instructors of an online graduate course experimented with a 10-principle model incorporating four pedagogical strategies, four instructional phases, and two production guidelines in designing and developing video lessons. Feedback was collected from students through a course survey on their perceptions of the effectiveness of the video lessons over eight semesters since the course was first offered in Fall 2014. This paper shares the instructors’ experience with the design and development of the video lessons as well as the survey findings. Implications of the findings for instructional design and future research are also discussed.
Article
Full-text available
Video captions, also known as same-language subtitles, benefit everyone who watches videos (children, adolescents, college students, and adults). More than 100 empirical studies document that captioning a video improves comprehension of, attention to, and memory for the video. Captions are particularly beneficial for persons watching videos in their non-native language, for children and adults learning to read, and for persons who are D/deaf or hard of hearing. However, despite U.S. laws, which require captioning in most workplace and educational contexts, many video audiences and video creators are naïve about the legal mandate to caption, much less the empirical benefit of captions.
Conference Paper
Full-text available
With voice-driven type design (VDTD), we introduce a novel concept for presenting written information in the digital age. While the shape of a single typographical character has been treated as an unchangeable property until today, we present an innovative method to adjust the shape of each single character according to particular acoustic features of the spoken reference. Thereby, we preserve some individuality and gain additional value in written text, which offers different applications – providing meta-information in subtitles and chats, supporting deaf and hearing-impaired people, illustrating intonation and accentuation in books for language learners, giving hints on how to sing – up to artistic expression. In a user study we have demonstrated that, using our proposed approach, loudness, pitch and speed can be represented visually by changing the shape of each character. By complementing homogeneous type design with these parameters, the original intention and characteristics of the speaker (personal expression and intonation) are better supported.
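
As a toy illustration of the idea (not the VDTD rendering itself, which reshapes individual glyphs rather than switching between standard font weights), a per-word loudness value could be mapped to caption typography as sketched below; the normalization and HTML markup are assumptions:

    def word_to_html(word: str, loudness: float) -> str:
        # `loudness` is assumed to be normalized to [0, 1]; map it to a CSS
        # font-weight between 300 (quiet) and 900 (loud) in valid 100-steps.
        weight = 300 + round(loudness * 6) * 100
        return f'<span style="font-weight:{weight}">{word}</span>'

    # Louder words are rendered heavier in the caption line.
    caption = " ".join(word_to_html(w, l) for w, l in
                       [("Listen", 0.9), ("very", 0.4), ("carefully", 0.7)])
    print(caption)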
Conference Paper
Full-text available
Subtitles (closed captions) on television are typically placed at the bottom-centre of the screen. However, placing subtitles in varying positions, according to the underlying video content ('dynamic subtitles'), has the potential to make the overall viewing experience less disjointed and more immersive. This paper describes the testing of such subtitles with hearing-impaired users, and a new analysis of previously collected eye-tracking data. The qualitative data demonstrates that dynamic subtitles can lead to an improved User Experience, although not for all types of subtitle user. The eye-tracking data was analysed to compare the gaze patterns of subtitle users with a baseline of those for people viewing without subtitles. It was found that gaze patterns of people watching dynamic subtitles were closer to the baseline than those of people watching with traditional subtitles. Finally, some of the factors that need to be considered when authoring dynamic subtitles are discussed.
Conference Paper
Full-text available
Videos are a widely-used kind of resource for online learning. This paper presents an empirical study of how video production decisions affect student engagement in online educational videos. To our knowledge, ours is the largest-scale study of video engagement to date, using data from 6.9 million video watching sessions across four courses on the edX MOOC platform. We measure engagement by how long students are watching each video, and whether they attempt to answer post-video assessment problems. Our main findings are that shorter videos are much more engaging, that informal talking-head videos are more engaging, that Khan-style tablet drawings are more engaging, that even high-quality pre-recorded classroom lectures might not make for engaging online videos, and that students engage differently with lecture and tutorial videos. Based upon these quantitative findings and qualitative insights from interviews with edX staff, we developed a set of recommendations to help instructors and video producers take better advantage of the online video format. Finally, to enable researchers to reproduce and build upon our findings, we have made our anonymized video watching data set and analysis scripts public. To our knowledge, ours is one of the first public data sets on MOOC resource usage.
Article
Full-text available
The author investigated caption use, sound, and the reading behavior of 76 children who had just completed 2nd grade. The present study indicated that beginning readers recognize more words when they view television that uses captions. The auditory element was important for comprehension tasks related to incidental elements and spontaneous use of target words, and the combination of captions and sound helped children identify the critical story elements in the video clips. Positive beliefs about one's competence in reading or watching television appeared to facilitate the recognition of words and, for boys, improve their oral reading rates. In sum, television captions, by evoking efforts to read, appeared to help a child focus on central story elements and away from distracting information, including sound effects and visual glitz. Implications are discussed.
Conference Paper
Full-text available
Many Deaf and Hearing Impaired people use subtitles to gain access to audio content in television and film presentations. Although subtitles tell the viewer what is being said, they fail to communicate how it is being said. This "emotional gap" experienced by viewers highlights a significant drawback of current subtitling, especially when used for learning by the Deaf. In this paper we introduce a system that demonstrates the presentation of subtitles that depict the emotions behind the words used on screen. The system also provides viewers with the ability to personalize and adapt their interaction with subtitles, so as to assist them in their learning. Using this system we hope to conduct a series of surveys looking at how people receive and use subtitles. In conducting this research we aim to gain a comprehensive understanding of the issues associated with emotional subtitling and to provide guidance for future producers of subtitled materials.
Chapter
Usually employers, job seekers and educational institutions use AI in isolation from one another. However, skills are the common ground between these three parties and can be analyzed with the help of AI. Employers want to automatically check which of their required skills are covered by applicants’ CVs and know which courses their employees can take to acquire missing skills. Job seekers want to know which skills from job postings are missing in their CV and which study programs they can take to acquire missing skills. In addition, educational institutions want to make sure that skills required in job postings are covered in their curricula, and they want to recommend study programs. Consequently, we investigated several natural language processing techniques to extract, vectorize, cluster and compare skills, thereby connecting and supporting employers, job seekers and educational institutions. Our application Skill Scanner uses our best algorithms and outputs statistics and recommendations for all groups. The results of our survey demonstrate that the majority finds that, with the help of Skill Scanner, processes related to skills are carried out more effectively, faster, more fairly, more explainably, and with more support. In total, 89% of all participants are not averse to applying our recommendation system for their tasks, and 67% of job seekers would certainly use it.
Chapter
Skills are the common ground between employers, job seekers and educational institutions and can be analyzed with the help of natural language processing (NLP) techniques. In this paper we explore a state-of-the-art pipeline that extracts, vectorizes, clusters, and compares skills to provide recommendations for all three parties—thereby bridging the gap between employers, job seekers and educational institutions. Our best system combines Sentence-BERT [1], UMAP [2], DBSCAN [3], and K-means clustering [4]. Keywords: AI in education, recommender system, recommendation system, up-skilling, natural language processing.
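
A minimal sketch of such a pipeline is shown below, assuming the sentence-transformers, umap-learn, and scikit-learn libraries; the checkpoint, parameters, and toy skill list are illustrative, not the authors' configuration:

    from sentence_transformers import SentenceTransformer
    import umap
    from sklearn.cluster import KMeans

    skills = ["machine learning", "deep learning", "project management",
              "team leadership", "data analysis"]

    # 1. Vectorize skill phrases with a Sentence-BERT model.
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(skills)

    # 2. Reduce dimensionality with UMAP before clustering
    #    (n_neighbors kept small because of the tiny toy sample).
    reduced = umap.UMAP(n_components=2, n_neighbors=2).fit_transform(embeddings)

    # 3. Group similar skills with K-means; DBSCAN can be applied analogously.
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(reduced)
    print(dict(zip(skills, labels)))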
Chapter
Although corresponding technological and didactic models have been known for decades, the digitization of teaching has hardly advanced beyond simple non-interactive formats (e.g. downloadable slides provided within a learning management system). The COVID-19 crisis is changing this situation dramatically, creating a high demand for highly interactive formats and fostering exchange between conversation partners about the course content. Systems are required that are able to communicate with students verbally, to answer their questions, and to check the students’ knowledge. While technological advances have made such systems possible in principle, the showstopper is the large amount of manual work and knowledge that must be put into designing such a system and feeding it the right content. In this publication, we present a first system to overcome the aforementioned drawback by automatically generating a corresponding dialog system from slide-based presentations, such as PowerPoint, OpenOffice, or Keynote, which can be dynamically adapted to the respective students and their needs. Our first experiments confirm the proof of concept and reveal that such a system can be very handy for both respective groups, learners and lecturers alike. The limitations of the developed system, however, also remind us that many challenges need to be addressed to improve the feasibility and quality of such systems, in particular in the understanding of semantic knowledge.
Article
An increasing number of instructional videos online integrate a real instructor on the video screen. So far, the empirical evidence from previous studies has been limited and conflicting, and none of the studies have explored how learners' allocation of visual attention to the on-screen instructor influences learning and learner perceptions. Therefore, this study aimed to disentangle a) how instructor presence in online videos affects learning, learner perceptions (i.e., cognitive load, judgment of learning, satisfaction, situational interest), and visual attention distribution and b) to what extent visual attention patterns in instructor-present videos predict learning and learner perceptions. Sixty college students each watched two videos on Statistics, one on an easy topic and the other one on a difficult topic, with each in one of the two video formats: instructor-present or instructor-absent. Their eye movements were simultaneously registered using a desktop-mounted eye tracker. Afterwards, participants self-reported their cognitive load, judgment of learning, satisfaction, and situational interest for both videos, and feelings toward seeing the instructor for the instructor-present videos. Learning from the two videos was measured using retention and transfer questions. Findings indicated instructor presence a) improved transfer performance for the difficult topic, b) reduced cognitive load for the difficult topic, c) increased judgment of learning for the difficult topic, and d) enhanced satisfaction and situational interest for both topics. Most participants expressed a positive feeling toward the instructor. Results also showed the instructor attracted a considerable amount of overt visual attention in both videos, and the amount of attention allocated to the instructor positively predicted participants’ satisfaction level for both topics.
Chapter
The psychological construct of style in personality, cognition, and learning is explained in this article. The development of styles theory is interpreted as the evolution of a generic concept of individuality and its status as an individual difference in cognition and learning. The emergence of popular applications of learning styles, as well as a wave of critical revisionism and new directions in researching style differences, is related to this development. An application of style differences to lifelong learning in both education and the workplace is then considered, including implications for the further study of differential psychology, pedagogy, training, and the nature of an individual's personal approach to learning (style).
Article
In an experimental study, we analyzed the cognitive processing of a subtitled film excerpt by adopting a methodological approach based on the integration of a variety of measures: eye-movement data, word recognition, and visual scene recognition. We tested the hypothesis that the processing of subtitled films is cognitively effective: It leads to a good understanding of film content without requiring a significant tradeoff between image processing and text processing. Following indications in the psycholinguistic literature, we also tested the hypothesis that two-line subtitles whose segmentation is syntactically incoherent can have a disruptive effect on information processing and recognition performance. The results highlighted the effectiveness of subtitle processing: Regardless of the quality of line segmentation, participants had a good understanding of the film content, they achieved good levels of performance in both word and scene recognition, and no tradeoff between text and image processing was detected. Eye-movement analyses enabled a further characterization of cognitive processing during subtitled film viewing. This article discusses the theoretical implications of the findings for both subtitling and multiple-source communication and highlights their methodological and applied implications.
Conference Paper
Due to limitations of conventional text-based closed captions, expressions of paralinguistic and emotive information contained in the dialogue of television and film content are often missing. We present a framework for enhancing captions that uses animation and a set of standard properties to express five basic emotions. Using an action research method, the framework was developed from a designer’s interpretation and rendering of animated text captions for two content examples.
Conference Paper
The current method for speaker identification in closed captioning on television is ineffective and difficult in situations with multiple speakers, off-screen speakers, or narration. An enhanced captioning system was developed that uses graphical elements (e.g., avatar and colour), speaker names and caption placement techniques for speaker identification. A comparison between this system and conventional closed captions was carried out with deaf and hard-of-hearing participants. Results indicate that viewers are distracted when the caption follows the character on-screen, regardless of whether this should assist in identifying who is speaking. Using the speaker’s name for speaker identification is useful for viewers who are hard of hearing but not for deaf viewers. There was no significant difference in understanding, distraction, or preference for the avatar with the coloured-border component.
ASCCC Open Educational Resources Initiative (OERI)
  • J Marteney
Generation of text from an audio speech signal
  • T Schlippe
  • M Wölfel
  • A Stitz
Captioned Media: Teacher Perceptions of Potential Value for Students with No Hearing Impairments: A National Survey of Special Educators. Described and Captioned Media Program
  • F G Bowe
  • A Kaufman
Conveying emotions in Arabic SDH: The case of Pride and Prejudice
  • G El-Taweel
Integrated titles: An improved viewing experience
  • W Fox
Hidden bawls, whispers, and yelps: Can text be made to sound more than just its words?
  • C De Lacerda Pataca
  • P Dornhofer Paro Costa