AI-based Multilingual Interactive Exam Preparation
Tim Schlippe and Jörg Sawatzki
IU International University of Applied Sciences
Abstract. Our previous analysis on 26 languages, which represent over 2.9 billion
speakers and 8 language families, demonstrated that cross-lingual automatic
short answer grading allows students to write answers in exams in their native
language and graders to rely on the scores of the system. With deviations of at
most 0.72 points out of 5 points (about 14%) on the corpus of the short answer
grading data set of the University of North Texas, our investigated natural
language processing (NLP) models show lower deviations than the human grader
variability (0.75 points, 15%). In this paper we describe our latest
analysis of the integration and application of these NLP models in interactive
training programs to optimally prepare students for exams: We present a
multilingual interactive conversational artificial intelligence tutoring system for
exam preparation. Our approach leverages and combines learning analytics,
crowdsourcing, and gamification, which allows us to automatically evaluate and adapt
the system as well as to motivate students and improve their learning
experience. To achieve an optimal learning effect and enhance the user
experience, we also tackle the challenge of explainability with the help of keyword
extraction and highlighting techniques. Our system is based on Telegram since
it can be easily integrated into massive open online courses and other online
study systems and already has more than 400 million users worldwide.
Keywords: artificial intelligence in education, cross-lingual short answer grad-
ing, conversational AI, keyword extraction, natural language processing.
1 Introduction
Access to education is one of people's most important assets and ensuring inclusive
and equitable quality education is goal 4 of the United Nations' Sustainable Development
Goals. Massive open online courses and other online study opportunities are
providing easier access to education for more and more people around the world.
However, one big challenge is still the language barrier: Most courses are offered in
English, but only 16% of the world population speaks English. To reach the rest of
the people with online study opportunities, courses would need to better support more
languages. The linguistic challenge is especially evident in written exams, which are
usually not provided in the student's native language. To overcome these inequities,
we present and analyze a multilingual interactive conversational artificial intelligence
tutoring system for exam preparation (multilingual exam trainer).
Our system is based on a Multilingual Bidirectional Encoder Representations from
Transformers model (M-BERT) and is able to fairly score free-text answers in 26
languages in a fully automatic way (en, ceb, sv, de, fr, nl, ru, it, es, pl, vi, ja, zh, ar,
uk, pt, fa, ca, sr, id, no, ko, fi, hu, cs, sh). Thus, foreign students can
write answers in their native language during exam preparation. Since the
multilingual NLP model we use has been pre-trained on a total of 104 languages, our
exam trainer can be easily extended with new languages.
Fig. 1 illustrates the concept of our multilingual exam trainer: Iteratively, an exam
question is displayed to the user. The user enters the answer (student answer) using
our chatbot interface. Then the student answer is processed with two AI models–the
multilingual automatic short answer grading (ASAG) model and the keyword match-
ing model–which deliver quantitative feedback in terms of a score and qualitative
feedback by displaying the model answer and highlighting the keywords matching the
student answer and the model answer.
Fig. 1. Concept of the AI-based Interactive Exam Preparation.
To evaluate our approach, we conducted a study where students, former students, and
people who enjoy continuing education used our implementation and then completed
a questionnaire.
In the next section, we present the latest approaches of other researchers for the
components of our multilingual exam trainer. Sec. 3 describes our specific implemen-
tation. Sec. 4 delineates our experimental setup. Our experiments and results are out-
lined in Sec. 5. We conclude our work in Sec. 6 and suggest further steps.
2 Related Work
The research area “AI in Education” addresses the application and evaluation of Arti-
ficial Intelligence (AI) methods in the context of education and training . One of
the main focuses of this research is to analyze and improve teaching and learning
processes with natural language processing (NLP) models. In the following sections,
we describe the use of NLP components in related work for multilingual NLP, ASAG,
conversational AI, and keyword extraction to address the challenges of our system.
2.1 Multilingual Natural Language Processing Models
To allow users of our system to answer the exam questions in their native language,
we used a multilingual NLP model and adapted it to the task of ASAG.
Multilingual NLP models are provided by multiple institutions, e.g., M-BERT, the RoBERTa
model, or XLM-R. They have the benefit that they can be adapted to a certain
task with task-specific labeled text data in one or more languages (transfer learning) and
then perform this learned task in other languages (cross-lingual transfer).
To give the users of our system quantitative feedback on their answers, we used
M-BERT as the basic multilingual model, which is “pre-trained from monolingual
corpora in 104 languages”, and adapted it to the task of cross-lingual ASAG.
2.2 Automatic Short Answer Grading
ASAG helps us provide feedback on the student answer in the form of a score. The
field of ASAG is becoming more relevant since many educational institutions–public
and private–already conduct their courses and examinations online [1,12].
A good overview of approaches in ASAG before the deep learning era is given in
.  investigate and compare state-of-the-art deep learning techniques for
ASAG. Their experiments demonstrate that systems based on BERT performed best
for English and German.  report that their multilingual RoBERTa model
shows a stronger generalization across languages on English and German.
We extend ASAG to 26 languages and use the smaller M-BERT  model to
conduct a larger study concerning the cross-lingual transfer .
2.3 Conversational AI
For the interaction with the users, we used a conversational AI that takes the input
from the users and sends messages based on a dialog flow. The messages of the con-
versational AI contain the exam question, the student answer score, the model answer
with highlighted keywords, information about the progress, and motivational phrases.
Conversational assistants in education enable learners to access data and services
and exchange information by simulating human-like conversations in the form of a
natural language dialogue on a given topic . There are various technologies,
frameworks, and services for building a conversational AI, such as Rasa , Google
DialogFlow  or Telegram .
Our conversational AI is based on Telegram since it can be easily integrated
into massive open online courses and other online study systems and already has more
than 400 million users worldwide. However, it can be ported to other chatbot
technologies as well. To provide our conversational AI’s messages in the students’
native languages, we translated them into our 26 languages using Google's Neural
Machine Translation System . An overview of the system’s BLEU scores over
languages is given in . We did not post-correct the translations, as we wanted to
check whether our system already delivers a good user interface in different
languages without manual correction.
2.4 Keyword Extraction and Semantic Similarity
To explain to our users the difference between student answer and model answer, we
highlight the keywords and their synonyms which are contained in both the student
answer and the model answer. This combines two tasks: keyword extraction and
semantic similarity.
Good overviews of automatic keyword and keyphrase extraction are provided in
 and . A survey of the evolution in semantic similarity is given in . The
latest trend for both tasks is to embed the words into a semantic vector space,
i.e., to work with word embeddings, since semantically similar words are located
close to each other in the vector space.
In our system spaCy is used to exclude stop words, convert the remaining
words into vectors, and compute the word similarities.
3 AI-based Interactive Exam Preparation
In this section we describe what components our multilingual exam trainer consists of
and how they were implemented.
3.1 Dialog Flow of our Conversational AI
Fig. 2 shows the dialog flow of our exam trainer with the following steps:
1. The user activates the conversational AI with the /start command.
2. The conversational AI welcomes the user and presents a list of 26 languages
to select from.
3. The conversational AI asks the user a question in the selected language.
4. The user types the answer (any of the 104 languages used in M-BERT is
possible).
5. The conversational AI gives feedback in terms of a score and highlights sim-
ilarities between student and model answer.
6. If the total points collected are equal to or greater than THRESHOLD, the goal
is reached and the game ends.
7. Otherwise, the user is presented with another student answer that he or she
needs to score, considering the given model answer.
8. Proceed with step 3.
Fig. 2. Dialog Flow of the AI-based Interactive Exam Preparation.
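The loop formed by steps 3 to 8 can be sketched as follows. This is a minimal illustration, not the Telegram implementation: the callbacks get_answer, grade and peer_review are hypothetical placeholders for the chat interface, the ASAG model and the peer-scoring task, and the greeting and language selection (steps 1 and 2) are omitted.

```python
from typing import Callable, Iterable, List, Tuple


def run_dialog(
    questions: Iterable[Tuple[str, str]],  # (exam question, model answer) pairs
    get_answer: Callable[[str], str],      # hypothetical: shows the question, returns the typed answer
    grade: Callable[[str, str], float],    # hypothetical scorer: (model answer, student answer) -> points
    peer_review: Callable[[], None],       # hypothetical: user scores another student's answer
    threshold: float,
) -> Tuple[float, List[float]]:
    """Run steps 3 to 8 until the collected points reach the threshold."""
    total = 0.0
    scores: List[float] = []
    for question, model_answer in questions:
        student_answer = get_answer(question)         # steps 3 and 4
        points = grade(model_answer, student_answer)  # step 5
        scores.append(points)
        total += points
        if total >= threshold:                        # step 6: goal reached, game ends
            break
        peer_review()                                 # step 7, then back to step 3 (step 8)
    return total, scores
```

Passing the scorer and the chat functions as parameters keeps the flow independent of the chatbot technology, in line with the remark that the system can be ported to other platforms.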
3.2 Gamification and Motivation
Users have the motivation to use our multilingual exam trainer to improve answering
open exam questions. However, studies have shown that gamification creates another
incentive in learning . To give the users of our system this further incentive, we
came up with the following gamification approach: Users are in space and have the
goal to fly with their spaceship from Earth to Mars. To get closer to Mars with the
spaceship, the users have to answer the displayed exam questions. The points for the
answers are converted into kilometers. With better answers, the users get more points
and get to Mars faster. Depending on the score achieved for the student answer, the user is
praised and motivated with phrases such as “Awesome, that gives us fuel for 3
million more kilometers” and is informed about the remaining distance. Fig. 3 illustrates
our gamification in the conversation between a Dutch user and our conversational AI.
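The conversion of points into kilometers could look like the following sketch. The conversion factor and the trip length are assumptions chosen for illustration only; the values used in our exam trainer are not given in this text, and progress_message is a hypothetical helper name.

```python
from typing import Tuple

KM_PER_POINT = 1_000_000          # assumed conversion factor (not from the paper)
DISTANCE_TO_MARS_KM = 55_000_000  # assumed trip length (not from the paper)


def progress_message(points: float, km_travelled: float) -> Tuple[float, str]:
    """Convert the points of one answer into distance and build a motivating message."""
    gained = points * KM_PER_POINT
    # Never fly past the destination
    km_travelled = min(km_travelled + gained, DISTANCE_TO_MARS_KM)
    remaining = DISTANCE_TO_MARS_KM - km_travelled
    message = (
        f"Awesome, that gives us fuel for {gained / 1e6:g} million more kilometers. "
        f"{remaining / 1e6:g} million kilometers to go."
    )
    return km_travelled, message
```

Calling progress_message(3, 0) under these assumed constants reports 3 million kilometers gained and 52 million kilometers remaining.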
3.3 Quantitative Feedback: Multilingual Automatic Short Answer Grading
The AI model which processes the student answers in their native language and
provides the user with quantitative feedback in terms of a score is based on M-BERT. The
model was downloaded and fine-tuned through the Transformers library. We trained for 6
epochs with a batch size of 8 using the AdamW optimizer with an initial learning rate
of 0.00004. We supplemented each fine-tuned BERT model with a linear regression
layer that outputs a prediction of the score given an answer. The model expects the
model answer and the student answer as input.
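As a compact reading of the scoring step, the regression layer can be written as below. This is a sketch under an assumption that is not stated in the text: the layer is taken to operate on a single pooled M-BERT representation h of the answer pair. The symbols w and b (weights and bias of the linear layer) and the predicted score ŝ are introduced here for illustration.

```latex
\hat{s} = \mathbf{w}^{\top}\mathbf{h} + b,
\qquad
\mathbf{h} = \text{M-BERT}(\text{model answer},\ \text{student answer})
```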
Fig. 3. Conversation with greeting, language selection, exam question,
student answer, scoring, model answer and motivation.
The ASAG data set of the University of North Texas  provided the exam questions,
model answers and training data for fine-tuning M-BERT. It contains 87 questions
with corresponding model answer and on average 28.1 manually graded answers per
question about the topic Data Structures from undergraduate studies.
After fine-tuning with this original English ASAG data, our model is able
to receive a model answer together with a student answer in one of the other 103
languages and return a score in terms of points, without the need for fine-tuning with
ASAG data in the other languages (cross-lingual transfer). However, since we found
that adding translations of the ASAG data in more languages further improves
fine-tuning, we added translations in the 5 languages German, Dutch, Finnish, Japanese,
and Chinese. With Mean Absolute Errors between 0.41 and 0.72 points out of 5
points in our analysis of the 26 covered languages, our model shows an even lower
discrepancy than the two graders who graded the ASAG corpus of the University of North
Texas, whose discrepancy was 0.75 points.
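For readers who want to reproduce the relation between points and percentages, the Mean Absolute Error and its share of the 5-point scale can be computed as in the following sketch. The score lists are toy values invented for illustration, not data from the corpus.

```python
from typing import Sequence

MAX_POINTS = 5.0  # answers are graded on a scale of up to 5 points


def mean_absolute_error(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Average absolute difference between predicted and reference scores."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


def as_share_of_scale(error_points: float) -> float:
    """Express an error in points as a percentage of the maximum score."""
    return 100.0 * error_points / MAX_POINTS


# Toy scores, invented for illustration only
model_scores = [4.5, 3.0, 5.0, 2.0]
human_scores = [5.0, 3.5, 4.5, 2.5]
mae = mean_absolute_error(model_scores, human_scores)
```

With this conversion, 0.72 points correspond to 14.4% and 0.75 points to 15% of the 5-point scale.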
To provide the exam questions and the model answers in our multilingual exam
trainer in 26 languages and to get the translations in the 5 listed languages for fine-
tuning M-BERT, we used Google's Neural Machine Translation System .
Google’s Neural Machine Translation System is also used by other researchers who
experiment with multilingual NLP models since it comes close to the performance of
professional translators .
3.4 Qualitative Feedback: Keyword Extraction and Highlighting
Fig. 5 shows the keyword highlighting in a snippet of the conversation between the
user and the chatbot. For simplicity, we have implemented keyword extraction and
highlighting for English only in our prototype. Porting the method to other languages
is possible using word vectors and a distance measure.
Our algorithm for keyword extraction and highlighting is shown in Fig. 6. Given
are the word tokens of the model answer and the word tokens of the student answer.
Fig. 5. Conversation with Keyword Highlighting.
# model_answer and student_answer are spaCy Doc objects (the tokenized answers);
# THRESHOLD is the similarity cutoff; highlighted collects matching token index pairs
highlighted = set()
# Iterate through all tokens in the model answer
for model_token in model_answer:
    # Process only alphabetic tokens that are not in the stop word list
    if not model_token.is_stop and model_token.is_alpha:
        # Iterate through tokens in the student's answer
        for answer_token in student_answer:
            # If the answer token is alphabetic and not a stop word,
            # highlight both tokens if
            # their vectors' cosine similarity exceeds the given threshold
            if not answer_token.is_stop and answer_token.is_alpha and \
                    model_token.similarity(answer_token) > THRESHOLD:
                highlighted.add((model_token.i, answer_token.i))
Fig. 6. Algorithm for keyword extraction and highlighting.
We iterate over the word tokens in the model answer and over the word tokens in the
student answer and skip stop words and word tokens which do not consist of
alphabetic characters. Each remaining word token of the model answer and the student
answer is converted into a word vector. Then the word vectors of the model answer
are compared with the word vectors of the student answer. If the cosine similarity between
two word vectors is higher than a threshold, we consider the words as synonyms and highlight them.
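To make the thresholding step concrete without requiring a trained embedding model, the following self-contained sketch replaces the spaCy word vectors with a few invented three-dimensional toy vectors. The stop word list, the vectors and the threshold value are assumptions for illustration only.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

STOP_WORDS = {"a", "the", "is", "of", "in"}  # tiny illustrative stop word list
TOY_VECTORS: Dict[str, Sequence[float]] = {  # invented vectors, not real embeddings
    "stack": (0.9, 0.1, 0.0),
    "pile": (0.8, 0.2, 0.1),
    "queue": (0.1, 0.9, 0.0),
    "lifo": (0.7, 0.0, 0.6),
}
THRESHOLD = 0.95  # assumed similarity cutoff


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if one of them has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def content_words(text: str) -> List[str]:
    """Whitespace-tokenize and keep alphabetic non-stop words that have a vector."""
    return [w for w in text.lower().split()
            if w.isalpha() and w not in STOP_WORDS and w in TOY_VECTORS]


def matching_keywords(model_answer: str, student_answer: str) -> Set[Tuple[str, str]]:
    """Return (model word, student word) pairs whose similarity exceeds THRESHOLD."""
    pairs: Set[Tuple[str, str]] = set()
    for m in content_words(model_answer):
        for s in content_words(student_answer):
            if cosine(TOY_VECTORS[m], TOY_VECTORS[s]) > THRESHOLD:
                pairs.add((m, s))
    return pairs
```

With these toy vectors, matching_keywords("a stack is lifo", "the pile of queue") pairs "stack" with "pile" and nothing else, since all other similarities stay below the threshold.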
3.5 Crowdsourcing and Peer-Reviewing
In order to continuously improve our multilingual ASAG model with high quality
human labeled training data in a crowdsourcing approach, the user also has the task of
scoring another student's answer as part of the game (step 7 in Sec. 3.1). Studies such
as  have shown that peer-based proofreading is as effective as a professional
proofreader. Consequently, the same student answer is shown to different
users. This peer-review process makes it possible to detect and filter outliers which
would have a negative impact on the model. However, this process also has another
advantage for the user: The student deals with the question again, but this time from a
different perspective.
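One simple way to filter outliers among several peer scores for the same student answer is sketched below. The median-based rule and the tolerance of 1.5 points are assumptions made for this illustration; the text does not specify which filtering method is used.

```python
from statistics import median
from typing import List, Optional, Sequence


def filter_outliers(scores: Sequence[float], max_deviation: float = 1.5) -> List[float]:
    """Keep the peer scores that lie within max_deviation points of the median."""
    if not scores:
        return []
    center = median(scores)
    return [s for s in scores if abs(s - center) <= max_deviation]


def consensus_score(scores: Sequence[float]) -> Optional[float]:
    """Average of the remaining scores, or None if no score survives the filter."""
    kept = filter_outliers(scores)
    return sum(kept) / len(kept) if kept else None
```

For the peer scores 4.0, 4.5, 4.0, 0.0 and 5.0, the score 0.0 is dropped and the remaining four scores average to 4.375, which could then serve as a label for further fine-tuning.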
4 Experimental Setup
In this section, we describe the structure of our questionnaire and the participants.
4.1 Questionnaire Design
To evaluate our approach, we conducted a study where students, former students, and
people who enjoy continuing education first tried our exam trainer and then complet-
ed a questionnaire. The study was conducted on a subset of the possible languages
and examined 5 different aspects: Learning experience, user experience, motivation,
quality of NLP models, and gamification. Our questionnaire contains the following
parts:
1. General questions about the scenario of a multilingual interactive conversa-
tional AI tutoring system for exam preparation.
2. Specific questions concerning our implementation.
3. Specific questions concerning extensions and improvements.
4. Personal questions (profile and demographic information).
To obtain detailed results, we asked for a score range where it makes sense. The score
range follows the rules of a forced choice Likert scale, which ranges from (1) strongly
disagree to (5) strongly agree.
4.2 Participants
51 people from 6 countries filled out our questionnaire, giving us a first impression of
the quality and impact of our system. Most were students from the University of Os-
nabrück, IU International University of Applied Sciences, Karlsruhe Institute of
Technology, and Karlsruhe University of Applied Sciences. These people tested our
exam trainer in German (64.7%), English (21.6%), Dutch (3.9%), Italian (3.9%),
French (3.9%), and Spanish (2.0%).
5 Experiments and Results
As described, our study examined 5 different aspects: Learning experience, user expe-
rience, motivation, quality of NLP models, and gamification.
5.1 Learning Experience
Fig. 7 shows that participants responded positively to questions about improving the
learning experience, meaningfulness, use, and helping fellow students–both in general
and for our implementation. The majority also believe that our implementation can
accelerate the learning process and that scoring other students’ answers is helpful.
There is a more divided opinion on the questions whether the exam trainer is good to
get familiar with the subject in the native language first, when the actual exam is in
English anyway, and whether it can help elderly people to study online. The differ-
ence in distribution for the last question about support for elderly people shows that
most participants generally rate it as "neutral", while our system scores a bit lower.
This feedback plus comments from the participants on this topic lead us to believe
that it is possible to optimize such a system in cooperation with elderly people.
Fig. 7. Learning Experience.
Fig. 8. Motivation.
Fig. 9. User Experience.
5.2 Motivation
Fig. 8 shows a tendency for such an exam trainer in general and for our implementa-
tion to motivate people to prepare more for exams.
5.3 User Experience
Fig. 9 indicates that the clear majority of participants find that our interface is easy to
use and that operating our exam trainer is fun.
5.4 Quality of Natural Language Processing Models
Fig. 10 shows that the clear majority of participants rate the machine-translated
questions as linguistically correct and understandable. This suggests that post-correcting
the translations is not necessary. The scoring by the ASAG model
was rated only average. Through the users' comments, we learned that many users had
randomly entered words as answers and received points for this. This was because
these random answers did not appear in the training data of our ASAG model and
therefore could not be scored correctly. The training data was taken from exams,
where usually no student dares to enter "I don't know" as an answer. Here we see
potential for improvement through the evaluation of other students and through
simple rules that evaluate such entries with 0 points. Our explainability approach with
keyword highlighting was well rated. However, we did not get as much feedback on it
because it was only implemented in English.
5.5 Gamification
Fig. 11 illustrates that the clear majority of participants like the story and the theme
of the game. This demonstrates that even with a simple story–like the trip to Mars–
and without special graphical features, a good gamification can be created.
Fig. 10. Quality of NLP Models.
Fig. 11. Gamification.
6 Conclusion and Future Work
We presented a multilingual interactive conversational AI tutoring system for exam
preparation which combines multilingual NLP components, ASAG, conversational
AI, keyword extraction, learning analytics, crowdsourcing, and gamification. With
this multilingual exam trainer, we received positive feedback in a survey regarding
learning experience, user experience, motivation, quality of NLP models, and gamifi-
cation. The results of our survey support our proof-of-concept where users have tested
6 languages so far. Future work may include the extension to other languages. In ad-
dition, we would like to further address the issue of explainability to provide even
better support to the users of our multilingual exam trainer. To optimize the dialog, it
could be investigated how to create a more emotional dialog in written form, e.g., by
visualizing voice characteristics and emotions in the textual representation [26,27].
References
1. Sawatzki, J., Schlippe, T.: Cross-Lingual Automatic Short Answer Grading. In: AIED,
Utrecht, Netherlands (2021).
2. Mohler, M., Bunescu, R., Mihalcea, R.: Learning to Grade Short Answer Questions using
Semantic Similarity Measures and Dependency Graph Alignments. In: ACL-HLT,
Portland, Oregon, USA (2011).
3. Statista: Number of monthly active Telegram users worldwide from March 2014 to April
2020 (2021), https://www.statista.com/statistics/234038/telegram-messenger-mau-users
4. United Nations: Sustainable Development Goals: 17 Goals to Transform our World
5. Statista: The Most Spoken Languages Worldwide in 2019 (2020),
6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirec-
tional Transformers for Language Understanding. In: NAACL-HLT, Minneapolis, Minne-
7. Libbrecht, P., Declerck, T., Schlippe, T., Mandl, T., Schiffner, D.: NLP for Student and
Teacher: Concept for an AI based Information Literacy Tutoring System. In: CIKM. Gal-
way, Ireland (2020).
8. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer,
L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach.
9. Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave,
E., Ott, M., Zettlemoyer, L., Stoyanov, V.: Unsupervised Cross-lingual Representation
Learning at Scale. arXiv:1911.02116 (2019).
10. Pires, T., Schlinger, E., Garrette, D.: How Multilingual is Multilingual BERT? In: ACL.
Florence, Italy (2019).
11. Burrows, S., Gurevych, I., Stein, B.: The Eras and Trends of Automatic Short Answer
Grading. In: IJAIED 25, 60–117 (2014).
12. Camus, L., Filighera, A.: Investigating Transformers for Automatic Short Answer
Grading. In: AIED. Cyberspace (2020).
13. Sawatzki, J., Schlippe, T., Benner-Wickner, M.: Deep Learning Techniques for Automatic
Short Answer Grading: Predicting Scores for English and German Answers. In: AIET,
Wuhan, China (2021).
14. Wölfel, M.: Towards the Automatic Generation of Pedagogical Conversational Agents
from Lecture Slides. In: EAI ICMTEL. Cyberspace (2021).
15. Bocklisch, T., Faulkner, J., Pawlowski, N., Nichol, A.: Rasa: Open Source Language
Understanding and Dialogue Management. arXiv:1712.05181 (2017).
16. Reyes, R., Garza, D., Garrido, L., De la Cueva, V., Ramirez, J.: Methodology for the Im-
plementation of Virtual Assistants for Education Using Google Dialogflow. In: MICAI.
Xalapa, Mexico (2019).
17. Setiaji, H., Paputungan, I.V.: Design of Telegram Bots for Campus Information Sharing.
In: ICITDA. Yogyakarta, Indonesia (2017).
18. Hasan, K.S.: Automatic Keyphrase Extraction: A Survey of the State of the Art. In: ACL.
Baltimore, Maryland, USA (2014).
19. Alami Merrouni, Z., Frikh, B., Ouhbi, B.: Automatic Keyphrase Extraction: A Survey and
Trends. In: JIIS. vol. 54, pp. 391–424 (2020).
20. Chandrasekaran, D., Mago, V.: Evolution of Semantic Similarity - A Survey. arXiv:
21. Honnibal, M., Montani, I.: spaCy. https://spacy.io
22. de Sousa Borges, S., Durelli, V.H.S., Reis, H.M., Isotani, S.: A Systematic
Mapping on Gamification Applied to Education. In: SAC. New York, NY, USA (2014).
23. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao,
Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, L.,
Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W.,
Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., Dean,
J.: Google's Neural Machine Translation System: Bridging the Gap between Human and
Machine Translation. CoRR abs/1609.08144 (2016).
24. Aiken, M.: An Updated Evaluation of Google Translate Accuracy. Studies in Linguistics
and Literature 3, 253 (2019).
25. Luo, H., Robinson, A., Park, J.-Y.: Peer Grading in a MOOC: Reliability, Validity, and
Perceived Effects. Online Learning: Official Journal of the Online Learning Consortium.
18. 1-14. 10.24059/olj.v18i2.429. (2014).
26. Wölfel, M., Schlippe, T., Stitz, A.: Voice Driven Type Design. In: SpeD. Bucharest, Ro-
27. Schlippe, T., Alessai, S., El-Taweel, G., Wölfel, M., Zaghouani, W.: Visualizing Voice
Characteristics with Type Design in Closed Captions for Arabic. In: Cyberworlds. Caen,