Article

Text classification by CEFR levels using machine learning methods and BERT language model


Abstract

This paper presents a study of the problem of automatically classifying short coherent texts (essays) in English according to the levels of the international CEFR scale. Determining the level of a natural-language text is an important component of assessing students' knowledge, including checking open-ended tasks in e-learning systems. To solve this problem, vector text models based on stylometric numerical features at the character, word, and sentence-structure levels were considered. The resulting vectors were classified by standard machine learning classifiers; the article presents the results of the three most successful ones: the Support Vector Classifier, the Stochastic Gradient Descent Classifier, and Logistic Regression. Precision, recall, and F-score served as quality measures. Two open text corpora, CEFR Levelled English Texts and BEA-2019, were chosen for the experiments. The best classification results over the six CEFR levels and sublevels from A1 to C2 were shown by the Support Vector Classifier, with an F-score of 67% on the CEFR Levelled English Texts corpus. This approach was compared with the application of the BERT language model (six different variants). The best model, bert-base-cased, achieved an F-score of 69%. An analysis of classification errors showed that most of them occur between neighboring levels, which is quite understandable from the point of view of the domain. In addition, classification quality strongly depended on the text corpus: applying the same text models to different corpora produced significantly different F-scores. Overall, the obtained results demonstrate the effectiveness of automatic text level detection and the possibility of its practical application.
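The stylometric pipeline the abstract describes — numerical features at the character, word, and sentence-structure levels fed to a standard classifier — can be sketched roughly as follows. The four features and the nearest-centroid classifier here are simplified illustrative stand-ins: the paper's actual feature set is richer, and its best-performing classifier is the Support Vector Classifier.

```python
import math
import re

def stylometric_features(text):
    """Toy character-, word-, and sentence-level features
    (a simplified stand-in for the paper's feature set)."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_chars = sum(len(w) for w in words)
    return [
        n_chars / max(len(words), 1),                              # average word length
        len(words) / max(len(sentences), 1),                       # average sentence length
        len(set(w.lower() for w in words)) / max(len(words), 1),   # type-token ratio
        sum(text.count(c) for c in ",;:") / max(len(words), 1),    # punctuation rate
    ]

def nearest_centroid(train, test_vec):
    """Assign the CEFR label whose mean feature vector is closest in
    Euclidean distance; stands in for the SVC used in the paper."""
    centroids = {label: [sum(col) / len(vecs) for col in zip(*vecs)]
                 for label, vecs in train.items()}
    return min(centroids, key=lambda lb: math.dist(centroids[lb], test_vec))
```

With such features, a text of short simple sentences lands near an A-level centroid while long-worded, syntactically dense prose drifts toward the C levels, which is the intuition behind level detection from stylometry.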


... Automatic classification of language proficiency, particularly within the framework of the CEFR, has received considerable research attention. Much of this research has explored machine learning techniques for text classification to assess proficiency levels, with models such as BERT and LSTM used effectively to detect CEFR levels by analyzing complex linguistic patterns in English ([6], [7]). However, these models have mainly been applied without considering phraseological complexity, leaving a research gap for other languages. ...
Article
Full-text available
This paper presents a model for the automatic classification of writing proficiency in Italian as a second language (L2) according to the Common European Framework of Reference (CEFR) for languages. The proposed method integrates lexical and morphosyntactic quantitative analysis with phraseological dimensions. Phraseological aspects include the ability to use and understand fixed expressions, idioms, and other multiword units that are common in a language and reflect the depth of language comprehension typically manifested by native speakers. Specific techniques for encoding phraseological features have been introduced, and basic phraseological statistics, previously unavailable for Italy, have been extracted from an Italian corpus. The proposed model was experimentally compared with widely used machine-learning models using a dataset of written texts produced by non-native speakers for official Italian CEFR certification exams. The experimental results outperformed previous work on the CEFR classification of Italian L2 proficiency in terms of accuracy and all relevant prediction metrics, demonstrating the effectiveness of the proposed approach, which integrates morphosyntactic and phraseological features.
... Next, n-grams of parts of speech are computed for n = 1, 2, 3, 4. Among all n-grams in the corpus, the 40 most frequent are selected for each value of n, and for each individual text the frequency of occurrence of individual n-grams relative to the selected ones is computed. The described characteristics have already proven successful in text processing in the authors' previous works [24]. ...
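The POS n-gram scheme described in the snippet above (n-grams for n = 1..4, the 40 most frequent per n over the corpus, per-text relative frequencies) can be sketched as follows. The input is assumed to be already POS-tagged, and normalizing by the total number of n-grams in the text is an assumption about what "relative frequency" means here:

```python
from collections import Counter

def pos_ngrams(tags, n):
    """All n-grams of a POS-tag sequence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def ngram_features(corpus_tags, text_tags, max_n=4, top_k=40):
    """For n = 1..max_n: pick the top_k most frequent POS n-grams over the
    whole corpus, then report each one's frequency in a single text
    (relative to all of that text's n-grams -- an assumed normalization)."""
    features = []
    for n in range(1, max_n + 1):
        corpus_counts = Counter(g for doc in corpus_tags
                                for g in pos_ngrams(doc, n))
        selected = [g for g, _ in corpus_counts.most_common(top_k)]
        text_counts = Counter(pos_ngrams(text_tags, n))
        total = max(sum(text_counts.values()), 1)
        features.extend(text_counts[g] / total for g in selected)
    return features
```

The resulting fixed-length vector (up to 160 components for four values of n) can then be concatenated with other stylometric features before classification.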
Article
The paper presents the results of a study of modern text models aimed at identifying, on their basis, the semantic similarity of English-language texts. Determining the semantic similarity of texts is an important component of many areas of natural language processing: machine translation, information retrieval, question answering systems, and artificial intelligence in education. The authors solved the problem of classifying the proximity of student answers to the teacher's reference answer. For the study, they chose the neural network language models BERT and GPT, previously used to determine semantic similarity; the new neural network model Mamba; and stylometric features of the text. Experiments were carried out with two text corpora: the Text Similarity corpus from open sources and a custom corpus collected with the help of philologists. The quality of the problem solution was assessed by precision, recall, and F-measure. All neural network language models showed a similar F-measure of about 86% for the larger Text Similarity corpus and 50-56% for the custom corpus. A completely new result was the successful application of the Mamba model. However, the most interesting achievement was the use of vectors of stylometric features of the text, which showed an 80% F-measure for the custom corpus and the same quality as the neural network models for the other corpus.
Article
Full-text available
In recent years, the exponential growth of digital documents has been met by rapid progress in text classification techniques. Newly proposed machine learning algorithms leverage the latest advancements in deep learning methods, allowing for the automatic extraction of expressive features. The swift development of these methods has led to a plethora of strategies to encode natural language into machine-interpretable data. The latest language modelling algorithms are used in conjunction with ad hoc preprocessing procedures, of which the description is often omitted in favour of a more detailed explanation of the classification step. This paper offers a concise review of recent text classification models, with emphasis on the flow of data, from raw text to output labels. We highlight the differences between earlier methods and more recent, deep learning-based methods in both their functioning and in how they transform input data. To give a better perspective on the text classification landscape, we provide an overview of datasets for the English language, as well as supplying instructions for the synthesis of two new multilabel datasets, which we found to be particularly scarce in this setting. Finally, we provide an outline of new experimental results and discuss the open research challenges posed by deep learning-based language models.
Article
Full-text available
In automated essay scoring (AES) systems, similarity techniques are used to compute the score for student answers. Several methods to compute similarity have emerged over the years. However, only a few of them have been widely used in the AES domain. This work shows the findings of a ten-year review on similarity techniques applied in AES systems and discusses the efficiency and limitations of current methods. In the final review, thirty-four (34) articles published between 2010 and 2020 were included. The metrics used to evaluate the performance of the AES systems are also elaborated. The review was conducted using the Kitchenham method, whereby three research questions were formulated and a search strategy was developed. Research papers were chosen based on pre-defined inclusion and quality assessment criteria. This review has identified two types of similarity techniques used in AES systems. In addition, several methods were used to compute the score for student answers in the AES systems. The similarity computation in AES systems is dependent on several factors, hence many studies have combined multiple methods in a single system yielding good results. In addition, the review found that the quadratic weighted kappa (QWK) was most frequently used to evaluate AES systems.
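Since quadratic weighted kappa (QWK) is the evaluation metric this review found to be most frequently used for AES systems, a minimal stdlib implementation may help make it concrete. This is the standard rating-matrix formulation, not code from any of the reviewed systems:

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """QWK between two lists of integer ratings:
    1 - (weighted observed disagreement / weighted expected disagreement),
    with quadratic distance weights w_ij = (i - j)^2 / (n - 1)^2."""
    n = max_rating - min_rating + 1
    total = len(rater_a)
    # observed contingency matrix of rating pairs
    observed = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1
    # expected matrix from the outer product of the marginal histograms
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(col) for col in zip(*observed)]
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = ((i - j) ** 2) / ((n - 1) ** 2) if n > 1 else 0.0
            num += w * observed[i][j]
            den += w * hist_a[i] * hist_b[j] / total
    return 1.0 - (num / den if den else 0.0)
```

QWK rewards near-misses: disagreeing by one score band costs far less than disagreeing by three, which is why it suits ordinal essay scales better than plain accuracy.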
Article
Full-text available
This paper focuses on automatically assessing language proficiency levels according to linguistic complexity in learner English. We implement a supervised learning approach as part of an automatic essay scoring system. The objective is to uncover Common European Framework of Reference for Languages (CEFR) criterial features in writings by learners of English as a foreign language. Our method relies on the concept of microsystems with features related to learner-specific linguistic systems in which several forms operate paradigmatically. Results on internal data show that different microsystems help classify writings from A1 to C2 levels (82% balanced accuracy). Overall results on external data show that a combination of lexical, syntactic, cohesive and accuracy features yields the most efficient classification across several corpora (59.2% balanced accuracy).
Article
Full-text available
This article is dedicated to analyzing how combinations of stylometric characteristics of different levels affect the quality of authorship verification for Russian, English, and French prose texts. The research covered both low-level stylometric characteristics based on words and symbols and higher-level structural characteristics. All stylometric characteristics were calculated automatically with the help of the ProseRhythmDetector program, which made it possible to analyze works of large volume by many writers at the same time. Each text was represented by vectors of stylometric characteristics at the symbol, word, and structure levels. During the experiments, the sets of parameters of these three levels were combined with each other in all possible ways. The resulting vectors of stylometric characteristics were fed to various classifiers to perform verification and to identify the classifier best suited to the problem. The best results were obtained with the AdaBoost classifier: the average F-score across all languages exceeded 92%. Detailed verification quality assessments are given and analyzed for each author. The use of high-level stylometric characteristics, in particular the frequency of N-grams of POS tags, offers the prospect of a more detailed analysis of an author's style. The results of the experiments show that combining structure-level characteristics with word- and/or symbol-level characteristics yields the most accurate authorship verification for literary texts in Russian, English, and French. Additionally, the authors were able to conclude that stylometric characteristics have differing degrees of impact on the quality of authorship verification for different languages.
Article
Full-text available
Assessment in the education system plays a significant role in judging student performance. The present evaluation system relies on human assessment. As the student-to-teacher ratio gradually increases, the manual evaluation process becomes complicated; it is time-consuming, lacks reliability, and has other drawbacks. In this connection, online examination systems have evolved as an alternative to pen-and-paper methods. Present computer-based evaluation systems work only for multiple-choice questions; there is no proper evaluation system for grading essays and short answers. Many researchers have worked on automated essay grading and short answer scoring over the last few decades, but assessing an essay by considering all parameters, such as the relevance of the content to the prompt, development of ideas, cohesion, and coherence, remains a big challenge. A few researchers have focused on content-based evaluation, while many have addressed style-based assessment. This paper provides a systematic literature review of automated essay scoring systems. We studied the artificial intelligence and machine learning techniques used for automatic essay scoring and analyzed the limitations of current studies and research trends. We observed that essay evaluation is not done on the basis of content relevance and coherence.
Article
Full-text available
The article is devoted to a comparison of stylometric features of several levels, which are markers of the style of a prose text, and to an analysis of stylistic changes in Russian and British prose of the 19th-21st centuries. The stylometric features include low-level features based on words and symbols and high-level features based on rhythm. These features model the style of a text and indicate the time when the text was created. All features are calculated completely automatically, which makes it possible to conduct large-scale experiments with artworks of large volume and speeds up the work of a linguist. The ProseRhythmDetector program is used to calculate the stylometric features, including those based on the search results for rhythmic figures. As a result of its work, each text is represented as a set of features of three levels: characters, words, and rhythm. Texts are grouped by decades, and for each decade the average values of the stylometric features are found. The obtained models of decades are compared using standard similarity metrics, and the results of the comparison are visualized as heat maps and dendrograms. Experiments with the two corpora of Russian and British texts show that during the 19th-21st centuries there were general trends in style change common to both corpora, for example, a decrease in the number of rhythmic figures per sentence, as well as trends particular to each language, for example, the dynamics of change of word and sentence lengths. Stylometric features of all levels reveal the similarity in the style of texts published in the same century. Moreover, the three levels of features taken together demonstrate the uniqueness of each decade better than features of any single level. This study shows the importance of stylometric features as style markers of different eras and allows trends in style over several centuries to be identified.
Article
Importance. The article poses the important problem of using written digital texts as a means of teaching professional written communication in a non-linguistic university. The purpose of the research is to show how modern digital technologies can be used to effectively organize educational activities in the subject-language training of written communication for future specialists in the field of international business. Research Methods. The following research methods were used: general theoretical (analysis, synthesis, generalization, specification, the functional method) and practical (description, observation, surveys, conversations, the method of statistical analysis). Results and Discussion. Using internet resources (news sites, professional online magazines), various tasks are offered that contribute to the formation of effective written communication skills. Step-by-step activity develops discursive skills in the professional sphere, forms the skills of critically comprehending processed information, and builds the ability to create a secondary written text. Conclusion. Teaching students to work with digital text is one of the priority tasks of professional foreign language training today. Successful mastery of written internet communication allows students, working with digital resources, to form skills for diversifying the orientation of the profession and moving it towards the development of new digital practices.
Chapter
The automatic assessment of language learners' competences represents an increasingly promising task thanks to recent developments in NLP and deep learning technologies. In this paper, we propose the use of neural models for classifying English written exams into one of the CEFR competence levels. We employ pre-trained BERT models, which provide efficient and rapid language processing on account of attention-based mechanisms and the capacity to capture long-range sequence features. In particular, we investigate augmenting the original learner's text with corrections provided by an automatic tool or by human evaluators. We consider different architectures where the texts and corrections are combined at an early stage, via concatenation before the BERT network, or as a late fusion of the BERT embeddings. The proposed approach is evaluated on two open-source datasets: the EFCAMDAT and the CLC-FCE. The experimental results show that the proposed approach can predict the learner's competence level with remarkably high accuracy, in particular when large labelled corpora are available. In addition, we observed that augmenting the input text with corrections provides further improvement in the automatic language assessment task.
Chapter
Automated essay scoring (AES) is the task of assigning grades to essays. It can be applied to quality assessment as well as pricing of user-generated content. Previous works mainly consider using the prompt information for scoring. However, some prompts are highly abstract, making it hard to score the essay based only on the relevance between the essay and the prompt. To solve this problem, we design an auxiliary task in which a dynamic semantic matching block is introduced to capture hidden features with example-based learning. Besides, we provide a hierarchical model that can extract semantic features at both sentence level and document level. The holistic score is obtained from a weighted combination of the scores derived from the features above. Experimental results show that our model achieves higher Quadratic Weighted Kappa (QWK) scores on five of the eight prompts compared with previous methods on the ASAP dataset, which demonstrates the effectiveness of our model.
Problema avtomaticheskogo izmereniya slozhnyh konstruktov cherez otkrytye zadaniya
  • N V Galichev
  • P S Shirogorodskaya
N. Galichev and P. Shirogorodskaya, "Problema avtomaticheskogo izmereniya slozhnyh konstruktov cherez otkrytye zadaniya", in XXI Mezhdunarodnaya nauchno-prakticheskaya konferenciya molodyh issledovatelej obrazovaniya, in Russian, Novosibirskij gosudarstvennyj pedagogicheskij universitet, 2022, pp. 695-697.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  • J Devlin
  • M.-W Chang
  • K Lee
  • K Toutanova
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding", in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1, 2019, pp. 4171-4186.
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
  • V Sanh
  • L Debut
  • J Chaumond
  • T Wolf
V. Sanh, L. Debut, J. Chaumond, and T. Wolf, DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter, 2020. arXiv: 1910.01108 [cs.CL].