www.jsser.org
Journal of Social Studies Education Research
Sosyal Bilgiler Eğitimi Araştırmaları Dergisi
2017:8 (3), 238-248
Evaluating Text Complexity and Flesch-Kincaid Grade Level
Marina I. Solnyshkina¹, Radif R. Zamaletdinov², Ludmila A. Gorodetskaya³ & Azat I. Gabitov⁴
Abstract
The article presents the results of an exploratory study of the use of T.E.R.A., an automated tool measuring text complexity and readability based on the assessment of five text complexity parameters: narrativity, syntactic simplicity, word concreteness, referential cohesion and deep cohesion. Aiming to find ways to utilize T.E.R.A. for selecting texts with specific parameters, we selected eight academic texts with similar Flesch-Kincaid Grade Levels and contrasted their complexity parameter scores to find how specific parameters correlate with each other. In this article we demonstrate the correlations between text narrativity, word concreteness, the abstractness of the studied texts, and the Flesch-Kincaid Grade Level. We also confirm that the cohesion components do not correlate with the Flesch-Kincaid Grade Level. The findings indicate that the text parameters utilized in T.E.R.A. contribute to a better prediction of text characteristics than traditional readability formulas. The identified correlations between the values of the text complexity parameters are viewed as beneficial for developing a comprehensive approach to the selection of academic texts for a specific target audience.
Keywords: Text complexity, T.E.R.A., Syntactic simplicity, Narrativity, Readability, Text analysis.
Introduction
The modern linguistic paradigm, comprising the achievements of “psycholinguistics, discourse processes, and cognitive science” (McNamara et al., 2011), provides a theoretical foundation, empirical evidence, well-described practices, and automated tools to scale texts on multiple levels, including characteristics of words, syntax, referential cohesion, and deep cohesion. The scope of applications of such tools is enormous: from teaching practices to cognitive theories of reading and comprehension. One of these tools, T.E.R.A. (the Coh-Metrix Common Core Text Ease and Readability Assessor), an automated text processor developed in the early 2010s by a group of American scholars at the Science of Learning and Educational
¹ Prof., Kazan Federal University, Kazan, mesoln@yandex.ru
² Prof., Kazan Federal University, Kazan, director.ifmk@gmail.com
³ Prof., Lomonosov Moscow State University, Moscow, lgorodet@gmail.com
⁴ Postgraduate, Kazan Federal University, Kazan, gabit.azat@gmail.com
Technology (SoLET) Lab, directed by Dr. Danielle S. McNamara, has already been successfully applied in two Russian case studies conducted by A.S. Kiselnikov (Solnyshkina, Harkova & Kiselnikov, 2014). As that research shows, the tool is nevertheless under-used in modern Russian academic practices in general and in the area of teaching English as a foreign language in particular. Addressing this gap, we demonstrate how T.E.R.A. can be applied in academic practices and how a limited number of text parameters can be significant in selecting texts for academic purposes.
Methods
The data for the study were collected from the “Review” chapters marked A in Spotlight 11, a textbook approved by the Ministry of Education of the Russian Federation and recommended for English language teaching in the 11th grade of Russian public schools. All the texts compiled in the corpus are used to test students’ reading skills in the classroom. Their length varies from 323 words in Text 3A to 494 words in Text 7A, with a mean of 395 words. The readability of the selected texts falls within the scope of the target audience, i.e. Russian high school graduates, with Flesch-Kincaid Grade Levels ranging from 6.2 to 9.7 (see Table 1 below). We measured the complexity parameters of the eight selected texts with the help of T.E.R.A. and then contrasted the two texts with the highest and lowest scores on each complexity parameter to identify the correlation between that index and the Flesch-Kincaid Grade Level.
In addition to the Flesch-Kincaid Grade Level, T.E.R.A., available on a public website, computes five complexity parameters of texts: syntactic simplicity, abstractness/concreteness of words, narrativity, referential cohesion, and deep cohesion. T.E.R.A. thus provides detailed information on how logically connected a text is, which features make it more or less grammatically cohesive, and what the dependencies are between one part of the text and another. For each analyzed text the program assigns definite values, thus indicating the position of that particular text among the other texts assessed and stored in the database (T.E.R.A. Coh-Metrix Common Core Text Ease and Readability Assessor). A user can view texts and their complexity indices in the T.E.R.A. online library.
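As a point of reference for the grade levels reported below, the Flesch-Kincaid Grade Level combines words per sentence and syllables per word: 0.39 × (words/sentences) + 11.8 × (syllables/words) - 15.59. A minimal Python sketch of the formula follows; the syllable counter is a rough vowel-group heuristic of our own, not the counter T.E.R.A. itself uses:

```python
import re

def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59.
    Syllables are estimated by counting vowel groups, a common
    rough heuristic rather than a true phonetic count."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)

    def syllables(word):
        # Each maximal run of vowels counts as one syllable (min. 1).
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    total_syllables = sum(syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * total_syllables / len(words)
            - 15.59)

grade = flesch_kincaid_grade(
    "My sister and I heard a noise, so we woke up our dad, "
    "who called the police."
)
```

Applied to this short sentence from Text 3A, the sketch yields a grade of about 7, consistent with the mid-range levels in Table 1 below.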
Table 1
Complexity Parameters of Texts 1A - 8A

Text | Narrativity | Abstractness/Concreteness | Referential Cohesion | Deep Cohesion | Flesch-Kincaid Grade Level
1A   | 79%         | 36%                       | 39%                  | 81%           | 8.20
2A   | 77%         | 39%                       | 37%                  | 99%           | 7.40
3A   | 92%         | 70%                       | 40%                  | 74%           | 6.50
4A   | 69%         | 73%                       | 24%                  | 94%           | 7.30
5A   | 80%         | 78%                       | 13%                  | 94%           | 6.20
6A   | 75%         | 14%                       | 20%                  | 94%           | 9.70
7A   | 84%         | 33%                       | 9%                   | 95%           | 7.50
8A   | 30%         | 80%                       | 22%                  | 42%           | 9.50
According to McNamara and Graesser (2012), narrativity depends on the mean number of verbs per phrase, the presence of common words, and an overall story-like structure. To ensure high readability of a text, researchers recommend using a large number of dynamic verbs in a relatively small variety of tense forms, which makes the syntax of the sentences similar and reduces the number of words before the main verb. In texts with a high narrativity value, fewer unique nouns and more pronouns create similar combinations of sentences.
Syntactic simplicity of a text is measured on the basis of three parameters: the average number of clauses per sentence throughout the text, the number of words per sentence, and the number of words before the main verb of the main clause (McNamara & Graesser, 2012). Texts with fewer clauses, fewer words per sentence, and fewer words before the main verb have a higher syntactic simplicity value. The correlation of this parameter with the above-mentioned indices was verified in research pursued by a group of Russian scholars on the materials of the Unified State Exam in English (EGE), the matriculation exam in the educational system of the Russian Federation (Solnyshkina, Harkova & Kiselnikov, 2014).
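Of the three indices above, only words per sentence can be computed directly from the surface string; a hypothetical sketch is given below, with clause counts and words-before-the-main-verb omitted because they require a syntactic parser:

```python
import re

def words_per_sentence(text):
    """Average sentence length in words, one of the surface proxies
    behind syntactic simplicity. The other two T.E.R.A. parameters
    (clauses per sentence, words before the main verb) need a
    syntactic parser and are not attempted here."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    return sum(counts) / len(counts)

avg = words_per_sentence(
    "Burglars recently broke into our house while we were sleeping "
    "upstairs! My sister and I heard a noise, so we woke up our dad, "
    "who called the police."
)
```

For the two Text 3A sentences above (11 and 17 words), the sketch returns an average of 14.0.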
Abstractness/concreteness of words, as the name suggests, shows the proportion of concrete to abstract words in a text (McNamara & Graesser, 2012; Waters & Russell, 2016). When assessing a text’s abstractness/concreteness, T.E.R.A. does not provide an instrument to verify the abstractness/concreteness of individual words. However, its developers refer potential inquirers to the Medical Research Council (MRC) Psycholinguistic Database, containing 150,837 words with 26 specific linguistic and psycholinguistic attributes (Brysbaert, Warriner & Kuperman, 2014; MRC Psycholinguistic Database; Erbilgin, 2017; Tarman & Baytak, 2012). The scores are derived from human judgments of word properties such as concreteness, familiarity, imageability, meaningfulness, and age of acquisition (MRC Psycholinguistic Database). The resource assigns each word a rank in a list of ‘less’ or ‘more’ concrete/abstract words. As the tool assesses word family tokens only and neglects the context of a word, the MRC Psycholinguistic Database, as its developers and researchers admit, “is not without limitations” (McNamara & Graesser, 2012).
Referential cohesion is a measure of the overlap between words in the text, formed with the help of similar words and the ideas they transmit (McCarthy et al., 2006). When sentences and paragraphs share similar words or ideas, it is easier for the reader to establish logical connections between them. If a text is cohesive, its ideas overlap, providing the reader with explicit threads connecting the parts of the text. In adjacent sentences these threads are manifested by co-referencing words, anaphora, similar morphological roots, etc. For example, in Text 1A we find repetitions of the word child and semantic overlap in the pairs country – China, child – family, an only child – one child: “I am an only child because, in 1979, the government in my country introduced a one-child-per-family policy to control China's population explosion” (Text 1A).
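The word-overlap intuition behind referential cohesion can be illustrated with a toy sketch; the stopword list and sentences are illustrative assumptions, and T.E.R.A.'s actual measure also covers stem and anaphoric overlap, which this version omits:

```python
import re

# Illustrative stopword list (an assumption, not Coh-Metrix's own).
STOPWORDS = {"i", "am", "an", "the", "in", "my", "a", "to", "of",
             "because", "so", "we", "he", "it", "is", "was"}

def content_words(sentence):
    # Lowercase tokens minus stopwords approximate "content words".
    return {w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS}

def adjacent_overlap(sentences):
    # Shared content words between each pair of adjacent sentences.
    pairs = zip(sentences, sentences[1:])
    return [sorted(content_words(a) & content_words(b)) for a, b in pairs]

overlap = adjacent_overlap([
    "I am an only child.",
    "The government introduced a one-child policy to control the population.",
])
```

Here the two sentences share only the word child, the same repetition noted in Text 1A above.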
Deep cohesion reflects the degree of logical connection between sentences; in this case it is revealed by measuring the different types of words that connect parts of the text (McNamara & Graesser, 2012). There are different types of connectives: temporal, causal, additive, and logical. Examples of such words are after, before, during, later, additionally, moreover, and or. These elements of the text help to link together its events, ideas and information, shaping the reader's perception. For example: “The good news, however, is that you CAN deal with stress before it gets out of hand! So, take control and REMEMBER YOUR A-B-Cs.” (Text 2A).
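A toy counter for connectives of the kinds listed above is sketched below; the connective inventories are small illustrative samples, not the lists Coh-Metrix actually uses:

```python
import re

# Small illustrative samples of connective inventories (assumptions).
CONNECTIVES = {
    "temporal": {"after", "before", "during", "later", "then", "while"},
    "causal": {"because", "so", "therefore", "since"},
    "additive": {"additionally", "moreover", "also", "and"},
}

def count_connectives(text):
    # Count occurrences of each connective type in the token stream.
    tokens = re.findall(r"[a-z]+", text.lower())
    return {kind: sum(tokens.count(w) for w in words)
            for kind, words in CONNECTIVES.items()}

counts = count_connectives(
    "So, take control before it gets out of hand, and remember your ABCs."
)
```

For the Text 2A sentence above, the sketch finds one temporal (before), one causal (so) and one additive (and) connective.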
We also utilized the online tool Text Inspector to measure the lexical diversity of every text studied. Lexical diversity is viewed by the authors as “the range of different words used in a text” (McCarthy & Jarvis, 2010). Text Inspector assesses VOCD (or HD-D) and MTLD. As the texts in the corpus studied are of about the same length, i.e. about 400 words, their lexical diversity metrics are viewed as reliable and not sensitive to text length. The Lexical Diversity tool used by Text Inspector is “based on the Perl modules for measuring MTLD and voc-d developed by Aris Xanthos” (Text Inspector). “MTLD is performed two times, once in left-to-right text order and once in right-to-left text order. Each pass yields a weighted average (and variance), and the two averages are in turn averaged to get the value that is finally reported (the two variances are also averaged). This attribute indicates whether the reported average should itself be weighted according to the potentially different number of observations in the two passes (value ‘within_and_between’), or not (value ‘within_only’)” (Text Inspector). The VOCD method implies random selection of “35, 36, …, 49, and 50 tokens from the data, then computing the average type-token ratio for each of these lengths, and finding the curve that best fits the type-token ratio curve just produced <…>. The parameter value corresponding to the best-fitting curve is reported as the result of diversity measurement. The whole procedure can be repeated several times and averaged” (Text Inspector). The lexical diversity of Text 6A (393 words) computed with Text Inspector is 134.75 (VOCD) and 116.61 (MTLD), which is viewed as relatively high (Text Inspector).
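The MTLD procedure quoted above can be sketched compactly, assuming the commonly cited type-token-ratio threshold of 0.72; Text Inspector's Perl implementation differs in details such as how the two passes are weighted:

```python
def mtld_one_pass(tokens, threshold=0.72):
    """One pass of MTLD: count the 'factors', i.e. segments over which
    the running type-token ratio stays above the threshold, then divide
    the token count by the factor count."""
    factors, types, count = 0.0, set(), 0
    for token in tokens:
        count += 1
        types.add(token)
        if len(types) / count <= threshold:
            factors += 1          # segment complete; start a new one
            types, count = set(), 0
    if count:                     # credit a partial factor for leftovers
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - threshold)
    return len(tokens) / factors if factors else 0.0

def mtld(tokens):
    # As quoted above, MTLD is run left-to-right and right-to-left
    # and the two passes are averaged.
    return (mtld_one_pass(tokens) + mtld_one_pass(tokens[::-1])) / 2

diversity = mtld("the cat sat on the mat and the dog sat too".split())
```

Higher values mean the text sustains a high type-token ratio for longer stretches, i.e. greater lexical diversity.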
Results
To determine the impact of each of the parameters computed by T.E.R.A. on the Flesch-Kincaid Grade Level and to identify correlations between Coh-Metrix variables, we measured the indices of the eight texts from Spotlight 11 (2009) and contrasted the vocabulary and grammar of the texts with the minimum and maximum values of narrativity, syntactic simplicity, word concreteness, and referential and deep cohesion. The results of T.E.R.A. processing are presented in Table 1.
It was decided to exclude Text 8A from further analysis on the grounds that its narrativity score (30%) is less than half those of the other texts (69%–92%), which might lead to a considerable bias in the research outcomes. Text 8A portrays four sights and is mostly descriptive. Consider an excerpt from Text 8A: “Otherwise known as ‘The Lost City of the Incas’, Machu Picchu is an ancient Incan city located almost 2,500 metres above sea level in the Andes Mountains in Peru. Machu Picchu is invisible from below” (Spotlight 11, Text 8A). As the example shows, the author uses mostly stative verbs (know, be, etc.), in contrast to Text 3A, which has the highest narrativity index in the corpus of the texts studied (92%) and in which the verbs used are mostly dynamic: arrived, gone, checking, had taken, reported, caught. Its sentences are short and easy to understand: “Burglars recently broke into our house while we were sleeping upstairs! My sister and I heard a noise, so we woke up our dad, who called the police” (Spotlight 11, Text 3A). The genre is also reflected in the Concreteness/Abstractness and Deep Cohesion indices: all narrative texts prove to be more concrete and cohesive than the contrasted descriptive text. Both the Deep Cohesion (42%) and Referential Cohesion (22%) indices of Text 8A are significantly lower than the corresponding parameters of all other texts (Table 1 above).
T.E.R.A. also discriminated between texts which were otherwise similar but had different Syntactic Simplicity scores. Syntactic Simplicity in Text 1A and Text 2A differs significantly, at 34% and 65% respectively. The corresponding Flesch-Kincaid Grade Levels differ by 1.2 and Deep Cohesion by 17%, while the rest of the parameters differ by only 2–3%. Text 1A, presenting the theme of family, serves as a good example of a low Syntactic Simplicity score. It contains simple syntactic structures: 27 of its sentences are in the Present Simple tense, and there are no participial or gerundial constructions. Its lexical diversity is only 91.66 (VOCD) and 84.80 (MTLD). All of this makes the text less challenging for the reader to process than Text 2A, which is at the opposite end of the continuum, with 30 infinitives, 10 gerundial constructions, 7 verbs in the Present Simple tense, and five past participles. Cf. “In stressful situations, the nervous system causes muscles to tense, breathing to become shallow and adrenaline to be released into your bloodstream as your body gets ready to beat challenges with focus and strength” (Spotlight 11, Text 2A). The lexical diversity of Text 2A is also much higher than that of Text 1A: 101.26 (VOCD) and 100.80 (MTLD). Thus, we may provisionally conclude that Syntactic Simplicity does not correlate strongly with the Flesch-Kincaid Grade Level.
The texts chosen for the contrastive analysis of Word Concreteness are Text 5A and Text 6A, with Flesch-Kincaid Grade Levels of 6.20 and 9.70, respectively. These two texts have radically different Flesch-Kincaid Grade Levels (a 3.5-grade difference) but similar scores for Narrativity, Syntactic Simplicity and Deep Cohesion. The critical difference lies in the Concreteness/Abstractness of the words, with values of 78% and 14% for Texts 5A and 6A, respectively. A low word concreteness value indicates the presence of a large number of abstract words in Text 6A. As the theme of Text 6A is the study of alien activities, it contains specific vocabulary: civilization, intelligent life, signal, screensaver, etc. The vocabulary of Text 5A, which portrays the life of homeless people, consists predominantly of concrete nouns: benches, doorways, houses, hostel, room, streets, etc. Thus, it is obvious that it is mostly the concreteness of Text 5A that decreases its Flesch-Kincaid Grade Level.
Referential Cohesion demonstrates a spike of 40% in Text 3A and falls to 9% in Text 7A. The indices of Narrativity and Syntactic Simplicity fluctuate within a narrow range of 8–9%, while Concreteness/Abstractness is distinctively diverse, with 70% in Text 3A and 33% in Text 7A. The statistics also show little relation between the Flesch-Kincaid Grade Level and Referential Cohesion (see Table 1 above).
As lexical diversity has been shown to be inversely proportional to cohesion (McNamara & Graesser, 2012), we also computed the lexical diversity of Texts 7A and 3A. Text Inspector measures the lexical diversity of Text 7A at 145.56 (VOCD) and that of Text 3A at only 92.48 (VOCD). Based on these scores we can assume that Text 3A contains more words and ideas that overlap across adjacent sentences and the entire text, while Text 7A contains fewer explicit threads connecting the text for the reader. Cf.: “Fortunately, I was able to identify the mugger from a photo at the police station. He was a well-known criminal in the area, so the police knew where to find him. Anyway, he confessed to the crime, the police arrested him” (Text 3A). As we can see, the connections between the ideas are made with the help of thematic similarity (the mugger – a criminal – a crime – arrested), repetition (the police), substitution (the mugger – he – him – he – him), and derivatives (criminal – crime). Referential cohesion in Text 7A is low due to the lack of lexical and semantic overlap. Cf.: “Believe you can climb that mountain, swim that ocean or reach that place, and surely one day you will. There would be no Ford cars, Star Wars, light bulbs or Beethoven symphonies if this was not true!” (Text 7A). Thus, Text 7A is more challenging for the reader, especially for a non-native speaker. The counterbalance which levels up the Flesch-Kincaid Grade Levels of Texts 3A and 7A is Word Concreteness, which is much higher in Text 3A (see Table 1 above).
The texts demonstrating distinctively different Deep Cohesion are Text 2A and Text 8A, which, judging from the statistics in Table 1, also differ in the following characteristics: narrativity, syntactic simplicity, word concreteness and referential cohesion. The Deep Cohesion of Text 2A is extremely high (99%), which means that the text's connections are very dense. It contains 17 temporal, 3 causal and 7 intentional connectives, while Text 8A incorporates 3 temporal, 2 causal and 0 intentional connectives (Gabitov & Ilyasova, 2016). At this stage of the research it is difficult to explain all the correlations between the parameters, but it is obvious that deep cohesion correlates very little with the Flesch-Kincaid Grade Level.
Discussion
The analysis has showna wide range of possibilities which T.E.R.A. provides for
assessing text complexity parameters and their interrelations. By assessing complexity
Journal of Social Studies Education Research 2017: 8 (3),238-248
parameters it discriminated Text 8A from the rest of the texts studied as a text of different genre:
as a descriptive text Text 8A demonstrated much lower narrativity score than all the other in the
continuum.The question of this text appropriateness as the final reading text in the textbook,
though being urgent, is beyond the scope of this paper.
T.E.R.A. also assesses the syntactic simplicity of a text, thus providing the user with an instrument to measure three different syntactic indices: the number of clauses, the number of words in a sentence, and the number of words before the main verb. The results of this study confirm that syntactic simplicity measured with T.E.R.A. does not correlate strongly with the Flesch-Kincaid Grade Level. However, the research demonstrated a strong correlation between text concreteness computed with T.E.R.A. and the Flesch-Kincaid Grade Level: with all other complexity parameters of two texts being similar, it is word concreteness that shapes the grade level score. As for the referential cohesion and deep cohesion scores assessed with T.E.R.A., they go beyond traditional readability formulas, including the Flesch-Kincaid Grade Level, i.e. they do not correlate with the latter. Two other phenomena discovered are the following: the Referential Cohesion score of all narrative texts in the corpus is below 40%, with a mean of 26%, while the Deep Cohesion score is above 74%, with a mean of 90%.
The complexity parameters measured with T.E.R.A. and the elicited interdependences between them and the Flesch-Kincaid Grade Level provide a good foundation for educators to elaborate an extensive approach to the selection of reading texts for the academic purposes of different groups of students (Readability Formulas). Several authors have proposed different metric sets to assess similarity and dissimilarity in text complexity, such as adjectives per sentence, nouns per sentence, frequency of content words, etc., that can successfully rank academic texts for different age and grade levels (Solovyev, Ivanov & Solnyshkina, 2017).
Conclusion
T.E.R.A. analyses of the text complexity values demonstrated that (1) the narrativity of the texts studied tends to be in inverse ratio to deep cohesion and directly proportional to word concreteness; (2) the concreteness of the studied texts displays a strong correlation with the Flesch-Kincaid Grade Level and the potential to decrease the latter; (3) syntactic simplicity does not demonstrate much interdependence with the Flesch-Kincaid Grade Level; and (4) the cohesion components, i.e. the referential cohesion and deep cohesion indices, do not correlate with the Flesch-Kincaid Grade Level. The identified correlations between the text parameter values computed by T.E.R.A. are viewed by the authors as beneficial for designing an algorithm to select and modify texts so that they correspond to the cognitive and linguistic level of the target readers.
References
Brysbaert, M., Warriner, A.B. & Kuperman, V. (2014). Concreteness ratings for 40 thousand
generally known English word lemmas. Behavior Research Methods, 46: 904.
https://doi.org/10.3758/s13428-013-0403-5.
McNamara, D.S., Graesser, A.C., Cai, Z. & Kulikowich, J.M. (2011). Coh-Metrix
Easability Components: Aligning Text Difficulty with Theories of Text Comprehension.
AERA. Retrieved from https://www.researchgate.net/publication/228455723_Coh-
Metrix_Easability_Components_Aligning_Text_Difficulty_with_Theories_of_Text_Comp
rehension.
Erbilgin, E. (2017). A comparison of the mathematical processes embedded in the content
standards of Turkey and Singapore. Research in Social Sciences and Technology, 2(1): 53-
74.
Gabitov, A.I. & Ilyasova, L.G. (2016). Use of automated instruments of text analysis to provide
proper difficulty level of English language educational materials. Problems of Modern
Pedagogical Education: Pedagogy and Psychology, 53(3): 101-108.
McCarthy, P.M. & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of
sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42:
381. https://doi.org/10.3758/BRM.42.2.381
McCarthy, Ph.M., Lightman, E.J., Dufty, D.F. & McNamara, D.S. (2006). Using Coh-Metrix to
assess distributions of cohesion and difficulty: An investigation of the structure of high-
school textbooks. In: Proceedings of the 28th Annual Conference of the Cognitive Science
Society (190-195). Mahwah: Erlbaum.
McNamara, D.S. & Graesser, A.C. (2012). Coh-Metrix: An automated tool for theoretical and
applied natural language processing. In: Applied natural language processing and content
analysis: Identification, investigation, and resolution (188-205). Hershey, PA: IGI Global.
MRC Psycholinguistic Database. Retrieved from
http://websites.psychology.uwa.edu.au/school/MRCDatabase/uwa_mrc.htm.
Readability Formulas. Free readability tools to check for Reading Levels, Reading Assessment,
and Reading Grade Levels. Retrieved from http://www.readabilityformulas.com.
Solnyshkina, M.I., Harkova, E.V. & Kiselnikov, A.S. (2014). Comparative Coh-Metrix Analysis
of Reading Comprehension Texts: Unified (Russian) State Exam in English vs Cambridge
First Certificate in English. English Language Teaching, 7(12): 65-76.
Solovyev, V., Ivanov, V. & Solnyshkina, M. (2017). Assessment of Reading Difficulty Levels in
Russian Academic Texts: Approaches and Metrics. In Press.
Tarman, B. & Baytak, A. (2012). Children’s online language learning: A constructionist
perspective. Energy Education Science and Technology Part B: Social and Educational
Studies, 4(2): 875-882.
T.E.R.A. Coh-Metrix Common Core Text Ease and Readability Assessor. Retrieved from
http://129.219.222.66:8084/Coh-Metrix.aspx.
Text Inspector. Retrieved from http://textinspector.com/workflow/B3021C1A-706A-11E7-B233-
AB44AFCE53D3.
Waters, S. & Russell, W.B. (2016). Virtually Ready? Pre-service teachers’ perceptions of a
virtual internship experience. Research in Social Sciences and Technology, 1(1): 1-23.
... It shows promise in benchmarking children's text-difficulty ability levels more accurately, thus allowing them to read texts at target readability levels. These formulas typically result in an absolute score or a grade level that indicates the level of text an average reader in that grade is expected to be able to read and understand successfully (Kincaid et al., 1975;Solnyshkina et al., 2017). For example, one of the most well-known readability formulas, the Flesch-Kincaid grade level formula (Kincaid et al., 1975), (0.39 × the average number of words used per sentence) + (11.8 × the average number of syllables per word) -15.59, is designed to result in a grade level that indicates a text's readability. ...
... These formulas usually consider a few linguistic features, however, research has shown that features relating to word, syntax, and discourse levels significantly affect text comprehension in various languages, such as English and Chinese (Crossley et al., 2019;Liu et al., 2024;Pinney et al., 2024;Solnyshkina et al., 2017). At the word level, word length, i.e., the number of characters per word, is a key indicator of text readability. ...
Article
Full-text available
Introduction: Readability formulas are crucial for identifying suitable texts for children's reading development. Traditional formulas, however, are linear models designed for alphabetic languages and struggle with numerous predictors. Purpose: To develop advanced readability formulas for Chinese texts using machine-learning algorithms that can handle hundreds of predictors. It is also the first readability formula developed in Hong Kong. Method: The corpus comprised 723 texts from 72 Chinese language arts textbooks used in public primary schools. The study considered 274 linguistic features at the character, word, syntax, and discourse levels as predictor variables. The outcome variables were the publisher-assigned semester scale and the teacher-rated readability level. Fifteen combinations of linguistic features were trained using Support Vector Machine (SVM) and Random Forest (RF) algorithms. Model performance was evaluated by prediction accuracy and the mean absolute error between predicted and actual readability. For both publisher-assigned and teacher-rated readability, the all-level-feature-RF and character-level-feature-RF models performed the best. The top 10 predictive features of the two optimal models were analyzed. Results: Among the publisher-assigned and subjective readability measures, the all-RF and character-RF models performed the best. The feature importance analyses of these two optimal models highlight the significance of character learning sequences, character frequency, and word frequency in estimating text readability in the Chinese context of Hong Kong. In addition, the findings suggest that publishers might rely on diverse information sources to assign semesters, whereas teachers likely prefer to utilize indices that can be directly derived from the texts themselves to gauge readability levels. 
Conclusion: The findings highlight the importance of character-level features, particularly the timing of a character's introduction in the textbook, in predicting text readability in the Hong Kong Chinese context.
... The Flesch-Kincaid Grade Level (FKG) system [42] was used to assess the readability of content produced by the LLMs. The FKG level is a readability test designed to indicate how difficult a text is to understand. ...
... It calculates the grade level required for someone to comprehend the text. The FKG is based on word length and sentence length, providing a numerical score that corresponds to US grade levels [42]. The National Institutes of Health (NIH) and the American Medical Association (AMA) suggest that patient education materials should be written at a reading level no higher than the sixth grade [43]. ...
Article
Full-text available
Background Cancer survivors and their caregivers, particularly those from disadvantaged backgrounds with limited health literacy or racial and ethnic minorities facing language barriers, are at a disproportionately higher risk of experiencing symptom burdens from cancer and its treatments. Large language models (LLMs) offer a promising avenue for generating concise, linguistically appropriate, and accessible educational materials tailored to these populations. However, there is limited research evaluating how effectively LLMs perform in creating targeted content for individuals with diverse literacy and language needs. Objective This study aimed to evaluate the overall performance of LLMs in generating tailored educational content for cancer survivors and their caregivers with limited health literacy or language barriers, compare the performances of 3 Generative Pretrained Transformer (GPT) models (ie, GPT-3.5 Turbo, GPT-4, and GPT-4 Turbo; OpenAI), and examine how different prompting approaches influence the quality of the generated content. Methods We selected 30 topics from national guidelines on cancer care and education. GPT-3.5 Turbo, GPT-4, and GPT-4 Turbo were used to generate tailored content of up to 250 words at a 6th-grade reading level, with translations into Spanish and Chinese for each topic. Two distinct prompting approaches (textual and bulleted) were applied and evaluated. Nine oncology experts evaluated 360 generated responses based on predetermined criteria: word limit, reading level, and quality assessment (ie, clarity, accuracy, relevance, completeness, and comprehensibility). ANOVA (analysis of variance) or chi-square analyses were used to compare differences among the various GPT models and prompts. Results Overall, LLMs showed excellent performance in tailoring educational content, with 74.2% (267/360) adhering to the specified word limit and achieving an average quality assessment score of 8.933 out of 10. 
However, LLMs showed moderate performance in reading level, with 41.1% (148/360) of content failing to meet the sixth-grade reading level. LLMs demonstrated strong translation capabilities, achieving an accuracy of 96.7% (87/90) for Spanish and 81.1% (73/90) for Chinese translations. Common errors included imprecise scopes, inaccuracies in definitions, and content that lacked actionable recommendations. The more advanced GPT-4 family models showed better overall performance compared to GPT-3.5 Turbo. Prompting GPTs to produce bulleted-format content was likely to result in better educational content compared with textual-format content. Conclusions All 3 LLMs demonstrated high potential for delivering multilingual, concise, and low health literacy educational content for cancer survivors and caregivers who face limited literacy or language barriers. GPT-4 family models were notably more robust. While further refinement is required to ensure simpler reading levels and fully comprehensive information, these findings highlight LLMs as an emerging tool for bridging gaps in cancer education and advancing health equity. Future research should integrate expert feedback, additional prompt engineering strategies, and specialized training data to optimize content accuracy and accessibility.
... This study aims to evaluate the complexity and accessibility of the language used in political manifestos, with the objective of determining whether the text is suitable for the average Indian voter. To achieve this, the study employed two widely recognized readability formulas: the Flesch Reading Ease Score and the Flesch-Kincaid Grade Level Score (Solnyshkina et al., 2017). These formulas provide quantitative measures of the readability of each manifesto available in English text, allowing for an assessment of how easily the general public can comprehend the promises and commitments outlined by the political parties. ...
Article
Full-text available
This study evaluates the readability of 2024 Lok Sabha election manifestos from India's three major political parties: the Bharatiya Janata Party (BJP), the Indian National Congress (INC), and the Communist Party of India (Marxist), or CPI(M). Utilizing the Flesch Reading Ease and Flesch-Kincaid Grade Level formulas, the analysis assesses the accessibility of the welfare, development and infrastructure, and national security sections within these manifestos. Findings reveal that all manifestos exhibit moderate to high complexity levels, potentially limiting comprehension among average voters. The BJP's manifesto is consistently the most complex across all sections, while the INC and CPI(M) present relatively more accessible texts, making the approach of their election manifestos less populist. The study underscores the need to simplify political communication to enhance democratic engagement.
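Both formulas used in studies like the one above combine the same two statistics: average sentence length and average syllables per word, weighted by the standard published coefficients. A minimal sketch of how they could be computed follows; the regex tokenizer and vowel-group syllable counter are rough approximations for illustration, not part of any cited study's method:

```python
import re

def count_syllables(word: str) -> int:
    """Naive estimate: count groups of consecutive vowels (min 1)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_scores(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # mean words per sentence
    spw = syllables / len(words)        # mean syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return round(fre, 2), round(fkgl, 2)
```

Short, monosyllabic sentences score high on Reading Ease and low (even negative) on Grade Level, which is why the manifesto sections with long sentences and polysyllabic vocabulary register as college-level text.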
... This makes it an essential tool for writers, educators, and content creators aiming to match their material to the appropriate reading level for their audience. (Solnyshkina et al., 2017) ...
Thesis
Full-text available
The research analyzes the readability of case study questions in five first-semester MPA subjects (FPA, DMI, HRM, PF, and LSG) and of the MA textbook. Using the Flesch-Kincaid Grade Level, Gunning Fog Index, and Automated Readability Index (ARI), the research finds a significant difference in complexity between the textbook and the case study questions. The findings disclose that the textbook is the easiest to read, with readability levels suitable for readers at approximately the secondary level. The case study questions are considerably more difficult, often requiring college- or postgraduate-level proficiency and posing potential challenges to student comprehension. Among the subjects, DMI emerges as the most challenging, with the highest scores across all readability metrics, whereas PF is the most readable of the case studies, although still more complex than the textbook. HRM and LSG are moderately difficult, while FPA, though less complex than DMI, remains difficult due to complex sentence structures.
... We used several news articles to evaluate the impact on CT and propaganda awareness. The articles varied in topic and content but were comparable in size and complexity, as given by the Flesch-Kincaid Grade Level (Article 1: 12.0, Article 2: 12.3, Article 3: 12.2) [35]. ...
Preprint
Full-text available
In today's media landscape, propaganda distribution has a significant impact on society. It sows confusion, undermines democratic processes, and makes decision-making increasingly difficult for news readers. We investigate the lasting effect of a propaganda detection and contextualization tool on readers' critical thinking and propaganda awareness. Building on inoculation theory, which suggests that preemptively exposing individuals to weakened forms of propaganda can improve their resilience against it, we integrate Kahneman's dual-system theory to measure the tool's impact on critical thinking. Through a two-phase online experiment, we measure the effect of several inoculation doses. Our findings show that while the tool increases critical thinking during its use, this increase vanishes without access to the tool, indicating that a single use does not create a lasting impact. We discuss the implications and propose possible approaches to improve long-term resilience against propaganda.
... In so doing, a "robustness test" was applied for the model. In the literature, a significant number of prior studies examine the impact of CG structures on RDB by considering the "Flesch-Kincaid Grade Level" (FKG) (Huong Dau et al., 2024; Worrall et al., 2020; Solnyshkina et al., 2017). Hence, as a "test of robustness", the impact of CG structures on RDB was assessed by replacing the "Flesch Reading Ease" score with the FKG. ...
Article
Full-text available
The readability (RDB) of annual reports (ARs) plays a crucial role in determining the effectiveness of disclosure of information to interested parties, particularly investors. Given that investors rely on the financial information provided in ARs, the chairman’s letter serves as a key communication tool and is the most extensively read section of the report. Consequently, companies are under pressure to provide understandable ARs that can be easily interpreted by investors. Nevertheless, managers sometimes obscure such disclosures in an attempt to bury negative information and hide their own behavior. Drawing from the “managerial obfuscation hypothesis”, this study investigated how corporate governance (CG) structures affect the RDB of ARs for a sample of 95 banks across seven countries in the MENA region from 2018 to 2022. The findings revealed that board size, frequency of board meetings, and ownership concentration significantly affected the RDB of ARs. Additionally, board independence and gender diversity had a significant negative effect on ARs’ RDB. Conversely, the study found that the presence of role duality within the board had an insignificant effect on ARs’ RDB. As a result, this study recommends strengthening CG structures to improve the clarity of banks’ reports and boost investor trust.
Article
Background Patients dealing with sensitive issues like penile enlargement (PE) might benefit from YouTube videos. Therefore, it is essential that the textual content of these videos is clear, trustworthy, and of high quality. Aim Are the quality and comprehensibility of AI-transcribed texts of YouTube videos about PE sufficient and suitable for patients? Methods On October 25, 2024, Google Trends analysis identified the 25 most searched phrases for “Penile enlargement.” Non-related terms were excluded, and the relevant keywords were searched on YouTube. Only content about PE was included, excluding duplicates, non-English videos, YouTube Shorts, videos under 30 seconds, and silent or music-only videos. Videos were transcribed using Whisper AI; their quality was assessed by M.F.Ş., E.C.T., and Ç.D. using the GQS (global quality scale) and DISCERN, and readability was evaluated via the Flesch–Kincaid (FKGL and FKRE) measures. High assessor agreement was noted (Pearson r = 0.912). Videos were categorized by uploader, and metrics such as views, likes, comments, and duration were recorded. The Chi-square test was used for categorical variable comparisons; the Kruskal-Wallis H-Test was applied when normality and homoscedasticity were not met, with Bonferroni post hoc correction for multiple comparisons. Outcomes The mean DISCERN and GQS scores were 51.23 ± 13.1 and 3.32 ± 0.9, respectively. FKRE and FKGL scores were 73.12 ± 11.7 and 5.85 ± 2.1. Physicians (n = 67) produced the most videos, while academic institutions (n = 2) produced the least. No significant differences in text quality were found between groups (P = 0.067 and P = 0.051). Health-related websites exhibited lower FKRE compared to non-healthcare videos (P = 0.002), with a significant difference in FKGL as well (P = 0.019). Results The videos exhibited a high level of readability (indicating comprehensibility for roughly a 6th-grade student).
Text quality, view counts, and like counts were highest for videos uploaded by academic institutions. Clinical Implications According to our research, the health information in YouTube videos about PE needs to be of better quality and more trustworthy. The language used in videos should be easier to understand. Strengths and Limitations This study is the first scientific analysis of YouTube video transcripts on PE using AI, focusing specifically on English content, which limits its applicability to non-English speakers and other platforms. Excluding silent and shorter videos may omit valuable information. Conclusion Better quality and trustworthiness in health-related YouTube information, especially on PE, are essential. Content makers should stress clear, accessible language and minimize disinformation.
Article
Full-text available
Retrieval-Augmented Generation (RAG) overcomes the main barrier for the adoption of LLM-based chatbots in education: hallucinations. The uncomplicated architecture of RAG chatbots makes it relatively easy to implement chatbots that serve specific purposes and thus are capable of addressing various needs in the educational domain. With five years having passed since the introduction of RAG, the time has come to check the progress attained in its adoption in education. This paper identifies 47 papers dedicated to RAG chatbots’ uses for various kinds of educational purposes, which are analyzed in terms of their character, the target of the support provided by the chatbots, the thematic scope of the knowledge accessible via the chatbots, the underlying large language model, and the character of their evaluation.
Article
Purpose Most mental health providers have yet to adopt progress monitoring and outcome assessment (PMOA) measures. Although a variety of explanations have been proposed in the literature, a key reason is the burden of time and effort necessary for clients and clinicians to complete, interpret and apply the results of PMOA measures. This evaluation explores the feasibility and initial results of employing ChatGPT to analyse clinicians' unstructured session progress notes for PMOA. Methods Using a simulated patient with 17 trainee therapists, the study examined whether artificial intelligence (AI) can assist in generating thematic summaries relevant to clinical progress and outcomes. Therapists' session summaries were combined to evaluate the continuation of key clinical themes across four sessions for a simulated patient. Trainees also provided brief quantitative ratings per session about the patient's working alliance, negative affect (NA), avoidance of NA and levels of distress. Results AI‐generated results found (a) a persistent focus across sessions regarding the patient's relationship issues with an abusive caretaker, reluctance to disclose and avoidance of NA, and (b) substantial convergence between human‐generated and AI‐generated thematic summaries. Discussion Overall, the use of AI to analyse clinical progress notes appears feasible and psychometrically sound. By minimising resources needed by patients and clinicians to produce clinically relevant data, an AI‐augmented approach can reduce a major obstacle to clinicians' adoption of PMOA measures for feedback purposes.
Article
Full-text available
This study compares Turkey's and Singapore's mathematics content standards in terms of the highlighted mathematical processes. A mathematical processes framework was employed to analyze the content standards, drawing on the standards for mathematical practice defined by the Common Core State Standards for Mathematics. The standards for mathematical practice include: make sense of problems and persevere in solving them, reason abstractly and quantitatively, construct viable arguments and critique the reasoning of others, model with mathematics, use appropriate tools strategically, attend to precision, look for and make use of structure, and look for and express regularity in repeated reasoning. The data sources are the 2013 mathematics curriculum standards of Turkey and the 2013 mathematics syllabus of Singapore for grades 7 and 8. Data analysis revealed that the two countries reflected mathematical processes differently in their content standards. Some mathematical processes are not identified in Turkey's content standards, while all mathematical processes are observed in Singapore's content standards. The distribution of the observed mathematical processes is also different in the two countries. Suggestions for future content standards revisions are shared in the paper.
Article
Full-text available
The purpose of this phenomenological study was to understand the experiences of six secondary pre-service teachers who completed a semester-long internship with a supervising mentor at a virtual school in the Southeastern United States. The secondary pre-service teachers in this study voluntarily chose a placement in the virtual school over a traditional classroom placement for completion of their initial licensure field experience. This study sought to examine why secondary pre-service teachers chose a virtual internship and what their experiences were like as online instructors. A total of six participants completed a sixty-minute semi-structured interview at the completion of the semester-long virtual school internship. Results of the study indicated that secondary pre-service teachers’ primary motivation for entering a virtual internship experience was “convenience.” Additionally, participants felt prepared for future employment in virtual schools, but had some reservations about their prospects in a traditional classroom setting. Keywords: Virtual schools, qualitative, pre-service teachers, teacher education, technology.
Article
Full-text available
The article summarizes the results of a comparative study of Reading comprehension texts used in B2 level tests: the Unified (Russian) State Exam in English (EGE) and the Cambridge First Certificate in English (FCE). The research focused on six parameters measured with Coh-Metrix, a computational tool producing indices of the linguistic and discourse representations of a text: narrativity, syntactic simplicity, word concreteness, referential cohesion, deep cohesion, and Flesch Reading Ease. The research shows that the complexity added to EGE texts by their lower cohesion (relative to FCE texts) is balanced by simpler syntax and higher narrativity, resulting in about the same overall complexity for the two sets of texts studied. EGE and FCE texts correspond to grade six and show very similar Flesch Reading Ease means (FCE: 71.06; EGE: 78.25), which fit the FAIRLY EASY band.
Article
Full-text available
Coh-Metrix provides indices for the characteristics of texts on multiple levels of analysis, including word characteristics, sentence characteristics, and the discourse relationships between ideas in text. Coh-Metrix was developed to provide a wide range of indices within one tool. This chapter describes Coh-Metrix and studies that have been conducted validating the Coh-Metrix indices. Coh-Metrix can be used to better understand differences between texts and to explore the extent to which linguistic and discourse features successfully distinguish between text types. Coh-Metrix can also be used to develop and improve natural language processing approaches. We also describe the Coh-Metrix Text Easability Component Scores, which provide a picture of text ease (and hence potential challenges). The Text Easability components provided by Coh-Metrix go beyond traditional readability measures by providing metrics of text characteristics on multiple levels of language and discourse.
Article
Full-text available
Concreteness ratings are presented for 37,058 English words and 2,896 two-word expressions (such as zebra crossing and zoom in), obtained from over 4,000 participants by means of a norming study using Internet crowdsourcing for data collection. Although the instructions stressed that the assessment of word concreteness would be based on experiences involving all senses and motor responses, a comparison with the existing concreteness norms indicates that participants, as before, largely focused on visual and haptic experiences. The reported data set is a subset of a comprehensive list of English lemmas and contains all lemmas known by at least 85 % of the raters. It can be used in future research as a reference list of generally known English lemmas.
Article
Full-text available
The world has been rapidly changing, and this change has also been affecting how we learn, how we teach and how we assess teaching and learning. Old, rigid models are being replaced with more collaborative ones; the 21st century’s language classroom is therefore immensely different from the twentieth century’s. The focus is no longer on grammar, memorization and learning by rote, but rather on learning in an authentic way and using language and cultural knowledge as a means to connect to others around the globe. The purpose of this paper is to explore practical implications of constructionism in learning languages online at early ages. With rapid technological development and high interest in language learning, new opportunities for online language learning are emerging. The scope of this paper is to demonstrate practical uses of online tools for second language learning at the primary school level. The current results indicate that students engage with the target language, which was new to them, through the online learning tool.
Article
The main purpose of this study was to examine the validity of the approach to lexical diversity assessment known as the measure of textual lexical diversity (MTLD). The index for this approach is calculated as the mean length of word strings that maintain a criterion level of lexical variation. To validate the MTLD approach, we compared it against the performances of the primary competing indices in the field, which include vocd-D, TTR, Maas, Yule's K, and an HD-D index derived directly from the hypergeometric distribution function. The comparisons involved assessments of convergent validity, divergent validity, internal validity, and incremental validity. The results of our assessments of these indices across two separate corpora suggest three major findings. First, MTLD performs well with respect to all four types of validity and is, in fact, the only index not found to vary as a function of text length. Second, HD-D is a viable alternative to the vocd-D standard. And third, three of the indices (MTLD, vocd-D or HD-D, and Maas) appear to capture unique lexical information. We conclude by advising researchers to consider using MTLD, vocd-D (or HD-D), and Maas in their studies, rather than any single index, noting that lexical diversity can be assessed in many ways and each approach may be informative as to the construct under investigation.
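The MTLD procedure described above (mean length of word strings that maintain a criterion type-token ratio) can be sketched in a few lines. This is a simplified one-directional version: the published algorithm also runs a reverse pass and averages the two, and 0.72 is the conventional TTR threshold:

```python
def mtld_forward(tokens, threshold=0.72):
    """One-directional MTLD: count 'factors', i.e. runs of tokens
    whose type-token ratio (TTR) stays above the threshold."""
    factors = 0.0
    types, count = set(), 0
    for tok in tokens:
        count += 1
        types.add(tok.lower())
        if len(types) / count <= threshold:
            factors += 1          # run exhausted its diversity; close it
            types, count = set(), 0
    if count > 0:
        # partial factor: how far the remaining run's TTR has fallen
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - threshold)
    return len(tokens) / factors if factors > 0 else float(len(tokens))
```

A highly repetitive token stream closes factors quickly and yields a low MTLD, while a stream of all-distinct tokens never dips below the threshold, which illustrates the length-insensitivity the study reports.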
Gabitov, A.I. & Ilyasova, L.G. (2016). Use of automated instruments of text analysis to provide proper difficulty level of English language educational materials. Problems of Modern Pedagogical Education: Pedagogy and Psychology, 53(3), 101-108.
McCarthy, Ph.M., Lightman, E.J., Dufty, D.F. & McNamara, D.S. (2006). Using Coh-Metrix to assess distributions of cohesion and difficulty: An investigation of the structure of high school textbooks. In Proceedings of the 28th Annual Conference of the Cognitive Science Society (pp. 190-195). Mahwah, NJ: Erlbaum.
Readability Formulas. Free readability tools to check for Reading Levels, Reading Assessment, and Reading Grade Levels. Retrieved from http://www.readabilityformulas.com.