Article

Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel

Authors: J. Peter Kincaid, Robert P. Fishburne Jr., Richard L. Rogers, and Brad S. Chissom

... Unique Words/Bigrams: Count of unique words/bigrams used across all posts. Flesch Reading Ease: Flesch Reading Ease score, discretized into separate features in increments of 10 from 0 to 100 [28]. Flesch-Kincaid Grade Level: Flesch-Kincaid grade level, discretized into separate features in increments of 1 from 0 to 20 [28]. Net Votes Received: Total net upvotes users' posts received (positive − negative). ...
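A minimal sketch of the discretization described above, turning each score into one-hot bin features (the binning function and feature names are illustrative, not taken from the cited paper):

```python
# Sketch: bucket the Flesch Reading Ease score into 10-point bins (0-100) and the
# Flesch-Kincaid grade into 1-grade bins (0-20), producing one indicator feature
# per bin. Feature names are illustrative only.

def discretize(score: float, low: float, high: float, step: float, prefix: str) -> dict:
    """Return one-hot style features, one per bin of width `step`."""
    clipped = min(max(score, low), high - 1e-9)  # clamp so the top value falls in the last bin
    features = {}
    b = low
    while b < high:
        name = f"{prefix}_{b:g}_{b + step:g}"
        features[name] = 1 if b <= clipped < b + step else 0
        b += step
    return features

fre_features = discretize(62.4, 0, 100, 10, "flesch_reading_ease")   # bin 60-70 fires
fkgl_features = discretize(7.3, 0, 20, 1, "fk_grade")                # bin 7-8 fires
```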
Preprint
Model evaluation -- the process of making inferences about the performance of predictive models -- is a critical component of predictive modeling research in learning analytics. We survey the state of the practice with respect to model evaluation in learning analytics, which overwhelmingly uses only naive methods for model evaluation or statistical tests which are not appropriate for predictive model evaluation. We conduct a critical comparison of both null hypothesis significance testing (NHST) and a preferred Bayesian method for model evaluation. Finally, we apply three methods -- the naïve average commonly used in learning analytics, NHST, and Bayesian -- to a predictive modeling experiment on a large set of MOOC data. We compare 96 different predictive models, including different feature sets, statistical modeling algorithms, and tuning hyperparameters for each, using this case study to demonstrate the different experimental conclusions these evaluation techniques provide.
... Readability features evaluate the readability level of texts as done in Flesch (1948); Kincaid et al. (1975); Shijaku and Canhasi (2023) using Flesch Reading Ease and Flesch-Kincaid Grade Level. We use the Python library Textstat 17 to compute these metrics. ...
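As a rough illustration of the computation described above, a short sketch using the Textstat library named in the snippet (the sample text is made up; exact scores depend on Textstat's internal syllable counting):

```python
# Minimal sketch: compute the two readability metrics with the Textstat library.
import textstat

text = "The cat sat on the mat. It was a sunny day."

fre = textstat.flesch_reading_ease(text)    # higher score = easier to read
fkgl = textstat.flesch_kincaid_grade(text)  # approximate U.S. grade level

print(f"Flesch Reading Ease: {fre:.1f}")
print(f"Flesch-Kincaid Grade Level: {fkgl:.1f}")
```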
... The Flesch Reading Ease is an index of text readability, where higher scores denote easier readability and lower scores suggest more complex texts (Flesch, 1948). The Flesch-Kincaid Grade Level offers a grade-level equivalence for the text's comprehensibility, facilitating its assessment by educators, guardians, and librarians for suitability (Kincaid et al., 1975). Both scores are derived from a formula that takes into account the total number of words, sentences, and syllables in the text (Flesch, 1948; Kincaid et al., 1975). Figure 10 shows the Flesch-Kincaid Grade Level score distribution for EN in our Wikipedia data set. ...
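For reference, the two formulas are commonly stated as:

Flesch Reading Ease = 206.835 − 1.015 × (total words / total sentences) − 84.6 × (total syllables / total words)

Flesch-Kincaid Grade Level = 0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) − 15.59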
Article
Full-text available
Chatbots based on large language models (LLMs) like ChatGPT are available to the wide public. These tools can for instance be used by students to generate essays or whole theses from scratch or by rephrasing an existing text. But how does for instance a teacher know whether a text is written by a student or an AI? In this paper, we investigate perplexity, semantic, list lookup, document, error-based, readability, AI feedback and text vector features to classify human-generated and AI-generated texts from the educational domain as well as news articles. We analyze two scenarios: (1) The detection of text generated by AI from scratch, and (2) the detection of text rephrased by AI. Since we assumed that classification is more difficult when the AI has been prompted to create or rephrase the text in a way that a human would not recognize that it was generated or rephrased by an AI, we also investigate this advanced prompting scenario. To train, fine-tune and test the classifiers, we created the Multilingual Human-AI-Generated Text Corpus which contains human-generated, AI-generated and AI-rephrased texts from the educational domain in English, French, German, and Spanish and English texts from the news domain. We demonstrate that the same features can be used for the detection of AI-generated and AI-rephrased texts from the educational domain in all languages and the detection of AI-generated and AI-rephrased news texts. Our best systems significantly outperform GPTZero and ZeroGPT—state-of-the-art systems for the detection of AI-generated text. Our best text rephrasing detection system even outperforms GPTZero by 181.3% relative in F1-score.
... Essentially, the formulas for both tests are based on calculating the average number of words per sentence and the average number of syllables per word by assigning different weights to each of these measures. The calculations are as follows [20,21]: ...
... Values between 0 and 6 indicate a basic reading level, values between 6 and 12 indicate an average reading level, and values between 12 and 18 indicate an advanced reading level. Values of 0-3 indicate that the text is easily understandable by kindergarteners or elementary school students; values of 3-6 indicate elementary students; values of 6-9 indicate middle school students; values of 9-12 indicate high school students; values of 12-15 indicate college students; values of 15-18 indicate college graduates [21,23]. ...
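A small sketch of the grade-band interpretation quoted above (the handling of scores falling exactly on a boundary is a choice made here, not specified in the cited text):

```python
# Map a Flesch-Kincaid grade-level score to the reading-level bands described above.

def interpret_fkgl(score: float) -> str:
    bands = [
        (3,  "kindergarten / early elementary"),
        (6,  "elementary school"),
        (9,  "middle school"),
        (12, "high school"),
        (15, "college"),
        (18, "college graduate"),
    ]
    for upper, label in bands:
        if score < upper:
            return label
    return "college graduate"  # scores above 18 treated as the top band

print(interpret_fkgl(7.3))   # -> "middle school"
print(interpret_fkgl(16.0))  # -> "college graduate"
```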
Article
Full-text available
Background The use of ChatGPT in the field of health has recently gained popularity. In the field of dentistry, ChatGPT can provide services in areas such as dental education and patient education. The aim of this study was to evaluate the quality, readability and originality of pediatric patient/parent information and academic content produced by ChatGPT in the field of pediatric dentistry. Methods A total of 60 questions were asked to ChatGPT for each topic (dental trauma, fluoride, and tooth eruption/oral health) consisting of pediatric patient/parent questions and academic questions. The modified Global Quality Scale (the scoring ranges from 1: poor quality to 5: excellent quality) was used to evaluate the quality of the answers and Flesch Reading Ease and Flesch-Kincaid Grade Level were used to evaluate the readability. A similarity index was used to compare the quantitative similarity of the answers given by the software with the guidelines and academic references in different databases. Results The evaluation of answer quality revealed an average score of 4.3 ± 0.7 for pediatric patient/parent questions and 3.7 ± 0.8 for academic questions, indicating a statistically significant difference (p < 0.05). Academic questions regarding dental trauma received the lowest scores (p < 0.05). However, no significant differences were observed in readability and similarity between ChatGPT answers for different question groups and topics (p > 0.05). Conclusions In pediatric dentistry, ChatGPT provides quality information to patients/parents. ChatGPT, whose answers are difficult for patients/parents to read but show an acceptable similarity rate, needs to be improved in order to interact with people more efficiently and fluently.
... • Automated Readability Index (Kincaid, 1975) • Coleman-Liau Index (Coleman & Liau, 1975) • Flesch-Kincaid Grade Level (Kincaid, 1975) • Gunning Fog Index (Gunning, 1952) • Linsear Write Index (Klare, 1974) • SMOG Index (McLaughlin, 1969) ...
Conference Paper
Full-text available
This study explores the potential of large language models (LLMs), specifically GPT-4 and Bard, in generating teaching cases for information systems (IS) courses. A unique prompt for writing three different types of teaching cases (i.e. a descriptive case, a normative case, and a project-based case) on the same IS topic (i.e. the introduction of blockchain technology in an insurance company) was developed and submitted to each LLM. The generated teaching cases from each LLM were subsequently assessed using subjective content evaluation measures (i.e. relevance and accuracy, complexity and depth, structure and coherence, and creativity) as well as objective readability measures (i.e. Automated Readability Index, Coleman-Liau Index, Flesch-Kincaid Grade Level, Gunning Fog Index, Linsear Write Index, and SMOG Index). The findings suggest that while both LLMs perform well on objective measures, GPT-4 outperforms Bard on subjective measures, indicating a superior ability to create content that is more relevant, complex, structured, coherent, and creative. This research provides initial empirical evidence and highlights the promise of LLMs in enhancing IS education while also acknowledging the need for careful proofreading and further research to optimize their use.
... The readability was measured using the Flesch Reading Ease Score calculator. The scoring system ranges from 0 to 100, based on which the readability is determined as easy or difficult [45,46]. ...
Article
Full-text available
Background: In recent years, there has been remarkable growth in AI-based applications in healthcare, with a significant breakthrough marked by the launch of large language models (LLMs) such as ChatGPT and Google Bard. Patients and health professional students commonly utilize these models due to their accessibility. The increasing use of LLMs in healthcare necessitates an evaluation of their ability to generate accurate and reliable responses. Objective: This study assessed the performance of LLMs in answering orthodontic-related queries through a systematic review and meta-analysis. Methods: A comprehensive search of PubMed, Web of Science, Embase, Scopus, and Google Scholar was conducted up to 31 October 2024. The quality of the included studies was evaluated using the Prediction model Risk of Bias Assessment Tool (PROBAST), and R Studio software (Version 4.4.0) was employed for meta-analysis and heterogeneity assessment. Results: Out of 278 retrieved articles, 10 studies were included. The most commonly used LLM was ChatGPT (10/10, 100% of papers), followed by Google's Bard/Gemini (3/10, 30% of papers), and Microsoft's Bing/Copilot AI (2/10, 20% of papers). Accuracy was primarily evaluated using Likert scales, while the DISCERN tool was frequently applied for reliability assessment. The meta-analysis indicated that the LLMs, such as ChatGPT-4 and other models, do not significantly differ in generating responses to queries related to the specialty of orthodontics. The forest plot revealed a Standard Mean Deviation of 0.01 [CI: 0.42-0.44]. No heterogeneity was observed between the experimental group (ChatGPT-3.5, Gemini, and Copilot) and the control group (ChatGPT-4). However, most studies exhibited a high PROBAST risk of bias due to the lack of standardized evaluation tools. Conclusions: ChatGPT-4 has been extensively used for a variety of tasks and has demonstrated advanced and encouraging outcomes compared to other LLMs, and thus can be regarded as a valuable tool for enhancing educational and learning experiences. While LLMs can generate comprehensive responses, their reliability is compromised by the absence of peer-reviewed references, necessitating expert oversight in healthcare applications.
... The grade levels are determined based on the standards of the US education system with higher scores indicating increased difficulty to read and comprehend. The readability index works inversely to the grade level, meaning higher scores indicate increased readability and comprehension [9]. Given the inherent method of information generation by large language models, concerns regarding plagiarism were assessed by utilizing the online tool Quillbot plagiarism checker to evaluate the similarity of the responses to existing literature [10,11]. ...
Article
It is increasingly important for patients to easily access information regarding their medical conditions to improve their understanding and participation in health care decisions. Artificial Intelligence (AI) has proven to be a fast, efficient, and effective tool in educating patients regarding their health care conditions. The aim of the study is to compare the responses provided by AI tools, ChatGPT and Google Gemini, to assess for conciseness and understandability of information provided for the medical conditions deep vein thrombosis, decubitus ulcers, and hemorrhoids. A cross-sectional original research design was conducted regarding the responses generated by ChatGPT and Google Gemini for the post-surgical complications of deep vein thrombosis, decubitus ulcers, and hemorrhoids. Each response was evaluated by the Flesch-Kincaid calculator for total number of words, sentences, average words per sentence, average syllables per word, grade level, and ease score. Additionally, the similarity score was evaluated using QuillBot and reliability using a modified DISCERN score. These results were then analyzed by the unpaired or two sample t-test to compare the averages between the two AI tools to conclude which one was superior. ChatGPT required a higher education level to understand, as suggested by the higher grade levels and lower ease scores. The easiest brochure was for deep vein thrombosis, which had the lowest ease score and highest grade level. ChatGPT displayed more similarity with information provided on the internet as calculated by the plagiarism calculator, QuillBot. The reliability score via the modified DISCERN score showed that both AI tools were similar. Although there is a difference in the various scores for each AI tool, based on the P values obtained there is not enough evidence to conclude the superiority of one AI tool over the other.
... Data Acquisition: PEMs published online by the American Association of Hip and Knee Surgeons were compiled (n = 48). The Flesch-Kincaid Grade Level (FKGL) and Flesch-Kincaid Reading Ease (FKRE) scores, word count, sentence count, mean number of syllables per word, and mean number of syllables per sentence were obtained using the Textstat and NumPy Python packages 26,27. PEMs were then manually transformed using a standardized prompt in GPT-3.5 (https://chat.openai.com/; ...
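A sketch of how such text statistics could be gathered with the same Textstat and NumPy packages named above; the two PEM strings are placeholders rather than the actual patient education materials:

```python
# Sketch: per-document readability and length statistics aggregated with NumPy.
import textstat
import numpy as np

pems = [  # placeholder texts standing in for the real PEMs
    "Total knee replacement is surgery to replace a worn knee joint.",
    "After hip surgery, gentle walking helps you recover your strength.",
]

fkgl = np.array([textstat.flesch_kincaid_grade(t) for t in pems])
fre = np.array([textstat.flesch_reading_ease(t) for t in pems])
words = np.array([textstat.lexicon_count(t) for t in pems])
sentences = np.array([textstat.sentence_count(t) for t in pems])
syllables_per_word = np.array(
    [textstat.syllable_count(t) / textstat.lexicon_count(t) for t in pems]
)

print("mean FKGL:", fkgl.mean())
print("mean FRE:", fre.mean())
print("mean words:", words.mean(), "mean sentences:", sentences.mean())
print("mean syllables/word:", syllables_per_word.mean())
```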
Article
Full-text available
Background This study assesses the effectiveness of large language models (LLMs) in simplifying complex language within orthopaedic patient education materials (PEMs) and identifies predictive factors for successful text transformation. Methods We transformed 48 orthopaedic PEMs using GPT-4, GPT-3.5, Claude 2, and Llama 2. The readability, quantified by the Flesch-Kincaid Reading Ease (FKRE) and Flesch-Kincaid Grade Level (FKGL) scores, was measured before and after transformation. Analysis included text characteristics such as syllable count, word length, and sentence length. Statistical and machine learning methods evaluated the correlations and predictive capacity of these features for transformation success. Results All LLMs improved FKRE and FKGL scores (p < 0.01). GPT-4 showed superior performance, transforming PEMs to a seventh-grade reading level (mean FKGL, 6.72 ± 0.99), with higher FKRE and lower FKGL than other models. GPT-3.5, Claude 2, and Llama 2 significantly shortened sentences and overall text length (p < 0.01). Importantly, correlation analysis revealed that transformation success varied substantially with the model used, depending on original text factors such as word length and sentence complexity. Conclusions LLMs successfully simplify orthopaedic PEMs, with GPT-4 leading in readability improvement. This study highlights the importance of initial text characteristics in determining the effectiveness of LLM transformations, offering insights for optimizing orthopaedic health literacy initiatives using artificial intelligence (AI). Clinical Relevance This study provides critical insights into the ability of LLMs to simplify complex orthopaedic PEMs, enhancing their readability without compromising informational integrity. By identifying predictive factors for successful text transformation, this research supports the application of AI in improving health literacy, potentially leading to better patient comprehension and outcomes in orthopaedic care.
... This score is particularly valuable in healthcare contexts, where clear communication of complex medical information is crucial for patient comprehension and decision-making. Researchers and healthcare professionals often rely on FKRE to ensure that health materials are presented in a manner that is accessible and understandable to patients and the general public 11. ...
Article
Full-text available
This study aimed to investigate the quality, readability, and comprehensibility of responses provided by ChatGPT to the most frequently searched topics using the keywords “fibromyalgia” and “fibromyalgia treatment.” Popular keywords for “fibromyalgia” and “fibromyalgia treatment” were obtained using Google Trends and entered into ChatGPT 3.5. The Ensuring Quality Information for Patients (EQIP), Flesch-Kincaid Reading Ease (FKRE), and Flesch-Kincaid Grade Level (FKGL) scores were calculated for the generated responses. These scores were assessed by two experts, and comparisons were made between the fibromyalgia and fibromyalgia treatment keywords. General fibromyalgia information had a mean FKRE score of -2.98 ± 13.53. The mean FKGL score was 19.18 ± 2.36. The mean EQIP score was 41.60 ± 9.18, reflecting high-quality information. Statistically significant differences were observed in EQIP scores among different categories, with the “Drug, Medication, or Product” group scoring lower than other groups. Comparisons between fibromyalgia and fibromyalgia treatment keywords showed that general fibromyalgia information had higher FKRE scores, while treatment-related content had higher FKGL scores, indicating greater complexity. The findings highlight that while general information about fibromyalgia provided by ChatGPT is more readable, texts concerning fibromyalgia treatment are more complex.
... To assess the readability of our generated documents, we employ established readability metrics, including Kincaid [40], Flesch Reading Ease (FRE) [26] and ARI [68], and compare our collection GenTREC with Disk 4-5. Table 4 shows the results. ...
Preprint
Full-text available
Building test collections for Information Retrieval evaluation has traditionally been a resource-intensive and time-consuming task, primarily due to the dependence on manual relevance judgments. While various cost-effective strategies have been explored, the development of such collections remains a significant challenge. In this paper, we present GenTREC , the first test collection constructed entirely from documents generated by a Large Language Model (LLM), eliminating the need for manual relevance judgments. Our approach is based on the assumption that documents generated by an LLM are inherently relevant to the prompts used for their generation. Based on this heuristic, we utilized existing TREC search topics to generate documents. We consider a document relevant only to the prompt that generated it, while other document-topic pairs are treated as non-relevant. To introduce realistic retrieval challenges, we also generated non-relevant documents, ensuring that IR systems are tested against a diverse and robust set of materials. The resulting GenTREC collection comprises 96,196 documents, 300 topics, and 18,964 relevance "judgments". We conducted extensive experiments to evaluate GenTREC in terms of document quality, relevance judgment accuracy, and evaluation reliability. Notably, our findings indicate that the ranking of IR systems using GenTREC is compatible with the evaluations conducted using traditional TREC test collections, particularly for P@100, MAP, and RPrec metrics. Overall, our results show that our proposed approach offers a promising, low-cost alternative for IR evaluation, significantly reducing the burden of building and maintaining future IR evaluation resources.
... We evaluate our method using BLEU [17], ROUGE [14], TF-IDF Similarity [22], Jaccard Similarity [9], BERTScore [25], compression Rate [20], and Flesch-Kincaid readability score [10]. Appendix A.3 shows the detailed description of these encoders. ...
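Two of the simpler metrics listed above can be sketched directly; the definitions used here (token-level Jaccard similarity and compression rate as the ratio of compressed to original word count) are assumptions, since conventions vary across the cited works:

```python
# Sketch of two simple evaluation metrics under assumed definitions.

def jaccard_similarity(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def compression_rate(original: str, compressed: str) -> float:
    """Ratio of compressed to original length, measured in words."""
    return len(compressed.split()) / max(len(original.split()), 1)

original = "Please summarize the following long document about readability metrics."
compressed = "Summarize document about readability metrics."
print(jaccard_similarity(original, compressed))
print(compression_rate(original, compressed))
```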
Preprint
Despite the recent success of Large Language Models (LLMs), it remains challenging to feed LLMs with long prompts due to the fixed size of LLM inputs. As a remedy, prompt compression becomes a promising solution by removing redundant tokens in the prompt. However, using LLM in the existing works requires additional computation resources and leads to memory overheads. To address it, we propose ICPC (In-context Prompt Compression), a novel and scalable prompt compression method that adaptively reduces the prompt length. The key idea of ICPC is to calculate the probability of each word appearing in the prompt using encoders and calculate information carried by each word through the information function, which effectively reduces the information loss during prompt compression and increases the speed of compression. Empirically, we demonstrate that ICPC can effectively compress long texts of different categories and thus achieve better performance and speed on different types of NLP tasks.
... SD = 4.2). The Flesch-Kincaid grade level (Kincaid, Fishburne, Rogers, & Chissom, 1975) was used to calculate the readability scores (M = 5.48, SD = 1.11) using an online instrument (https://charactercalculator.com/flesch-reading-ease/, 2023). ...
Article
Full-text available
A common way of acquiring multiword expressions is through language input, such as during reading and listening. However, this type of learning is slow. Identifying approaches that optimize learning from input, therefore, is an important language-learning endeavor. In the present study, 85 learners of English as a foreign language read short texts with 42 figurative English phrasal verbs, repeated three times. In a counterbalanced design, we manipulated access to definitions (before text, after text, no definition) and typographic enhancement (with bolding, without bolding). The learning was measured by immediate and delayed gap-fill and meaning generation posttests. All posttests showed that learning with definitions was better than without, and that access to definitions after reading was more beneficial than before reading. Typographic enhancement effectively promoted contextual learning of phrasal verbs and increased the learning advantage associated with presenting definitions after reading.
... We therefore consider there is potential to develop more complex scales for evaluating these implications. For example, the measure which assesses complicatedness of the message could be adapted to incorporate approaches from readability tests that aim to determine how easy or difficult the text is to understand (e.g., Kincaid, Fishburne, Rogers, & Chissom, 1975). ...
... Initially developed for educational applications, this metric has been increasingly applied in various contexts. • Automated Readability Index: The Automated Readability Index (ARI) is another well-known readability test for English texts [27]. The ARI score leverages the characters-per-word metric as an alternative to the syllables-per-word approach. ...
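A minimal sketch of the ARI in its commonly cited form, using deliberately naive word and sentence splitting:

```python
# Sketch: Automated Readability Index from character, word, and sentence counts.
import re

def automated_readability_index(text: str) -> float:
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = text.split()
    characters = sum(len(re.sub(r"[^A-Za-z0-9]", "", w)) for w in words)
    n_words = max(len(words), 1)
    return 4.71 * (characters / n_words) + 0.5 * (n_words / sentences) - 21.43

print(automated_readability_index("The cat sat on the mat. It was a sunny day."))
```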
Preprint
Full-text available
The study illustrates a first step towards an ongoing work aimed at developing a dataset of dialogues potentially useful for customer service conversation management between humans and AI chatbots. The approach exploits ChatGPT 3.5 to generate dialogues. One of the requirements is that the dialogue is characterized by a specific language proficiency level of the user; the other one is that the user expresses a specific emotion during the interaction. The generated dialogues were then evaluated for overall quality. The complexity of the language used by both humans and AI agents, has been evaluated by using standard complexity measurements. Furthermore, the attitudes and interaction patterns exhibited by the chatbot at each turn have been stored for further detection of common conversation patterns in specific emotional contexts. The methodology could improve human-AI dialogue effectiveness and serve as a basis for systems that can learn from user interactions.
... Scores range from 0 (no cohesion) to 1 (full thematic consistency). 3) Flesch-Kincaid Score [28] estimates readability, indicating the U.S. grade level needed to understand the text. A higher score suggests advanced content suitable for expert readers, with an ideal range balancing accessibility and sophistication (0-16 scale) for academic purposes. ...
Preprint
Full-text available
Multi-Agent Large Language Models (LLMs) are gaining significant attention for their ability to harness collective intelligence in complex problem-solving, decision-making, and planning tasks. This aligns with the concept of the wisdom of crowds, where diverse agents contribute collectively to generating effective solutions, making it particularly suitable for educational settings. Senior design projects, also known as capstone or final year projects, are pivotal in engineering education as they integrate theoretical knowledge with practical application, fostering critical thinking, teamwork, and real-world problem-solving skills. In this paper, we explore the use of Multi-Agent LLMs in supporting these senior design projects undertaken by engineering students, which often involve multidisciplinary considerations and conflicting objectives, such as optimizing technical performance while addressing ethical, social, and environmental concerns. We propose a framework where distinct LLM agents represent different expert perspectives, such as problem formulation agents, system complexity agents, societal and ethical agents, or project managers, thus facilitating a holistic problem-solving approach. This implementation leverages standard multi-agent system (MAS) concepts such as coordination, cooperation, and negotiation, incorporating prompt engineering to develop diverse personas for each agent. These agents engage in rich, collaborative dialogues to simulate human engineering teams, guided by principles from swarm AI to efficiently balance individual contributions towards a unified solution. We adapt these techniques to create a collaboration structure for LLM agents, encouraging interdisciplinary reasoning and negotiation similar to real-world senior design projects. To assess the efficacy of this framework, we collected six proposals of engineering and computer science of...
... • Entropy: A measure of the randomness and unpredictability of word sequences within a text, entropy provides an assessment of the text's information content. • Flesch-Kincaid Index [33]: This index estimates the educational level required to comprehend a text, based on the average number of syllables per word and the average number of words per sentence. • Coleman-Liau Index [34]: This metric assesses text complexity by calculating the average number of characters per word and the average number of words per sentence. ...
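A sketch of two of the listed features, Shannon entropy over the word distribution and the Coleman-Liau Index in its commonly cited form; the tokenization and counting rules here are simplified assumptions:

```python
# Sketch: word-distribution entropy and the Coleman-Liau Index.
import math
import re
from collections import Counter

def word_entropy(text: str) -> float:
    """Shannon entropy (bits) of the word frequency distribution."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def coleman_liau(text: str) -> float:
    words = text.split()
    n_words = max(len(words), 1)
    letters = sum(len(re.sub(r"[^A-Za-z]", "", w)) for w in words)
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    L = letters / n_words * 100      # average letters per 100 words
    S = sentences / n_words * 100    # average sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8

sample = "The cat sat on the mat. It was a sunny day."
print(word_entropy(sample), coleman_liau(sample))
```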
Article
Full-text available
Identifying potentially high-performing students is crucial for universities aiming to enhance educational outcomes, for companies seeking to recruit top talents early, and for advertising platforms looking to optimize targeted marketing. This paper introduces an algorithm designed to identify students with exceptional academic performance by analyzing their subscriptions to communities on the social network VKontakte. The study examines a sample of 4445 students from Tomsk State University with publicly accessible VK profiles. The research methodology involves generating vector representations for each community based on embeddings, topic modeling, sentiment and emotion analysis, as well as text complexity metrics. To generate the embeddings, a separate model was trained and made publicly available on HuggingFace. The integration of diverse features was achieved using attention mechanisms, allowing the model to dynamically weigh their importance and capture intricate interrelations. These representations are then used to construct a digital user profile, capturing the students' interests as reflected in their community subscriptions. Additionally, the machine learning pipeline incorporated stacking to combine predictions from multiple models, enhancing robustness and classification performance. Through a series of experiments, we developed a machine learning algorithm that effectively distinguishes between high- and low-performing students based on these profiles. This approach also enabled the identification and interpretation of key factors differentiating high-performing students from their lower-performing peers. Additionally, we investigated the factors positively and negatively associated with academic performance.
... Based on the devised definitions of the seven scales of perfectionism literacy, we generated an initial pool of 84 items. These items were then reviewed on their clarity, readability (based on Flesch-Kincaid grade level scores; Kincaid et al., 1975), relevance, similarity to other items generated and items of existing perfectionism scales and their appropriateness for lay populations. This process resulted in a revised pool of 76 items. ...
Article
Full-text available
Perfectionism is a multidimensional personality characteristic associated with mental health problems. However, its features are commonly misunderstood, and many people are unaware of the risks it can pose. This study aimed to develop the first self-report measure of perfectionism literacy. That is, the degree of knowledge someone has about perfectionism, its features and consequences, and when and where to seek help if needed. The Perfectionism Literacy Questionnaire (PLQ) was validated over four stages using four samples of community adults (N = 1078 total; Mage = 37.17 years). In stage one, we generated a pool of items. In stage two, we used exploratory and confirmatory factor analysis to derive a 29-item, seven-factor measure. In stage three, we assessed relationships between the PLQ, perfectionism, and attitudes toward help-seeking for mental health support and found the PLQ is distinguishable from these constructs. In stage four, we examined whether the PLQ was responsive to change following an educational video on perfectionism. We found tentative evidence that minimal intervention can increase perfectionism literacy. Our findings suggest that the PLQ is valid and reliable and may be useful for educational purposes and primary prevention of mental health problems.
... Nevertheless, although both scores show some correlations with human assessments, they are not reliable enough for comparing performances of different simplification systems. Some studies also use the Flesch-Kincaid Readability Index for automatic evaluation (Kincaid et al., 1975). Although well-known in readability research, this metric is not adequate for sentence simplification since it is unsuitable for sentence level simplicity evaluation. ...
Article
Full-text available
Access to information is a fundamental human right that contributes to freedom of expression and self-determination. However, information availability alone is not enough: the way in which information is expressed and presented on paper or in digital format can be extremely complicated to understand and act upon for a highly diverse set of people with varying ranges of reading, writing and understanding abilities. For a long time, artificial intelligence (AI) and natural language processing (NLP) have dedicated research efforts in the field of automatic text simplification to the development of methods to automate the production of easy-to-read texts. However, in spite of AI hype, the problem persists. Current NLP models such as large language models (LLMs) are still not well understood, and their use in the development of text simplification needs careful assessment. In this paper, having identified the need for easy-to-read, accessible texts, we provide a light overview of current methods in NLP and provide a lay explanation of several techniques applied to lexically and syntactically modified texts to make them simpler to read and understand. We conclude by raising awareness on the use of general tools to address this very challenging problem that can affect contemporary lives.
... Most formulae incorporate a representation of sentence length or the number of words per sentence as a proxy for syntactic complexity. This can be seen in the Gunning-Fog Index (GFI; [33]), the Flesch family of indices [34], Dale-Chall [35], the simple measure of gobbledygook (SMOG; [36]), and the LIX readability formula [18]. ...
Conference Paper
Full-text available
Reading skills are crucial for students' success in education and beyond. However, reading proficiency among K-12 students has been declining globally, including in Sweden, leaving many under-prepared for post-secondary education. Additionally, an increasing number of students have reading disorders, such as dyslexia, which require support. Generative artificial intelligence (genAI) technologies , like ChatGPT, may offer new opportunities to improve reading practices by enhancing the readability of educational texts. This study investigates whether ChatGPT-4 can simplify academic texts and which prompting strategies are most effective. We tasked ChatGPT to rewrite 136 academic texts using four prompting approaches: Standard, Meta, Roleplay, and Chain-of-Thought. All four approaches improved text readability, with Meta performing the best overall and the Standard prompt sometimes creating texts that were less readable than the original. This study found variability in the simplified texts, suggesting that different strategies should be used based on the specific needs of individual learners. Overall, the findings highlight the potential of genAI tools, like ChatGPT, to improve the accessibility of academic texts, offering valuable support for students with reading difficulties and promoting more equitable learning opportunities.
... The FKG score was first developed in 1975 for the United States Navy to assess the readability of military manuals. It is possible to get an idea of the readability level of the texts with this score, which is determined as the US grade level (Kincaid et al. 1975). ...
... Previous studies have commonly posited that readability formulas reflect text difficulty and grade level, with an inverse correlation between readability score and text difficulty. Three readability formulas-the Automated Readability Index, the Fog Count and the Flesch Reading Ease Formula-have been employed by the Navy to assess the comprehension level of training manuals (Kincaid et al., 1975). Additionally, formulas such as Dale-Chall, Fog, SMOG and Flesch-Kincaid have been used to predict the difficulty of junior high school English textbooks, with the readability indices derived from these formulas indicating the appropriate grade level for the textbooks (Gopal et al., 2021). ...
Article
Full-text available
Mathematical stories can enhance students' motivation and interest in learning mathematics, thereby positively impacting their academic performance. However, due to resource constraints faced by the creators, generative artificial intelligence (GAI) is employed to create mathematical stories accompanied by images. This study introduces a method for automatically assessing the quality of these multimodal stories by evaluating text-image coherence and textual readability. Using GAI-generated stories for grades 3 to 5 from the US math story learning platform Read Solve Create (RSC), we extracted features related to multimodal semantics and text readability. We then analysed the correlation between these features and student engagement levels, measured by average reading time per story (behavioural engagement) and average drawing tool usage per story (cognitive engagement), derived from browsing logs and interaction metrics on the platform. Our findings reveal that textual features such as conjunctive adverbs, sentence connectors, causal connectives and simplified vocabulary positively correlate with behavioural engagement. Additionally, higher semantic similarity between text and images, as well as the number of operators in the stories, is associated with increased cognitive engagement. This study advances the application of GAI in mathematics education and offers novel insights for instructional material design.
Practitioner notes
What is already known about this topic: Mathematical stories can enhance students' motivation and interest in mathematics, leading to improved academic performance. Generative artificial intelligence (GAI) has been increasingly employed to create multimodal educational content, including mathematical stories with accompanying images, to address content creators' resource constraints. Prior readability research has primarily focused on the analysis of text-based educational content, with less emphasis on the integration and analysis of visual elements.
What this paper adds: Introduces a novel automated multimodal readability assessment method that evaluates the coherence between text and images and the readability of text in GAI-generated mathematical stories. Identifies specific story features, such as the more frequent use of three types of conjunctions (adversative conjunctions, common sentence conjunctions and logical conjunctions) and vocabulary simplicity, that correlate with student engagement.
Implications for practice and/or policy: Educators and curriculum developers are encouraged to utilise automated multimodal readability assessment tools to analyse and refine GAI-generated educational content, aiming to enhance student engagement and learning experience. Suggestions for the design of educational content include the consideration of identified readability features that correlate with higher engagement. Caution should be exercised in handling the association between images and text considering the cognitive load of the instructional materials.
... • Readability: We measure the readability of the generated counterspeech messages via the Flesch Reading Ease (FRES) score [44]. FRES evaluates readability based on the average sentence length and the average number of syllables per word. ...
Preprint
AI-generated counterspeech offers a promising and scalable strategy to curb online toxicity through direct replies that promote civil discourse. However, current counterspeech is one-size-fits-all, lacking adaptation to the moderation context and the users involved. We propose and evaluate multiple strategies for generating tailored counterspeech that is adapted to the moderation context and personalized for the moderated user. We instruct an LLaMA2-13B model to generate counterspeech, experimenting with various configurations based on different contextual information and fine-tuning strategies. We identify the configurations that generate persuasive counterspeech through a combination of quantitative indicators and human evaluations collected via a pre-registered mixed-design crowdsourcing experiment. Results show that contextualized counterspeech can significantly outperform state-of-the-art generic counterspeech in adequacy and persuasiveness, without compromising other characteristics. Our findings also reveal a poor correlation between quantitative indicators and human evaluations, suggesting that these methods assess different aspects and highlighting the need for nuanced evaluation methodologies. The effectiveness of contextualized AI-generated counterspeech and the divergence between human and algorithmic evaluations underscore the importance of increased human-AI collaboration in content moderation.
... This was done by using open source implementations of various readability measures in the NLTK-contrib package of the Natural Language Toolkit (NLTK). More specifically, eight measures were selected as features for the Readability representation: ARI [79], Flesch Reading Ease [33], Flesch-Kincaid Grade Level [52], Gunning Fog Index [38], SMOG Index [61], Coleman Liau Index [16], LIX, and RIX [4]. While these measures are originally developed for written text (and ordinarily may need longer textual input than a few sentences in a transcript), they do reflect complexity in language usage. ...
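A sketch of the LIX and RIX measures mentioned above, under their commonly cited definitions (long words counted as words of more than six letters); tokenization is deliberately naive:

```python
# Sketch: LIX and RIX readability measures based on long-word density.
import re

def lix_rix(text: str) -> tuple[float, float]:
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = [re.sub(r"[^A-Za-z]", "", w) for w in text.split()]
    words = [w for w in words if w]
    n_words = max(len(words), 1)
    long_words = sum(1 for w in words if len(w) > 6)
    lix = n_words / sentences + 100.0 * long_words / n_words
    rix = long_words / sentences
    return lix, rix

print(lix_rix("Readability formulas approximate comprehension difficulty. Short sentences help."))
```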
Preprint
Explainability and interpretability are two critical aspects of decision support systems. Within computer vision, they are critical in certain tasks related to human behavior analysis such as in health care applications. Despite their importance, it is only recently that researchers are starting to explore these aspects. This paper provides an introduction to explainability and interpretability in the context of computer vision with an emphasis on looking at people tasks. Specifically, we review and study those mechanisms in the context of first impressions analysis. To the best of our knowledge, this is the first effort in this direction. Additionally, we describe a challenge we organized on explainability in first impressions analysis from video. We analyze in detail the newly introduced data set, the evaluation protocol, and summarize the results of the challenge. Finally, derived from our study, we outline research opportunities that we foresee will be decisive in the near future for the development of the explainable computer vision field.
... • answer's Flesch Reading Ease [19], or readability, score, • answerer's tenure (i.e., time since joining the site) at the time of the answer, • number of hyperlinks per answer, • binary value denoting whether the answer was eventually accepted (for voting only), • answer score before each vote, • default web page order for an answer (i.e., its relative position), • chronological order of an answer (whether it was first, second, third, etc.), • time since an answer was created, or its age • number of words per answer, • answer's word share, that is the fraction of total words in all answers to the question. ...
Preprint
Crowds can often make better decisions than individuals or small groups of experts by leveraging their ability to aggregate diverse information. Question answering sites, such as Stack Exchange, rely on the "wisdom of crowds" effect to identify the best answers to questions asked by users. We analyze data from 250 communities on the Stack Exchange network to pinpoint factors affecting which answers are chosen as the best answers. Our results suggest that, rather than evaluate all available answers to a question, users rely on simple cognitive heuristics to choose an answer to vote for or accept. These cognitive heuristics are linked to an answer's salience, such as the order in which it is listed and how much screen space it occupies. While askers appear to depend more on heuristics, compared to voting users, when choosing an answer to accept as the most helpful one, voters use acceptance itself as a heuristic: they are more likely to choose the answer after it is accepted than before that very same answer was accepted. These heuristics become more important in explaining and predicting behavior as the number of available answers increases. Our findings suggest that crowd judgments may become less reliable as the number of answers grow.
... Persuasive arguments are more diverse in root reply and full path, but the type-token ratio is surprisingly higher in root truncated: because of correlations with length and argument structure, lexical diversity is hard to interpret for texts of different lengths. Finally, we compute Flesch-Kincaid grade level [26] to represent readability. Although there is no significant difference in root reply, persuasive arguments are more complex in full path. ...
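For concreteness, the type-token ratio mentioned above can be computed as follows; as the snippet notes, the raw ratio depends on text length, so it is hard to compare across texts of different sizes:

```python
# Sketch: type-token ratio as a simple lexical diversity measure.

def type_token_ratio(text: str) -> float:
    tokens = text.lower().split()
    return len(set(tokens)) / max(len(tokens), 1)

print(type_token_ratio("to be or not to be"))  # 4 unique tokens / 6 tokens ≈ 0.67
```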
Preprint
Changing someone's opinion is arguably one of the most important challenges of social interaction. The underlying process proves difficult to study: it is hard to know how someone's opinions are formed and whether and how someone's views shift. Fortunately, ChangeMyView, an active community on Reddit, provides a platform where users present their own opinions and reasoning, invite others to contest them, and acknowledge when the ensuing discussions change their original views. In this work, we study these interactions to understand the mechanisms behind persuasion. We find that persuasive arguments are characterized by interesting patterns of interaction dynamics, such as participant entry-order and degree of back-and-forth exchange. Furthermore, by comparing similar counterarguments to the same opinion, we show that language factors play an essential role. In particular, the interplay between the language of the opinion holder and that of the counterargument provides highly predictive cues of persuasiveness. Finally, since even in this favorable setting people may not be persuaded, we investigate the problem of determining whether someone's opinion is susceptible to being changed at all. For this more difficult task, we show that stylistic choices in how the opinion is expressed carry predictive power.
... The readability of the English text was assessed using Flesch-Kincaid Grade Level (FKGL) (range 0-18), 17 Flesch-Kincaid Reading Ease score (FRE) (range 0-100) 18 and Simple Measure of Gobbledygook Index (SMOG) (based on syllable count) 19 tools. A validated online readability calculator (readble.com) ...
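A sketch of the SMOG Index in its commonly cited form; the vowel-group syllable counter below is a crude stand-in for a proper syllable counter:

```python
# Sketch: SMOG Index from polysyllabic word density.
import math
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of vowels; not dictionary-accurate.
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)

def smog_index(text: str) -> float:
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = [re.sub(r"[^A-Za-z]", "", w) for w in text.split()]
    words = [w for w in words if w]
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30.0 / sentences) + 3.1291

print(smog_index("Audiologists evaluate sensorineural hearing loss. Treatment depends on severity."))
```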
Article
Full-text available
Objective This study investigates ChatGPT's accuracy, readability, understandability, and actionability in responding to patient queries on sudden sensorineural hearing loss (SSNHL) in English and Spanish, when compared to Google responses. The objective is to address concerns regarding its proficiency in addressing medical inquiries when presented in a language divergent from its primary programming. Study Design Observational. Setting Virtual environment. Methods Using ChatGPT 3.5 and Google, questions from the AAO‐HNSF guidelines were presented in English and Spanish. Responses were graded by 2 otolaryngologists proficient in both languages using a 4‐point Likert scale and the PEMAT‐P tool. To ensure uniform application of the Likert scale, a third independent evaluator reviewed the consistency in grading. Readability was evaluated using 3 different tools specific to each language. IBM SPSS Version 29 was used for statistical analysis using one‐way analysis of variance. Results Across both languages, the responses displayed a native‐level language proficiency. Accuracy was comparable between sources and languages. Google's Spanish responses had better readability (effect size 0.35, P < .001), while Google's English responses were more understandable (effect size 0.67, P = .018). ChatGPT's English responses demonstrated the highest level of actionability (60%), though not significantly different when compared to other sources (effect size 0.47, P = .14). Conclusion ChatGPT offers patients comprehensive and guideline‐conforming answers to SSNHL patient medical queries in the 2 most spoken languages in the United States. However, improvements in its readability and understandability are warranted for more accessible patient education.
... The idea underlying the method is the distinction between direct and indirect affective words. For direct affective words, we refer to the WordNet Affect (Strapparava & Valitutti, 2004). We use three indices to compute the difficulty of a text: the Gunning Fog (Gunning, 1952), Flesch (Flesch, 1946), and Kincaid (Kincaid et al., 1975) indices. These metrics combine factors such as word and sentence length that are easy to compute and approximate the linguistic elements that have an impact on readability. ...
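A sketch of the Gunning Fog index in its commonly cited form, reusing a crude vowel-group syllable heuristic; published implementations also exclude proper nouns and some familiar compounds from the complex-word count:

```python
# Sketch: Gunning Fog index from sentence length and complex-word density.
import re

def syllables(word: str) -> int:
    # Rough heuristic: count groups of vowels; not dictionary-accurate.
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)

def gunning_fog(text: str) -> float:
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = [re.sub(r"[^A-Za-z]", "", w) for w in text.split()]
    words = [w for w in words if w]
    n_words = max(len(words), 1)
    complex_words = sum(1 for w in words if syllables(w) >= 3)  # 3+ syllables
    return 0.4 * (n_words / sentences + 100.0 * complex_words / n_words)

print(gunning_fog("Urban legends spread because they feel plausible. They imitate news reports."))
```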
Preprint
Urban legends are a genre of modern folklore, consisting of stories about rare and exceptional events, just plausible enough to be believed, which tend to propagate inexorably across communities. In our view, while urban legends represent a form of "sticky" deceptive text, they are marked by a tension between the credible and incredible. They should be credible like a news article and incredible like a fairy tale to go viral. In particular we will focus on the idea that urban legends should mimic the details of news (who, where, when) to be credible, while they should be emotional and readable like a fairy tale to be catchy and memorable. Using NLP tools we will provide a quantitative analysis of these prototypical characteristics. We also lay out some machine learning experiments showing that it is possible to recognize an urban legend using just these simple features.
... LL n is the most computationally intensive feature in our set. The Flesch-Kincaid grade (F-K) is another readability metric defined by Kincaid et al. (1975) and already used by Burel et al. (2012) in previous work. F-K is calculated as follows: ...
Preprint
Technical Q&A sites have become essential for software engineers as they constantly seek help from other experts to solve their work problems. Despite their success, many questions remain unresolved, sometimes because the asker does not acknowledge any helpful answer. In these cases, an information seeker can only browse all the answers within a question thread to assess their quality as potential solutions. We approach this time-consuming problem as a binary-classification task where a best-answer prediction model is built to identify the accepted answer among those within a resolved question thread, and the candidate solutions to those questions that have received answers but are still unresolved. In this paper, we report on a study aimed at assessing 26 best-answer prediction models in two steps. First, we study how models perform when predicting best answers in Stack Overflow, the most popular Q&A site for software engineers. Then, we assess performance in a cross-platform setting where the prediction models are trained on Stack Overflow and tested on other technical Q&A sites. Our findings show that the choice of the classifier and automated parameter tuning have a large impact on the prediction of the best answer. We also demonstrate that our approach to the best-answer prediction problem is generalizable across technical Q&A sites. Finally, we provide practical recommendations to Q&A platform designers to curate and preserve the crowdsourced knowledge shared through these sites.
... Based on our analysis, the BPI-SF requires a seventh grade reading level. 40 Given that the mean education level for both study groups was above US high school education they should have been able to complete it accurately. Consistent with this, word reading level between the groups was not significantly different. ...
Article
Full-text available
Introduction: Accurate assessment of pain severity is important for caring for patients with sickle cell disease (SCD). The Brief Pain Inventory was developed to address limitations of previous pain-rating metrics and is available in a short form (BPI-SF). However, the BPI-SF is a self-report scale dependent on patient comprehension and interpretation of items. Objective: To examine patterns in how patients completed the BPI-SF and determine whether incorrectly completing the BPI-SF was related to cognitive functioning or education. Methods: A secondary analysis was completed using data from a study examining brain aging and cognitive impairment in SCD. T-tests were performed to examine whether neurocognitive function (immediate and delayed memory, visuospatial skills, attention, and language), word reading, and years of education differed based on correct BPI-SF completion. Results: The sample (n = 71) was 43.7% male, 98.6% African American or mixed race. Of that, 53.5% had sickle cell anemia, and the mean years of education was 13.6. Overall, 21.1% of participants (n = 15) incorrectly completed the BPI-SF pain severity items, and 57.7% completed the body map item incorrectly. Those who completed the severity items incorrectly had statistically significant differences in education. Group differences in neurocognitive function were no longer significant after familywise error rates were controlled for. Literacy was not associated with error rates. Conclusion: Education level may influence patients' ability to correctly complete the BPI-SF. Findings suggest that careful consideration is warranted for use of the BPI in patients with SCD. Recommended revisions to the BPI include simplifying the language, shortening sentence length, and clearly specifying the timeframes.
... OOV words are skipped. • combination: n-grams, word2vec and readability features (these include length of post in words and characters, as well as the Flesch-Kincaid Grade level score [20]). ...
Preprint
Sarcasm is a peculiar form of sentiment expression, where the surface sentiment differs from the implied sentiment. The detection of sarcasm in social media platforms has been applied in the past mainly to textual utterances where lexical indicators (such as interjections and intensifiers), linguistic markers, and contextual information (such as user profiles, or past conversations) were used to detect the sarcastic tone. However, modern social media platforms allow to create multimodal messages where audiovisual content is integrated with the text, making the analysis of a mode in isolation partial. In our work, we first study the relationship between the textual and visual aspects in multimodal posts from three major social media platforms, i.e., Instagram, Tumblr and Twitter, and we run a crowdsourcing task to quantify the extent to which images are perceived as necessary by human annotators. Moreover, we propose two different computational frameworks to detect sarcasm that integrate the textual and visual modalities. The first approach exploits visual semantics trained on an external dataset, and concatenates the semantics features with state-of-the-art textual features. The second method adapts a visual neural network initialized with parameters trained on ImageNet to multimodal sarcastic posts. Results show the positive effect of combining modalities for the detection of sarcasm across platforms and methods.
... George Klare provided the original definition of readability [19] as "the ease of understanding or comprehension due to the style of writing". For measuring readability of Reddit comments, we use the so-called Flesch-Kincaid grade level [18] representing the readability of a piece of text by the number of years of education needed to understand the text upon first reading; it contrasts the number of words, sentences and syllables. It is defined as follows: 0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) − 15.59 ...
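A direct implementation of the formula quoted above, with naive word and sentence splitting and a rough vowel-group syllable counter (so scores will differ slightly from tools that use dictionary-based syllable counts):

```python
# Sketch: Flesch-Kincaid grade level computed straight from the quoted formula.
import re

def syllables(word: str) -> int:
    # Rough heuristic: count groups of vowels; not dictionary-accurate.
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)

def flesch_kincaid_grade(text: str) -> float:
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = [re.sub(r"[^A-Za-z]", "", w) for w in text.split()]
    words = [w for w in words if w]
    n_words = max(len(words), 1)
    total_syllables = sum(syllables(w) for w in words)
    return 0.39 * (n_words / sentences) + 11.8 * (total_syllables / n_words) - 15.59

print(flesch_kincaid_grade("The cat sat on the mat. It was a sunny day."))
```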
Preprint
This article presents evidence of performance deterioration in online user sessions quantified by studying a massive dataset containing over 55 million comments posted on Reddit in April 2015. After segmenting the sessions (i.e., periods of activity without a prolonged break) depending on their intensity (i.e., how many posts users produced during sessions), we observe a general decrease in the quality of comments produced by users over the course of sessions. We propose mixed-effects models that capture the impact of session intensity on comments, including their length, quality, and the responses they generate from the community. Our findings suggest performance deterioration: Sessions of increasing intensity are associated with the production of shorter, progressively less complex comments, which receive declining quality scores (as rated by other users), and are less and less engaging (i.e., they attract fewer responses). Our contribution evokes a connection between cognitive and attention dynamics and the usage of online social peer production platforms, specifically the effects of deterioration of user performance.
Article
Full-text available
This experimental study investigates the potential impact of employing automatic speech recognition (ASR) and speech translation (ST) in consecutive interpreting (CI) through the use of a computer-assisted interpreting (CAI) tool. The tool used is Sight-Terp, an ASR-supported CAI tool developed and designed by the first author of this study. It offers multiple features, such as ASR, real-time ST, named entity highlighting, and automatically enumerated segmentation. The methodology adopted in this research involves a within-subjects design, assessing participants’ output in scenarios with and without the use of Sight-Terp on a tablet. Twelve participants were recruited for the experimental setup and asked to interpret four English speeches into Turkish in long CI mode, using Sight-Terp for two of them and a pen and paper for the other two. The data analysis is grounded in parameters of both accuracy and fluency. To assess differences in accuracy across the two settings, accuracy was measured as the mean count of correctly rendered semantic units (units of meaning), as defined by Seleskovitch (1989). Fluency, in turn, was quantified by tracking the frequency of disfluency markers, including false starts, filled pauses, filler words, whole-word repetitions, broken words, and incomplete phrases in each session. The results show that the integration of ASR into the two CI tasks improved the accuracy of the participants’ renditions. Concurrently, however, it led to an increase in disfluencies and extended task durations compared to the tasks in which Sight-Terp was not used. The study outcomes also suggest potential areas of improvement and modifications that could further enhance the utility of the tool. Future empirical studies using Sight-Terp will tell us more about the feasibility of ASR in the interpreting process and the cognitive aspects of human-machine interaction in CI.
Article
Experts in various fields routinely perform methodical writing tasks to plan, organize, and report their work. From a clinician writing a differential diagnosis for a patient to a teacher writing a lesson plan for students, these tasks are pervasive, requiring the methodical generation of structured long-form output for a given input. We develop a typology of methodical tasks structured in the form of a task objective, procedure, input, and output, and introduce DoLoMiTes, a novel benchmark with specifications for 519 such tasks elicited from hundreds of experts from across 25 fields. Our benchmark further contains specific instantiations of methodical tasks with concrete input and output examples (1,857 in total), which we obtain by collecting expert revisions of up to 10 model-generated examples of each task. We use these examples to evaluate contemporary language models, highlighting that automating methodical tasks is a challenging long-form generation problem, as it requires performing complex inferences while drawing upon the given context as well as domain knowledge. Our dataset is available at https://dolomites-benchmark.github.io/.
Preprint
Creativity relates to the ability to generate novel and effective ideas in a domain of interest. How are such creative ideas generated? One possible mechanism that supports creative ideation and is gaining increased empirical attention is question asking. Asking questions is a likely cognitive mechanism that allows problems to be defined, facilitating creative problem solving. However, much is unknown about the exact role of questions in creativity. This work presents an attempt to apply text mining methods to measure the cognitive potential of questions, taking into account, among other factors, (a) question type, (b) question complexity, and (c) the content of the answer. This contribution summarizes the history of question mining as a part of creativity research, along with the natural language processing methods deemed useful in the study. In addition, a novel approach is proposed, implemented, and applied to five datasets. The experimental results are comprehensively analyzed, suggesting that natural language processing has a role to play in creativity research.
Preprint
Full-text available
This paper analyzes how writing style affects the dispersion of embedding vectors across multiple, state-of-the-art language models. While early transformer models primarily aligned with topic modeling, this study examines the role of writing style in shaping embedding spaces. Using a literary corpus that alternates between topics and styles, we compare the sensitivity of language models across French and English. By analyzing the particular impact of style on embedding dispersion, we aim to better understand how language models process stylistic information, contributing to their overall interpretability.
Article
Full-text available
Purpose The artificial intelligence (AI) chatbot ChatGPT has become a major tool for generating responses in healthcare. This study assessed ChatGPT’s ability to generate French preoperative patient-facing medical information (PFI) in rhinology at a comparable level to material provided by an academic source, the French Society of Otorhinolaryngology (Société Française d’Otorhinolaryngologie et Chirurgie Cervico-Faciale, SFORL). Methods ChatGPT and SFORL French preoperative PFI in rhinology were compared by analyzing responses to 16 questions regarding common rhinology procedures: ethmoidectomy, sphenoidotomy, septoplasty, and endonasal dacryocystorhinostomy. Twenty rhinologists assessed the clarity, comprehensiveness, accuracy, and overall quality of the information, while 24 nonmedical individuals analyzed the clarity and overall quality. Six readability formulas were used to compare readability scores. Results Among rhinologists, no significant difference was found between ChatGPT and SFORL regarding clarity (7.61 ± 0.36 vs. 7.53 ± 0.28; p = 0.485), comprehensiveness (7.32 ± 0.77 vs. 7.58 ± 0.50; p = 0.872), and accuracy (inaccuracies: 60% vs. 40%; p = 0.228), respectively. Non-medical individuals scored the clarity of ChatGPT significantly higher than that of the SFORL (8.16 ± 1.16 vs. 6.32 ± 1.33; p < 0.0001). The non-medical individuals chose ChatGPT as the most informative source significantly more often than rhinologists (62.8% vs. 39.7%, p < 0.001). Conclusion ChatGPT-generated French preoperative PFI in rhinology was comparable to SFORL-provided PFI regarding clarity, comprehensiveness, accuracy, readability, and overall quality. This study highlights ChatGPT’s potential to increase accessibility to high quality PFI and suggests its use by physicians as a complement to academic resources written by learned societies such as the SFORL.
Article
Full-text available
Background Patient-reported outcomes are essential to understanding success in plastic surgery procedures, many of which aim to improve quality of life. Patient-reported outcome measures (PROMs) should be written at or below the sixth-grade reading level recommended by the American Medical Association. This study aimed to evaluate the readability of plastic surgery PROMs. Methods We conducted a literature review to identify validated, commonly used PROMs in plastic surgery. We extracted PROMs’ text and instructions and analyzed readability using several approaches that estimate the grade level required for comprehension. Our primary outcome was the Simple Measure of Gobbledygook (SMOG) index, which accounts for word complexity and expects 100% comprehension at the assigned grade-level rating. We also included the Flesch-Kincaid grade level, Coleman-Liau index, and automated readability index. Results Forty-three PROMs met the inclusion criteria. The mean SMOG index was 8.2 (SD = 1.3), indicating an eighth-grade reading level. Mean reading grade levels measured by the Flesch-Kincaid grade level, Coleman-Liau index, and automated readability index ranged from third to sixth grade, although these may underestimate readability difficulties. Only 6 (14%) PROMs had a SMOG index at or below the sixth-grade level. PROM instructions had significantly higher reading levels than the questions/responses for all readability indexes (P < 0.01). Conclusions PROMs used in plastic surgery, including their instructions, exceed the reading level recommended by the American Medical Association. This may limit comprehension and accurate completion and compromise validity and reliability. PROMs should be written and designed to be accessible to patients of all literacy levels.
Preprint
Full-text available
Delivering high-quality content is crucial for effective reading comprehension and successful learning. Ensuring educational materials are interpreted as intended by their authors is a persistent challenge, especially with the added complexity of multimedia and interactivity in the digital age. Authors must continuously revise their materials to meet learners' evolving needs. Detecting comprehension barriers and identifying actionable improvements within documents is complex, particularly in education where reading is fundamental. This study presents an analytical framework to help course designers enhance educational content to better support learning outcomes. Grounded in a robust theoretical foundation integrating learning analytics, reading comprehension, and content revision, our approach introduces usage-based document reengineering. This methodology adapts document content and structure based on insights from analyzing digital reading traces, that is, interactions between readers and content. We define reading sessions to capture these interactions and develop indicators to detect comprehension challenges. Our framework enables authors to receive tailored content revision recommendations through an interactive dashboard, presenting actionable insights from reading activity. The proposed approach was implemented and evaluated using data from a European e-learning platform. Evaluations validate the framework's effectiveness, demonstrating its capacity to empower authors with data-driven insights for targeted revisions. The findings highlight the framework's ability to enhance educational content quality, making it more responsive to learners' needs. This research significantly contributes to learning analytics and content optimization, offering practical tools to improve educational outcomes and inform future developments in e-learning.
Article
Background/objective: This study aimed to evaluate the accuracy, comprehensiveness, and readability of responses generated by various Large Language Models (LLMs) (ChatGPT-3.5, Gemini, Claude 3, and GPT-4.0) in the clinical context of uveitis, utilizing a meticulous grading methodology. Methods: Twenty-seven clinical uveitis questions were presented individually to four Large Language Models (LLMs): ChatGPT (versions GPT-3.5 and GPT-4.0), Google Gemini, and Claude. Three experienced uveitis specialists independently assessed the responses for accuracy using a three-point scale across three rounds with a 48-hour wash-out interval. The final accuracy rating for each LLM response ('Excellent', 'Marginal', or 'Deficient') was determined through a majority consensus approach. Comprehensiveness was evaluated using a three-point scale for responses rated 'Excellent' in the final accuracy assessment. Readability was determined using the Flesch-Kincaid Grade Level formula. Statistical analyses were conducted to discern significant differences among LLMs, employing a significance threshold of p < 0.05. Results: Claude 3 and ChatGPT 4 demonstrated significantly higher accuracy compared to Gemini (p < 0.001). Claude 3 also showed the highest proportion of 'Excellent' ratings (96.3%), followed by ChatGPT 4 (88.9%). ChatGPT 3.5, Claude 3, and ChatGPT 4 had no responses rated as 'Deficient', unlike Gemini (14.8%) (p = 0.014). ChatGPT 4 exhibited greater comprehensiveness compared to Gemini (p = 0.008), and Claude 3 showed higher comprehensiveness compared to Gemini (p = 0.042). Gemini showed significantly better readability compared to ChatGPT 3.5, Claude 3, and ChatGPT 4 (p < 0.001). Gemini also had fewer words, letter characters, and sentences compared to ChatGPT 3.5 and Claude 3. Conclusions: Our study highlights the outstanding performance of Claude 3 and ChatGPT 4 in providing precise and thorough information regarding uveitis, surpassing Gemini. ChatGPT 4 and Claude 3 emerge as pivotal tools in improving patient understanding and involvement in their uveitis healthcare journey.
Article
To compare the quality and readability of patient education materials on myringotomy tubes from artificial intelligence and Google search. Three questions were posed to ChatGPT and Google Gemini addressing “Condition,” “Investigation,” and “Treatment” domains. Google was queried for “Ear tubes,” “Myringotomy and tubes,” and “Tympanostomy tubes.” Text quality was assessed using the DISCERN instrument. Readability was assessed using the Flesch-Kincaid Grade Level, Flesch-Kincaid Reading Ease scores, and the Fry Readability Graph. The average DISCERN score for websites was 52 (SD = 13.1, Median = 55.5), out of 80. The mean Flesch-Kincaid Reading Grade Level was 8 (SD = 3, Median = 7.1), and the mean Flesch-Kincaid Reading Ease score was 55 (SD = 12.3, Median = 57.7). ChatGPT and Google Gemini’s “Condition” responses each had DISCERN scores of 46, Flesch-Kincaid Grade Levels of 13.1 and 9.5, and Reading Ease scores of 41 and 61. For “Investigation,” DISCERN scores were 46 (ChatGPT) and 66 (Google Gemini), Grade Levels were 13.9 and 12.4, and Reading Ease scores were 38.9 and 34.9. For “Treatment,” ChatGPT and Google Gemini had DISCERN scores of 45 and 34, Grade Levels of 15.7 and 9.8, and Reading Ease scores of 36.2 and 53.9. Sites and artificial intelligence providing patient education material regarding myringotomy tubes are of “fair” quality but have readability levels above the recommended 6th grade level. Google search results were superior to artificial intelligence in readability.