Article

Abstract

Objective: The assessment of written medical examinations is a tedious and expensive process, requiring significant amounts of time from medical experts. Our objective was to develop a natural language processing (NLP) system that can expedite the assessment of unstructured answers in medical examinations by automatically identifying relevant concepts in the examinee responses. Materials and methods: Our NLP system, Intelligent Clinical Text Evaluator (INCITE), is semi-supervised in nature. Learning from a limited set of fully annotated examples, it sequentially applies a series of customized text comparison and similarity functions to determine if a text span represents an entry in a given reference standard. Combinations of fuzzy matching and set intersection-based methods capture inexact matches and also fragmented concepts. Customizable, dynamic similarity-based matching thresholds allow the system to be tailored for examinee responses of different lengths. Results: INCITE achieved an average F1-score of 0.89 (precision = 0.87, recall = 0.91) against human annotations over held-out evaluation data. Fuzzy text matching, dynamic thresholding and the incorporation of supervision using annotated data resulted in the largest gains in performance. Discussion: Long and non-standard expressions are difficult for INCITE to detect, but the problem is mitigated by the use of dynamic thresholding (i.e., varying the similarity threshold for a text span to be considered a match). Annotation variations within exams and disagreements between annotators were the primary causes of false positives. Small amounts of annotated data can significantly improve system performance. Conclusions: The high performance and interpretability of INCITE will likely significantly aid the assessment process and also help mitigate the impact of manual assessment inconsistencies.
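As an illustration of the matching strategy described in the abstract (fuzzy similarity combined with set intersection and a length-dependent threshold), the following Python sketch shows one plausible reading; the function names, threshold values, and heuristics are assumptions and not INCITE's actual implementation.

from difflib import SequenceMatcher


def dynamic_threshold(reference: str, base: float = 0.85) -> float:
    # Assumed heuristic: relax the threshold slightly for longer reference concepts.
    return max(0.70, base - 0.02 * len(reference.split()))


def matches_reference(span: str, reference: str) -> bool:
    # Character-level fuzzy similarity captures inexact matches (e.g., misspellings).
    char_sim = SequenceMatcher(None, span.lower(), reference.lower()).ratio()
    # Token-set intersection captures fragmented or reordered concepts.
    span_tokens = set(span.lower().split())
    ref_tokens = set(reference.lower().split())
    token_overlap = len(span_tokens & ref_tokens) / max(len(ref_tokens), 1)
    return max(char_sim, token_overlap) >= dynamic_threshold(reference)


print(matches_reference("shortness of breathe", "shortness of breath"))   # True
print(matches_reference("chest pain", "myocardial infarction"))           # False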


... Kloehn et al. [24] generate explanations for complex medical terms in Spanish and English using WordNet synonyms and summaries, as well as word embeddings, as a source of knowledge. Sarker et al. [25] used fuzzy logic and set theory-based methods that learn from a limited number of annotated examples of unstructured answers in health examinations to recognize correct concepts, achieving an average F1-measure of 0.89. Zhou et al. [26] used deep learning models pretrained on general text sources to learn knowledge for information extraction from medical texts. ...
... Kloehn et al. [24] proposed a novel algorithm, SubSimplify, which improved quality in English and Spanish by providing multiword explanations for difficult terms; there is a possibility of the proposed model generating incomplete explanations. Sarker et al. [25] used a combination of fuzzy matching and intersection, which increased accuracy against human annotations but is unable to detect negations in expressions. (5) ranking of candidate answers. The implementation of the diagnosis system framework was done using the Python language due to the following functionalities: cross-platform support and high availability of third-party libraries for tasks relating to machine learning and NLP. The system uses Python library packages to access the machine learning and NLP functions needed for categorization. ...
Article
Full-text available
The use of natural language processing (NLP) methods and their application to developing conversational systems for health diagnosis increases patients' access to medical knowledge. In this study, a chatbot service was developed for the Covenant University Doctor (CUDoctor) telehealth system based on fuzzy logic rules and fuzzy inference. The service focuses on assessing the symptoms of tropical diseases in Nigeria. Telegram Bot Application Programming Interface (API) was used to create the interconnection between the chatbot and the system, while Twilio API was used for interconnectivity between the system and a short messaging service (SMS) subscriber. The service uses the knowledge base consisting of known facts on diseases and symptoms acquired from medical ontologies. A fuzzy support vector machine (SVM) is used to effectively predict the disease based on the symptoms inputted. The inputs of the users are recognized by NLP and are forwarded to the CUDoctor for decision support. Finally, a notification message displaying the end of the diagnosis process is sent to the user. The result is a medical diagnosis system which provides a personalized diagnosis utilizing self-input from users to effectively diagnose diseases. The usability of the developed system was evaluated using the system usability scale (SUS), yielding a mean SUS score of 80.4, which indicates the overall positive evaluation.
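As a rough, hypothetical sketch of the symptom-to-disease prediction step described above, the snippet below substitutes a plain scikit-learn SVM over a bag-of-symptoms encoding for the paper's fuzzy SVM and ontology-derived knowledge base; the diseases and symptom strings are invented toy examples.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Toy knowledge base: symptom descriptions paired with disease labels (illustrative only).
train_symptoms = [
    "fever headache chills sweating",
    "fever rash joint pain",
    "cough fever fatigue sore throat",
]
train_diseases = ["malaria", "dengue", "influenza"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_symptoms)
model = SVC(kernel="linear").fit(X, train_diseases)

# A user's free-text input is vectorized the same way before prediction.
user_input = "I have a fever with a bad headache and chills"
print(model.predict(vectorizer.transform([user_input]))[0])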
... On the other hand, inductive learning predicts the labels of unseen data using the labeled and unlabeled data provided during the training phase [12]. Semi-supervised learning methods have shown great success in areas such as image recognition and natural language processing [15][16][17][18][19], and they have been applied to a diverse set of problems in biomedical science including image classification [20][21][22][23][24] and medical language processing [25][26][27][28]. These methods have been applied to classification tasks using image and natural language data, which rely on spatial and semantic structure, i.e., the spatial correlations between pixels in images and sequential correlations between words in the text. ...
Article
Full-text available
Background Precision medicine for cancer treatment relies on an accurate pathological diagnosis. The number of known tumor classes has increased rapidly, and reliance on traditional methods of histopathologic classification alone has become unfeasible. To help reduce variability, validation costs, and standardize the histopathological diagnostic process, supervised machine learning models using DNA-methylation data have been developed for tumor classification. These methods require large labeled training data sets to obtain clinically acceptable classification accuracy. While there is abundant unlabeled epigenetic data across multiple databases, labeling pathology data for machine learning models is time-consuming and resource-intensive, especially for rare tumor types. Semi-supervised learning (SSL) approaches have been used to maximize the utility of labeled and unlabeled data for classification tasks and are effectively applied in genomics. SSL methods have not yet been explored with epigenetic data nor demonstrated to be beneficial for central nervous system (CNS) tumor classification. Results This paper explores the application of semi-supervised machine learning on methylation data to improve the accuracy of supervised learning models in classifying CNS tumors. We comprehensively evaluated 11 SSL methods and developed a novel combination approach that included a self-training with editing using support vector machine (SETRED-SVM) model and an L2-penalized, multinomial logistic regression model to obtain high confidence labels from a few labeled instances. Results across eight random forest and neural net models show that the pseudo-labels derived from our SSL method can significantly increase prediction accuracy for 82 CNS tumors and 9 normal controls. Conclusions The proposed combination of semi-supervised technique and multinomial logistic regression holds the potential to leverage the abundant publicly available unlabeled methylation data effectively. Such an approach is highly beneficial in providing additional training examples, especially for scarce tumor types, to boost the prediction accuracy of supervised models.
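The general pseudo-labeling idea described above (a base classifier confidently labels unlabeled samples, which then augment the training set) can be sketched with scikit-learn's self-training wrapper; this is a simplified stand-in, not the SETRED-SVM plus multinomial logistic regression pipeline used in the study, and the data below are synthetic.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                  # synthetic stand-in for methylation features
y_true = (X[:, 0] + X[:, 1] > 0).astype(int)
y_train = y_true.copy()
y_train[50:] = -1                               # -1 marks unlabeled samples

base = LogisticRegression(max_iter=1000)        # simple base learner for illustration
model = SelfTrainingClassifier(base, threshold=0.9).fit(X, y_train)

# transduction_ holds the labels (given or pseudo) assigned during self-training.
n_pseudo = int((model.transduction_[50:] != -1).sum())
print(f"pseudo-labels assigned to {n_pseudo} of 150 unlabeled samples")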
... In this section, we introduce the entity disambiguation module, which consists of named entity recognition and contrastive pre-training technologies. For named entity recognition, we implement the method of Sarker et al. (2019) to recognize medical entities in the utterance. We achieve an accuracy of 90.9% on the simple medical entity recognition dataset of IFLYTK, which ranks in the top 3 of the competition. ...
Preprint
The medical conversational system can relieve the burden of doctors and improve the efficiency of healthcare, especially during the pandemic. This paper presents a medical conversational question answering (CQA) system based on a multi-modal knowledge graph, namely "LingYi", which is designed as a pipeline framework to maintain high flexibility. Our system utilizes automated medical procedures including medical triage, consultation, image-text drug recommendation and record. To conduct knowledge-grounded dialogues with patients, we first construct a Chinese Medical Multi-Modal Knowledge Graph (CM3KG) and collect a large-scale Chinese Medical CQA (CMCQA) dataset. Compared with other existing medical question-answering systems, our system adopts several state-of-the-art technologies including medical entity disambiguation and medical dialogue generation, making it better suited to providing medical services to patients. In addition, we have open-sourced our code, which contains back-end models and front-end web pages, at https://github.com/WENGSYX/LingYi. The datasets, including CM3KG at https://github.com/WENGSYX/CM3KG and CMCQA at https://github.com/WENGSYX/CMCQA, are also released to further promote future research.
... There are options of model-based and lexicon-based polarity detection approaches, and we leverage the latter for the following reasons. 1) There is increasing demand for interpretability in the field of NLP (Belinkov et al., 2020;Sarker et al., 2019), and the lexicon-based approach is more interpretable (provides token-level human interpretable annotation) compared to black-box neural models. 2) In the context of framing bias, distinguishing the subtle nuance of words between synonyms is crucial (e.g., dead vs. murdered). ...
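To make the interpretability argument in the excerpt above concrete, here is a minimal lexicon-based polarity sketch with token-level output; the lexicon entries and weights are invented for illustration and are not from the cited work.

# Each token receives a human-interpretable polarity weight from a lexicon.
POLARITY_LEXICON = {"murdered": -0.9, "dead": -0.3, "riot": -0.7, "hero": 0.8}


def annotate_polarity(text):
    tokens = text.lower().split()
    scored = [(tok, POLARITY_LEXICON.get(tok, 0.0)) for tok in tokens]
    return scored, sum(score for _, score in scored)


token_scores, overall = annotate_polarity("protesters murdered during riot")
print(token_scores)   # per-token annotation, unlike a black-box neural score
print(overall)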
Preprint
Full-text available
Media framing bias can lead to increased political polarization, and thus, the need for automatic mitigation methods is growing. We propose a new task, a neutral summary generation from multiple news headlines of the varying political spectrum, to facilitate balanced and unbiased news reading. In this paper, we first collect a new dataset, obtain some insights about framing bias through a case study, and propose a new effective metric and models for the task. Lastly, we conduct experimental analyses to provide insights about remaining challenges and future directions. One of the most interesting observations is that generation models can hallucinate not only factually inaccurate or unverifiable content, but also politically biased content.
... Numerous studies have used NLP-based approaches to detect patterns in texts, such as for computational phenotyping from electronic health records, 9 detecting and disambiguating geographical entities, 10 finding measurements in radiology narratives 11 and detecting demographic information such as age and gender from patient notes. 12 While many studies employed manually-crafted patterns, others focused on creating specialized lexicons for entity recognition and extraction from noisy free texts, such as those from electronic health records and social media. [13][14][15] A major drawback of lexicon-based approaches is that lexicons are static and new lexicons need to be built for every new problem domain. ...
Article
Full-text available
Background Value sets are lists of terms (e.g., opioid medication names) and their corresponding codes from standard clinical vocabularies (e.g., RxNorm) created with the intent of supporting health information exchange and research. Value sets are manually-created and often exhibit errors. Objectives The aim of the study is to develop a semi-automatic, data-centric natural language processing (NLP) method to assess medication-related value set correctness and evaluate it on a set of opioid medication value sets. Methods We developed an NLP algorithm that utilizes value sets containing mostly true positives and true negatives to learn lexical patterns associated with the true positives, and then employs these patterns to identify potential errors in unseen value sets. We evaluated the algorithm on a set of opioid medication value sets, using the recall, precision and F1-score metrics. We applied the trained model to assess the correctness of unseen opioid value sets based on recall. To replicate the application of the algorithm in real-world settings, a domain expert manually conducted error analysis to identify potential system and value set errors. Results Thirty-eight value sets were retrieved from the Value Set Authority Center, and six (two opioid, four non-opioid) were used to develop and evaluate the system. Average precision, recall, and F1-score were 0.932, 0.904, and 0.909, respectively on uncorrected value sets; and 0.958, 0.953, and 0.953, respectively after manual correction of the same value sets. On 20 unseen opioid value sets, the algorithm obtained average recall of 0.89. Error analyses revealed that the main sources of system misclassifications were differences in how opioids were coded in the value sets—while the training value sets had generic names mostly, some of the unseen value sets had new trade names and ingredients. Conclusion The proposed approach is data-centric, reusable, customizable, and not resource intensive. It may help domain experts to easily validate value sets.
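A toy version of the data-centric idea described above might learn simple lexical patterns (here, just informative tokens) from known-correct value set entries and then flag unseen entries that match none of them; the token-level patterning and the example terms are assumptions for illustration, not the study's actual algorithm.

import re


def learn_patterns(true_positive_terms):
    # Collect lower-cased word tokens longer than three characters as crude lexical patterns.
    patterns = set()
    for term in true_positive_terms:
        patterns.update(tok for tok in re.findall(r"[a-z]+", term.lower()) if len(tok) > 3)
    return patterns


def flag_for_review(term, patterns):
    # Flag the entry if none of its tokens match a learned pattern.
    return not any(tok in patterns for tok in re.findall(r"[a-z]+", term.lower()))


known_correct = ["morphine sulfate 10 MG", "oxycodone hydrochloride 5 MG",
                 "fentanyl 25 MCG/HR transdermal system"]
patterns = learn_patterns(known_correct)
print(flag_for_review("oxycodone hydrochloride 15 MG", patterns))   # False: looks consistent
print(flag_for_review("ibuprofen 200 MG oral tablet", patterns))    # True: flagged for review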
... In general, medical dialogue methods can be divided into information retrieval-based methods and neural generative methods according to the types of the applied NLP techniques. The retrieval-based methods can be further classified into different subtypes, such as entity inference [12,13], relation prediction [14,15], symptom matching and extraction [16,17], and slot filling [18][19][20]. However, retrieval-based methods are not especially intelligent or flexible, as they require a well-defined, user-built question and answer (Q&A) pool that can offer different potential responses to different kinds of questions. ...
Article
Full-text available
Using natural language processing (NLP) technologies to develop medical chatbots makes the diagnosis of the patient more convenient and efficient, which is a typical application of healthcare AI. Because of its importance, many studies have been conducted. Recently, neural generative models have shown impressive ability as the core of chatbots, but they do not scale well when directly applied to medical conversation due to the lack of medical-specific knowledge. To address this limitation, a scalable medical knowledge-assisted mechanism (MKA) is proposed in this paper. The mechanism is aimed at assisting general neural generative models to achieve better performance on the medical conversation task. The medical-specific knowledge graph is designed within the mechanism, which contains 6 types of medical-related information, including department, drug, check, symptom, disease, and food. Besides, a specific token concatenation policy is defined to effectively inject medical information into the input data. Evaluation of our method is carried out on two typical medical datasets, MedDG and MedDialog-CN. The evaluation results demonstrate that models combined with our mechanism outperform the original methods on multiple automatic evaluation metrics. Besides, MKA-BERT-GPT achieves state-of-the-art performance.
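An illustrative sketch of a token concatenation policy for injecting knowledge graph triples into a dialogue model's input follows; the separator token and triple format are assumptions, not the exact MKA implementation.

def concat_knowledge(utterance, triples, sep="[KG]"):
    # Append linearized (head, relation, tail) triples after the user utterance.
    kg_text = " ".join(f"{sep} {h} {r} {t}" for h, r, t in triples)
    return f"{utterance} {kg_text}".strip()


triples = [("fever", "symptom_of", "influenza"), ("influenza", "checked_by", "blood test")]
print(concat_knowledge("I have had a fever for two days", triples))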
... The highlights show the global attention between the related articles and the query article, and the local attention within the query article. The weights are averaged and set to three scalar values, following [28], to make the visualization simple [29]. As depicted in Figure 1, the extraction of SARS-CoV-2 is due to the highly matched context about COVID-19 (the top related article) and the last sentence. ...
Article
Full-text available
The rapidly evolving literature of COVID-19 related articles makes it challenging for NLP models to be effectively trained for information retrieval and extraction with the corresponding labeled data that follows the current distribution of the pandemic. On the other hand, due to the uncertainty of the situation, human experts' supervision would always be required to double-check the decision-making of these models, highlighting the importance of interpretability. In light of these challenges, this study proposes an interpretable self-supervised multi-task learning model to jointly and effectively tackle the tasks of information retrieval (IR) and extraction (IE) during the current emergency health crisis situation. Our results show that our model effectively leverages multi-task and self-supervised learning to improve generalization, data efficiency, and robustness to the ongoing dataset shift problem. Our model outperforms baselines in IE and IR tasks, respectively by a micro-f score of 0.08 (LCA-F score of 0.05), and MAP of 0.05 on average. In IE, the zero- and few-shot learning performances are on average 0.32 and 0.19 micro-f score higher than those of the baselines.
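The citing excerpt above mentions averaging attention weights and reducing them to three scalar values for visualization; a rough sketch of that kind of bucketing is shown below, where the thresholds and token weights are assumptions.

import numpy as np


def bucket_attention(weights, thresholds=(0.33, 0.66)):
    # Min-max normalize, then map each weight to 0 (no highlight), 1 (weak), or 2 (strong).
    w = np.asarray(weights, dtype=float)
    w = (w - w.min()) / (w.max() - w.min() + 1e-9)
    return np.digitize(w, thresholds)


tokens = ["SARS-CoV-2", "was", "first", "reported", "in", "December"]
weights = [0.91, 0.05, 0.10, 0.55, 0.03, 0.40]
print(list(zip(tokens, bucket_attention(weights).tolist())))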
Article
Background: Residents receive infrequent feedback on their clinical reasoning (CR) documentation. While machine learning (ML) and natural language processing (NLP) have been used to assess CR documentation in standardized cases, no studies have described similar use in the clinical environment. Objective: The authors developed and, using Kane's framework, validated an ML model for automated assessment of CR documentation quality in residents' admission notes. Design, participants, main measures: Internal medicine residents' and subspecialty fellows' admission notes at one medical center from July 2014 to March 2020 were extracted from the electronic health record. Using a validated CR documentation rubric, the authors rated 414 notes for the ML development dataset. Notes were truncated to isolate the relevant portion; NLP software (cTAKES) extracted disease/disorder named entities, and human review generated CR terms. The final model had three input variables and classified notes as demonstrating low- or high-quality CR documentation. The ML model was applied to a retrospective dataset (9591 notes) for human validation and data analysis. Reliability between human and ML ratings was assessed on 205 of these notes with Cohen's kappa. CR documentation quality by post-graduate year (PGY) was evaluated by the Mantel-Haenszel test of trend. Key results: The top-performing logistic regression model had an area under the receiver operating characteristic curve of 0.88, a positive predictive value of 0.68, and an accuracy of 0.79. Cohen's kappa was 0.67. Of the 9591 notes, 31.1% demonstrated high-quality CR documentation; quality increased from 27.0% (PGY1) to 31.0% (PGY2) to 39.0% (PGY3) (p < .001 for trend). Validity evidence was collected in each domain of Kane's framework (scoring, generalization, extrapolation, and implications). Conclusions: The authors developed and validated a high-performing ML model that classifies CR documentation quality in resident admission notes in the clinical environment, a novel application of ML and NLP with many potential use cases.
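The model structure described above (three input variables feeding a logistic regression that labels notes as low- or high-quality) can be sketched as follows; the feature definitions and numbers are hypothetical, and cTAKES entity extraction is replaced by precomputed counts.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed features per note: [disease entity count, CR term count, note length in thousands of characters].
X_train = np.array([[3, 0, 1.2], [8, 5, 2.5], [2, 1, 0.9], [10, 7, 3.1], [4, 2, 1.5], [9, 6, 2.8]])
y_train = np.array([0, 1, 0, 1, 0, 1])          # 1 = high-quality CR documentation

clf = LogisticRegression().fit(X_train, y_train)

new_note = np.array([[6, 4, 2.0]])
label = "high-quality" if clf.predict(new_note)[0] == 1 else "low-quality"
print(label, clf.predict_proba(new_note)[0])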
Article
The capabilities of natural language processing (NLP) methods have expanded significantly in recent years, and progress has been particularly driven by advances in data science and machine learning. However, NLP is still largely underused in patient-oriented clinical research and care (POCRC). A key reason behind this is that clinical NLP methods are typically developed, optimized, and evaluated with narrowly focused data sets and tasks (eg, those for the detection of specific symptoms in free texts). Such research and development (R&D) approaches may be described as problem oriented, and the developed systems perform specialized tasks well. As standalone systems, however, they generally do not comprehensively meet the needs of POCRC. Thus, there is often a gap between the capabilities of clinical NLP methods and the needs of patient-facing medical experts. We believe that to increase the practical use of biomedical NLP, future R&D efforts need to be broadened to a new research paradigm-one that explicitly incorporates characteristics that are crucial for POCRC. We present our viewpoint about 4 such interrelated characteristics that can increase NLP systems' suitability for POCRC (3 that represent NLP system properties and 1 associated with the R&D process)-(1) interpretability (the ability to explain system decisions), (2) patient centeredness (the capability to characterize diverse patients), (3) customizability (the flexibility for adapting to distinct settings, problems, and cohorts), and (4) multitask evaluation (the validation of system performance based on multiple tasks involving heterogeneous data sets). By using the NLP task of clinical concept detection as an example, we detail these characteristics and discuss how they may result in the increased uptake of NLP systems for POCRC.
Article
The practice of medicine is changing rapidly as a consequence of electronic health record adoption, new technologies for patient care, disruptive innovations that break down professional hierarchies, and evolving societal norms. Collectively, these have resulted in the modification of the physician's role as the gatekeeper for health care, increased shift-based care, and amplified interprofessional team-based care. Technological innovations present opportunities as well as challenges. Artificial intelligence, which has great potential, has already transformed some tasks, particularly those involving image interpretation. Ubiquitous access to information via the internet by physicians and patients alike presents benefits as well as drawbacks: patients and providers have ready access to virtually all of human knowledge, but some websites are contaminated with misinformation and many people have difficulty differentiating between solid, evidence-based data and untruths. The role of the future physician will shift as complexity in health care increases and as artificial intelligence and other technologies advance. These technological advances demand new skills of physicians; memory and knowledge accumulation will diminish in importance while information management skills will become more important. In parallel, medical educators must enhance their teaching and assessment of critical human skills (e.g., clear communication, empathy) in the delivery of patient care. The authors emphasize the enduring role of critical human skills in safe and effective patient care even as medical practice is increasingly guided by artificial intelligence and related technology, and they suggest new and longitudinal ways of assessing essential non-cognitive skills to meet the demands of the future. The authors envision practical and achievable benefits accruing to patients and providers if practitioners leverage technological advancements to facilitate the development of their critical human skills.
Conference Paper
Full-text available
We propose a new shared task on grading student answers with the goal of enabling well-targeted and flexible feedback in a tutorial dialogue setting. We provide an annotated corpus designed for the purpose, a precise specification for a prediction task and an associated evaluation methodology. The task is feasible but non-trivial, which is demonstrated by creating and comparing three alternative baseline systems. We believe that this corpus will be of interest to the researchers working in textual entailment and will stimulate new developments both in natural language processing in tutorial dialogue systems and textual entailment, contradiction detection and other techniques of interest for a variety of computational linguistics tasks.
Article
Full-text available
Narrative reports in medical records contain a wealth of information that may augment structured data for managing patient information and predicting trends in diseases. Pertinent negatives are evident in text but are not usually indexed in structured databases. The objective of the study reported here was to test a simple algorithm for determining whether a finding or disease mentioned within narrative medical reports is present or absent. We developed a simple regular expression algorithm called NegEx that implements several phrases indicating negation, filters out sentences containing phrases that falsely appear to be negation phrases, and limits the scope of the negation phrases. We compared NegEx against a baseline algorithm that has a limited set of negation phrases and a simpler notion of scope. In a test of 1235 findings and diseases in 1000 sentences taken from discharge summaries indexed by physicians, NegEx had a specificity of 94.5% (versus 85.3% for the baseline), a positive predictive value of 84.5% (versus 68.4% for the baseline) while maintaining a reasonable sensitivity of 77.8% (versus 88.3% for the baseline). We conclude that with little implementation effort a simple regular expression algorithm for determining whether a finding or disease is absent can identify a large portion of the pertinent negatives from discharge summaries.
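A minimal NegEx-style sketch is shown below, with a handful of negation triggers and a fixed forward scope; the trigger list and six-token window are illustrative simplifications of the published algorithm.

import re

NEGATION_TRIGGERS = r"\b(no|denies|denied|without|absence of|negative for)\b"
SCOPE_TOKENS = 6   # assumed scope: the finding must appear within six tokens after the trigger


def is_negated(sentence, finding):
    for match in re.finditer(NEGATION_TRIGGERS, sentence, flags=re.IGNORECASE):
        window = " ".join(sentence[match.end():].split()[:SCOPE_TOKENS])
        if finding.lower() in window.lower():
            return True
    return False


print(is_negated("The patient denies chest pain or shortness of breath.", "chest pain"))  # True
print(is_negated("Chest pain began two hours ago.", "chest pain"))                        # False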
Conference Paper
Full-text available
Traditionally, automatic marking has been restricted to item types such as multiple choice that narrowly constrain how students may respond. More open-ended items have generally been considered unsuitable for machine marking because of the difficulty of coping with the myriad ways in which credit-worthy answers may be expressed. Successful automatic marking of free text answers would seem to presuppose an advanced level of performance in automated natural language understanding. However, recent advances in computational linguistics techniques have opened up the possibility of being able to automate the marking of free text responses typed into a computer without having to create systems that fully understand the answers. This paper describes the use of information extraction and machine learning techniques in the marking of short, free text responses of up to around five lines.
Article
Full-text available
We present an approach to Computer-Assisted Assessment of free-text material based on symbolic analysis of student input. The theory that underlies this approach arises from previous work on DidaLect, a tutorial system for reading comprehension in French as a Second Language. The theory enables processing of a free-text segment for assessment to operate without precoded reference material. A study based on a small collection of student answers to several types of questions has justified our approach, and helped to define a methodology and design a prototype.
Article
Full-text available
This is a conference paper. Automated essay scoring with latent semantic analysis (LSA) has recently been subject to increasing interest. Although previous authors have achieved grade ranges similar to those awarded by humans, it is still not clear which and how parameters improve or decrease the effectiveness of LSA. This paper presents an analysis of the effects of these parameters, such as text pre-processing, weighting, singular value dimensionality and type of similarity measure, and benchmarks this effectiveness by comparing machine-assigned with human-assigned scores in a real-world case. We show that each of the identified factors significantly influences the quality of automated essay scoring and that the factors are not independent of each other.
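The parameters studied above (weighting, singular value dimensionality, and similarity measure) correspond to choices like the ones in this hypothetical LSA scoring sketch; the corpus, the two-dimensional latent space, and the nearest-reference scoring rule are toy assumptions, not the paper's setup.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

reference_essays = [
    "photosynthesis converts light energy into chemical energy in the chloroplast",
    "plants use sunlight water and carbon dioxide to make glucose and oxygen",
    "the cell membrane controls what enters and leaves the cell",
]
reference_scores = [5, 4, 1]

vectorizer = TfidfVectorizer(stop_words="english")      # weighting choice
X = vectorizer.fit_transform(reference_essays)
svd = TruncatedSVD(n_components=2, random_state=0)      # singular value dimensionality choice
X_lsa = svd.fit_transform(X)

student = svd.transform(vectorizer.transform(["plants turn sunlight and water into glucose"]))
similarities = cosine_similarity(student, X_lsa)[0]     # similarity measure choice
print("predicted score:", reference_scores[similarities.argmax()])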
Article
Full-text available
This is a conference paper. This paper describes and exemplifies an application of AutoMark, a software system developed in pursuit of robust computerised marking of free-text answers to open-ended questions. AutoMark employs the techniques of Information Extraction to provide computerised marking of short free-text responses. The system incorporates a number of processing modules specifically aimed at providing robust marking in the face of errors in spelling, typing, syntax, and semantics. AutoMark looks for specific content within free-text answers, the content being specified in the form of a number of mark scheme templates. Each template represents one form of a valid (or a specifically invalid) answer. Student answers are first parsed, and then intelligently matched against each mark scheme template, and a mark for each answer is computed. The representation of the templates is such that they can be robustly mapped to multiple variations in the input text. The current paper describes AutoMark for the first time, and presents the results of a brief quantitative and qualitative study of the performance of the system in marking a range of free-text responses in one of the most demanding domains: statutory national curriculum assessment of science for pupils at age 11. This particular domain has been chosen to help identify the strengths and weaknesses of the current system in marking responses where errors in spelling, syntax, and semantics are at their most frequent. Four items of varying degrees of open-endedness were selected from the 1999 tests. These items are drawn from the real-world of so-called ‘high stakes’ testing experienced by cohorts of over half a million pupils in England each year since 1995 at ages 11 and 14. A quantitative and qualitative study of the performance of the system is provided, together with a discussion of the potential for further development in reducing these errors. The aim of this exploration is to reveal some of the issues which need to be addressed if computerised marking is to play any kind of reliable role in the future development of such test regimes.
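A very rough sketch of template-style marking in the spirit of the description above: each mark scheme template is written here as a regular expression over a normalized answer, whereas AutoMark itself parses answers and matches richer structured templates; the templates and credit values are invented.

import re

# Assumed mark scheme: full credit needs both key ideas, partial credit needs either one.
MARK_SCHEME = [
    (r"\bevaporat\w*\b.*\bcondens\w*\b", 2),
    (r"\b(evaporat\w*|condens\w*)\b", 1),
]


def mark(answer):
    normalized = re.sub(r"[^a-z ]", " ", answer.lower())
    for pattern, credit in MARK_SCHEME:
        if re.search(pattern, normalized):
            return credit
    return 0


print(mark("The water evaporates, and then it condenses on the cold glass"))  # 2
print(mark("the water just evaporates"))                                      # 1
print(mark("it gets hotter"))                                                 # 0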
Article
Constructed response items can both measure the coherence of student ideas and serve as reflective experiences to strengthen instruction. We report on new automated scoring technologies that can reduce the cost and complexity of scoring constructed-response items. This study explored the accuracy of c-rater-ML, an automated scoring engine developed by Educational Testing Service, for scoring eight science inquiry items that require students to use evidence to explain complex phenomena. Automated scoring showed satisfactory agreement with human scoring for all test takers as well as specific subgroups. These findings suggest that c-rater-ML offers a promising solution to scoring constructed-response science items and has the potential to increase the use of these items in both instruction and assessment.
Article
Content-based automated scoring has been applied in a variety of science domains. However, many prior applications involved simplified scoring rubrics without considering rubrics representing multiple levels of understanding. This study tested a concept-based scoring tool for content-based scoring, c-rater™, for four science items with rubrics aiming to differentiate among multiple levels of understanding. The items showed moderate to good agreement with human scores. The findings suggest that automated scoring has the potential to score constructed-response items with complex scoring rubrics, but in its current design cannot replace human raters. This article discusses sources of disagreement and factors that could potentially improve the accuracy of concept-based automated scoring.
Article
C-rater is an automated scoring engine that has been developed to score responses to content-based short answer questions. It is not simply a string matching program – instead it uses predicate argument structure, pronominal reference, morphological analysis and synonyms to assign full or partial credit to a short answer question. C-rater has been used in two studies: National Assessment for Educational Progress (NAEP) and a statewide assessment in Indiana. In both studies, c-rater agreed with human graders about 84% of the time.
Article
A framework for evaluation and use of automated scoring of constructed-response tasks is provided that entails both evaluation of automated scoring as well as guidelines for implementation and maintenance in the context of constantly evolving technologies. Consideration of validity issues and challenges associated with automated scoring are discussed within the framework. The fit between the scoring capability and the assessment purpose, the agreement between human and automated scores, the consideration of associations with independent measures, the generalizability of automated scores as implemented in operational practice across different tasks and test forms, and the impact and consequences for the population and subgroups are proffered as integral evidence supporting use of automated scoring. Specific evaluation guidelines are provided for using automated scoring to complement human scoring for tests used for high-stakes purposes. These guidelines are intended to be generalizable to new automated scoring systems and as existing systems change over time.
Article
This paper presents a description and evaluation of SpeechRater℠, a system for automated scoring of non-native speakers' spoken English proficiency, based on tasks which elicit spontaneous monologues on particular topics. This system builds on much previous work in the automated scoring of test responses, but differs from previous work in that the highly unpredictable nature of the responses to this task type makes the challenge of accurate scoring much more difficult. SpeechRater uses a three-stage architecture. Responses are first processed by a filtering model to ensure that no exceptional conditions exist which might prevent them from being scored by SpeechRater. Responses not filtered out at this stage are then processed by the scoring model to estimate the proficiency rating which a human might assign to them, on the basis of features related to fluency, pronunciation, vocabulary diversity, and grammar. Finally, an aggregation model combines an examinee's scores for multiple items to calculate a total score, as well as an interval in which the examinee's score is predicted to reside with high confidence. SpeechRater's current level of accuracy and construct representation have been deemed sufficient for low-stakes practice exercises, and it has been used in a practice exam for the TOEFL since late 2006. In such a practice environment, it offers a number of advantages compared to human raters, including system load management, and the facilitation of immediate feedback to students. However, it must be acknowledged that SpeechRater presently fails to measure many important aspects of speaking proficiency (such as intonation and appropriateness of topic development), and its agreement with human ratings of proficiency does not yet approach the level of agreement between two human raters.
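The three-stage architecture described above (filtering, scoring, aggregation) can be schematized as below; the feature names, filtering rules, and weights are invented for illustration and bear no relation to SpeechRater's actual models.

def filtering_model(response):
    # Reject responses that cannot be scored (e.g., too short to contain usable speech).
    return response["duration_sec"] >= 5 and response["word_count"] > 3


def scoring_model(response):
    # Toy proficiency estimate from fluency- and vocabulary-like features.
    return 1.0 + 2.0 * response["words_per_sec"] + 1.5 * response["type_token_ratio"]


def aggregation_model(item_scores):
    # Combine per-item scores into a total plus a naive confidence interval.
    total = sum(item_scores) / len(item_scores)
    return total, (total - 0.5, total + 0.5)


items = [
    {"duration_sec": 45, "word_count": 90, "words_per_sec": 2.0, "type_token_ratio": 0.60},
    {"duration_sec": 40, "word_count": 70, "words_per_sec": 1.75, "type_token_ratio": 0.55},
]
scored = [scoring_model(r) for r in items if filtering_model(r)]
print(aggregation_model(scored))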
Conference Paper
Automatic content scoring for free-text responses has started to emerge as an application of Natural Language Processing in its own right, much like question answering or machine translation. The task, in general, is reduced to comparing a student's answer to a model answer. Although a considerable amount of work has been done, common benchmarks and evaluation measures for this application do not currently exist. It is yet impossible to perform a comparative evaluation or progress tracking of this application across systems - an application that we view as a textual entailment task. This paper concentrates on introducing an Educational Testing Service-built test suite that makes a step towards establishing such a benchmark. The suite can be used as regression and performance evaluations both intra-c-rater ® or inter automatic content scoring technologies. It is important to note that existing textual entailment test suites like PASCAL RTE or FraCas, though beneficial, are not suitable for our purposes since we deal with atypical naturally-occurring student responses that need to be categorized in order to serve as regression test cases.
Article
Introduced the statistic kappa to measure nominal scale agreement between a fixed pair of raters. Kappa was generalized to the case where each of a sample of 30 patients was rated on a nominal scale by the same number of psychiatrist raters (n = 6), but where the raters rating one subject were not necessarily the same as those rating another. Large sample standard errors were derived.
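For reference, the chance-corrected agreement statistic discussed here takes the standard form, with observed agreement p_o corrected for chance agreement p_e; this is the textbook expression, not a quotation from the article:

\kappa = \frac{p_o - p_e}{1 - p_e}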
Article
In this paper we describe an algorithm called ConText for determining whether clinical conditions mentioned in clinical reports are negated, hypothetical, historical, or experienced by someone other than the patient. The algorithm infers the status of a condition with regard to these properties from simple lexical clues occurring in the context of the condition. The discussion and evaluation of the algorithm presented in this paper address the questions of whether a simple surface-based approach which has been shown to work well for negation can be successfully transferred to other contextual properties of clinical conditions, and to what extent this approach is portable among different clinical report types. In our study we find that ConText obtains reasonable to good performance for negated, historical, and hypothetical conditions across all report types that contain such conditions. Conditions experienced by someone other than the patient are very rarely found in our report set. A comprehensive solution to the problem of determining whether a clinical condition is historical or recent requires knowledge above and beyond the surface clues picked up by ConText.
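A toy ConText-style sketch follows, in which a few lexical triggers assign a contextual property to a condition that appears shortly after the trigger; the trigger lists and window size are illustrative only and far smaller than the published algorithm's.

import re

CONTEXT_TRIGGERS = {
    "denies": "negated",
    "no evidence of": "negated",
    "history of": "historical",
    "if": "hypothetical",
    "father has": "experienced by other",
}


def condition_status(sentence, condition, scope_tokens=6):
    text = sentence.lower()
    pos = text.find(condition.lower())
    if pos == -1:
        return "not mentioned"
    # Look back over a short window of tokens preceding the condition mention.
    window = " ".join(text[:pos].split()[-scope_tokens:])
    for trigger, status in CONTEXT_TRIGGERS.items():
        if re.search(r"\b" + re.escape(trigger) + r"\b", window):
            return status
    return "affirmed, recent, experienced by patient"


print(condition_status("History of myocardial infarction in 2005.", "myocardial infarction"))
print(condition_status("Return promptly if fever develops.", "fever"))
print(condition_status("Patient reports severe headache.", "headache"))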
Article
An essay-based discourse analysis system can help students improve their writing by identifying relevant essay-based discourse elements in their essays. Our discourse analysis software, which is embedded in Criterion, an online essay evaluation application, uses machine learning to identify discourse elements in student essays. The system makes decisions that exemplify how teachers perform this task. For instance, when grading student essays, teachers comment on the discourse structure. Teachers might explicitly state that the essay lacks a thesis statement or that an essay's single main idea has insufficient support. Training the systems to model this behavior requires human judges to annotate a data sample of student essays. The annotation schema reflects the highly structured discourse of genres such as persuasive writing. Our discourse analysis system uses a voting algorithm that takes into account the discourse labeling decisions of three independent systems.
A tale of two models: psychometric and cognitive perspectives on rater-mediated assessments using accuracy ratings
  • Engelhard
Automated writing evaluation: an expanding body of knowledge
  • Shermis