Article

Abstract

Objective: The assessment of written medical examinations is a tedious and expensive process, requiring significant amounts of time from medical experts. Our objective was to develop a natural language processing (NLP) system that can expedite the assessment of unstructured answers in medical examinations by automatically identifying relevant concepts in the examinee responses. Materials and methods: Our NLP system, Intelligent Clinical Text Evaluator (INCITE), is semi-supervised in nature. Learning from a limited set of fully annotated examples, it sequentially applies a series of customized text comparison and similarity functions to determine if a text span represents an entry in a given reference standard. Combinations of fuzzy matching and set intersection-based methods capture inexact matches and also fragmented concepts. Customizable, dynamic similarity-based matching thresholds allow the system to be tailored for examinee responses of different lengths. Results: INCITE achieved an average F1-score of 0.89 (precision = 0.87, recall = 0.91) against human annotations over held-out evaluation data. Fuzzy text matching, dynamic thresholding and the incorporation of supervision using annotated data resulted in the biggest jumps in performance. Discussion: Long and non-standard expressions are difficult for INCITE to detect, but the problem is mitigated by the use of dynamic thresholding (i.e., varying the similarity threshold for a text span to be considered a match). Annotation variations within exams and disagreements between annotators were the primary causes of false positives. Small amounts of annotated data can significantly improve system performance. Conclusions: The high performance and interpretability of INCITE will likely significantly aid the assessment process and also help mitigate the impact of manual assessment inconsistencies.
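
To make the matching strategy described in the abstract concrete, the sketch below combines character-level fuzzy matching with a similarity threshold that loosens as the reference concept gets longer. It is purely illustrative: the difflib-based similarity measure, the token-window search, and the threshold values are assumptions, not the published INCITE implementation.

```python
# Illustrative sketch of dynamic-threshold fuzzy concept matching
# (not the INCITE source code; thresholds and matcher are hypothetical).
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def dynamic_threshold(concept: str) -> float:
    """Longer reference concepts tolerate more variation (assumed values)."""
    n_tokens = len(concept.split())
    if n_tokens <= 2:
        return 0.95
    if n_tokens <= 5:
        return 0.85
    return 0.75

def find_concept(concept: str, response: str) -> bool:
    """Slide a window of the concept's token length over the response and
    report a match if any window clears the dynamic threshold."""
    tokens = response.split()
    width = max(len(concept.split()), 1)
    threshold = dynamic_threshold(concept)
    for i in range(len(tokens) - width + 1):
        span = " ".join(tokens[i:i + width])
        if similarity(concept, span) >= threshold:
            return True
    return False

print(find_concept("shortness of breath", "pt reports shortnes of breath on exertion"))  # True
```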


... A limitation on the resource side is that automatized assessment is often not feasible for these kinds of questions. Intensive manual evaluation is required, with human effort related to the need for expert knowledge in the field and a specific understanding of marking and grading criteria [8,9]. Given the repetitive nature of the task, manual assessment also is highly time-consuming and potentially error-prone [8,9]. ...
... By applying various terms, synonyms, and language concepts, NLP transfers unstructured language information into a standardized form [10] that can be used to build a decision-support system [11]. Previous findings have indicated possible applications of NLP in various tasks ranging from clinical [12][13][14] to educational settings [8,[15][16][17]. ...
Article
Full-text available
Background Written medical examinations consist of multiple-choice questions and/or free-text answers. The latter require manual evaluation and rating, which is time-consuming and potentially error-prone. We tested whether natural language processing (NLP) can be used to automatically analyze free-text answers to support the review process. Methods The European Board of Radiology of the European Society of Radiology provided representative datasets comprising sample questions, answer keys, participant answers, and reviewer markings from European Diploma in Radiology examinations. Three free-text questions with the highest number of corresponding answers were selected: Questions 1 and 2 were “unstructured” and required a typical free-text answer whereas question 3 was “structured” and offered a selection of predefined wordings/phrases for participants to use in their free-text answer. The NLP engine was designed using word lists, rule-based synonyms, and decision tree learning based on the answer keys and its performance tested against the gold standard of reviewer markings. Results After implementing the NLP approach in Python, F1 scores were calculated as a measure of NLP performance: 0.26 (unstructured question 1, n = 96), 0.33 (unstructured question 2, n = 327), and 0.5 (more structured question, n = 111). The respective precision/recall values were 0.26/0.27, 0.4/0.32, and 0.62/0.55. Conclusion This study showed the successful design of an NLP-based approach for automatic evaluation of free-text answers in the EDiR examination. Thus, as a future field of application, NLP could work as a decision-support system for reviewers and support the design of examinations being adjusted to the requirements of an automated, NLP-based review process. Clinical relevance statement Natural language processing can be successfully used to automatically evaluate free-text answers, performing better with more structured question-answer formats. Furthermore, this study provides a baseline for further work applying, e.g., more elaborated NLP approaches/large language models. Key points • Free-text answers require manual evaluation, which is time-consuming and potentially error-prone. • We developed a simple NLP-based approach — requiring only minimal effort/modeling — to automatically analyze and mark free-text answers. • Our NLP engine has the potential to support the manual evaluation process. • NLP performance is better on a more structured question-answer format. Graphical Abstract
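
As a rough idea of how a word-list/synonym engine of this kind can be scored against reviewer markings, the following sketch matches hypothetical answer-key concepts via synonym lists and computes an F1 score. The concepts, synonyms, and student answer are invented placeholders, not EDiR material.

```python
# Minimal word-list/synonym scoring sketch with F1 evaluation
# (illustrative only; concepts, synonyms, and labels are invented).
answer_key = {
    "pneumothorax": {"pneumothorax", "collapsed lung"},
    "chest drain":  {"chest drain", "chest tube", "thoracostomy"},
}

def score_answer(text: str) -> set[str]:
    """Return the key concepts whose synonyms appear in the free-text answer."""
    text = text.lower()
    return {concept for concept, synonyms in answer_key.items()
            if any(s in text for s in synonyms)}

def f1(predicted: set[str], gold: set[str]) -> float:
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

pred = score_answer("Left-sided collapsed lung; insert a chest tube urgently.")
print(pred, f1(pred, {"pneumothorax", "chest drain"}))  # both concepts found, F1 = 1.0
```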
... Therefore, in Step 2 Clinical Skills, examinees' written patient notes were assessed manually by experienced physician raters. More than 30,000 examinees took this examination each year, resulting in more than 330,000 patient notes that were graded by more than 100 raters (Sarker et al., 2019). The case-specific nature of the patient notes and large volume of exams make the human scoring process time-consuming and tedious. ...
... NLP has been applied to automatically process health documents, including assessing practical clinical content from patient notes (Latifi et al., 2016;Sarker et al., 2019). Specifically, patient notes after simulated patient encounters are required to contain specific information, which is specified by items in a checklist created through faculty consensus. ...
... Despite its importance, the task of automatic grading of patient notes remains under-explored with only a few works that have studied it (Yim et al., 2019;Sarker et al., 2019). Traditional supervised models have been utilized for this task (Latifi et al., 2016;Yim et al., 2019), but are limited in scope because they rely on large scale annotated datasets. ...
... 21 While use of ML and NLP to improve documentation and differential diagnosis generation have been suggested, there are limited reports of implementation in this domain. [17][18][19][21][22][23][24][25][26][27][28][29][30][31] To evaluate notes in the United States Medical Licensing Examination Step 2 Clinical Skills Exam, an NLP-based assessment was developed to detect presence of essential required concepts in notes pre-determined by a committee of physicians. 22,23 Similarly, Cianciolo et al. developed an NLP-based ML model to score medical student notes for standardized patient encounters. ...
... [17][18][19][21][22][23][24][25][26][27][28][29][30][31] To evaluate notes in the United States Medical Licensing Examination Step 2 Clinical Skills Exam, an NLP-based assessment was developed to detect presence of essential required concepts in notes pre-determined by a committee of physicians. 22,23 Similarly, Cianciolo et al. developed an NLP-based ML model to score medical student notes for standardized patient encounters. 31 To give feedback on differential diagnosis, Khumrin et al. developed a ML model that predicts the likelihood of a diagnosis on the basis of documented clinical observations. ...
... We developed and collected validity evidence with Kane's framework for an NLP-based ML model to automatically classify CR documentation quality in resident admission notes. This study goes beyond prior work using ML and NLP to assess CR documentation in standardized cases as our model is applied in the clinical environment with a wide range of chief concerns and is not dependent on a preset list of clinical information [22][23][24][25]31 -the first study to our knowledge to do so. We found at our institution low overall levels of CR documentation quality, similar to what has been widely reported. ...
Article
Background: Residents receive infrequent feedback on their clinical reasoning (CR) documentation. While machine learning (ML) and natural language processing (NLP) have been used to assess CR documentation in standardized cases, no studies have described similar use in the clinical environment. Objective: The authors developed and validated using Kane's framework a ML model for automated assessment of CR documentation quality in residents' admission notes. Design, participants, main measures: Internal medicine residents' and subspecialty fellows' admission notes at one medical center from July 2014 to March 2020 were extracted from the electronic health record. Using a validated CR documentation rubric, the authors rated 414 notes for the ML development dataset. Notes were truncated to isolate the relevant portion; an NLP software (cTAKES) extracted disease/disorder named entities and human review generated CR terms. The final model had three input variables and classified notes as demonstrating low- or high-quality CR documentation. The ML model was applied to a retrospective dataset (9591 notes) for human validation and data analysis. Reliability between human and ML ratings was assessed on 205 of these notes with Cohen's kappa. CR documentation quality by post-graduate year (PGY) was evaluated by the Mantel-Haenszel test of trend. Key results: The top-performing logistic regression model had an area under the receiver operating characteristic curve of 0.88, a positive predictive value of 0.68, and an accuracy of 0.79. Cohen's kappa was 0.67. Of the 9591 notes, 31.1% demonstrated high-quality CR documentation; quality increased from 27.0% (PGY1) to 31.0% (PGY2) to 39.0% (PGY3) (p < .001 for trend). Validity evidence was collected in each domain of Kane's framework (scoring, generalization, extrapolation, and implications). Conclusions: The authors developed and validated a high-performing ML model that classifies CR documentation quality in resident admission notes in the clinical environment-a novel application of ML and NLP with many potential use cases.
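
The study reports discrimination (AUROC) and human-machine agreement (Cohen's kappa) for a logistic-regression classifier with three input variables. The sketch below shows how those two figures are typically computed with scikit-learn, using synthetic stand-ins for the features and labels rather than the authors' data or code.

```python
# Sketch of a small logistic-regression quality classifier and the
# agreement/discrimination metrics reported above (synthetic data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, cohen_kappa_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))            # stand-ins for the three input variables
y = (X @ np.array([1.5, 1.0, -0.5]) + rng.normal(size=400) > 0).astype(int)

model = LogisticRegression().fit(X[:300], y[:300])
probs = model.predict_proba(X[300:])[:, 1]   # predicted probability of high quality
preds = model.predict(X[300:])

print("AUROC:", roc_auc_score(y[300:], probs))
print("Cohen's kappa vs. reference labels:", cohen_kappa_score(y[300:], preds))
```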
... (2) Explainability can create new use cases for legal outcome prediction models. Knowing how to reason towards a desired legal outcome is valuable for drafting legal arguments and, thus, a legal outcome prediction system could be deployed as an assistive technology in areas well beyond simply predicting outcomes (Sarker et al., 2019). ...
... In the wider NLP context, there is a considerable amount of research aimed at developing methods for unraveling what LMs know about language (Alain and Bengio, 2016;Shi et al., 2016;Ettinger et al., 2016;Bisazza and Tump, 2018;Liu et al., 2019;Pimentel et al., 2020). Sub-fields of NLP dealing with sensitive information, such as biomedical NLP (Garcia-Olano et al., 2021;Jha et al., 2018;Moradi and Samwald, 2021;Sarker et al., 2019;Mullenbach et al., 2018), have set out to build their own explainable AI models to address the concerns about applying a black-box model to sensitive data. ...
... For advance directives, their sensitivity was 69% and their PPV was 100%, and for mental status, their sensitivity was 100% and their PPV was 93%. Sarker et al [20] used a semisupervised NLP method to assess students' free-text notes. Their accuracy over 21 cases and 105 notes was a sensitivity of 0.91 and a PPV of 0.87. ...
... If NLP programs are to be used to automate the grading of students' notes, they must achieve an acceptable accuracy. Sarker et al [20] suggested that any method of scoring medical notes should achieve an accuracy close to 100%. Regrettably, none of the reported medical education NLPs achieved an acceptable accuracy. ...
Article
Full-text available
Background Teaching medical students the skills required to acquire, interpret, apply, and communicate clinical information is an integral part of medical education. A crucial aspect of this process involves providing students with feedback regarding the quality of their free-text clinical notes. Objective The goal of this study was to assess the ability of ChatGPT 3.5, a large language model, to score medical students’ free-text history and physical notes. Methods This is a single-institution, retrospective study. Standardized patients learned a prespecified clinical case and, acting as the patient, interacted with medical students. Each student wrote a free-text history and physical note of their interaction. The students’ notes were scored independently by the standardized patients and ChatGPT using a prespecified scoring rubric that consisted of 85 case elements. The measure of accuracy was percent correct. Results The study population consisted of 168 first-year medical students. There was a total of 14,280 scores. The ChatGPT incorrect scoring rate was 1.0%, and the standardized patient incorrect scoring rate was 7.2%. The ChatGPT error rate was 86% lower than the standardized patient error rate. The ChatGPT mean incorrect scoring rate of 12 (SD 11) was significantly lower than the standardized patient mean incorrect scoring rate of 85 (SD 74; P = .002). Conclusions ChatGPT demonstrated a significantly lower error rate compared to standardized patients. This is the first study to assess the ability of a generative pretrained transformer (GPT) program to score medical students’ standardized patient-based free-text clinical notes. It is expected that, in the near future, large language models will provide real-time feedback to practicing physicians regarding their free-text notes. GPT artificial intelligence programs represent an important advance in medical education and medical practice.
... Developing a method to analyze UD quickly and efficiently in healthcare is important for patient outcomes. While LLMs easily and quickly analyze UD, they lack the context of understanding, and may be boosted by human intervention and interaction (Sarker et al., 2019). ...
... We will then compare the NLP findings with the medical record, and an independent healthcare provider who will complete an assessment. Similar work in health care and clinical research is currently underway (Sarker et al., 2019). As with all methods of data interpretation, NLP is limited in contextual understanding and requires copious amounts of data to provide more specific context, meaning that a diagnosis using NLP alone could be missed. ...
Article
Full-text available
Background Electronic health systems contain large amounts of unstructured data (UD) which are often unanalyzed due to the time and costs involved. Unanalyzed data creates missed opportunities to improve health outcomes. Natural language processing (NLP) is the foundation of generative artificial intelligence (GAI), which is the basis for large language models, such as ChatGPT. NLP and GAI are machine learning methods that analyze large amounts of data in a short time at minimal cost. The ability of NLP to conduct qualitative analyses is increasing, yet the results can lack context and nuance in their findings, requiring human intervention. Methods Our study compared outcomes, time, and costs of a previously published qualitative study. Our approach partnered an NLP model and a qualitative researcher (NLP+). UD from behavioral health patients were analyzed using NLP and a Latent Dirichlet allocation to identify the topics using probability of word coherence scores. The topics were then analyzed by a qualitative researcher, translated into themes, and compared with the original findings. Results The NLP+ method results aligned with the original, qualitatively derived themes. Our model also identified two additional themes which were not originally detected. The NLP+ method required 6 hours of labor, 3 minutes for transcription, and a transcription cost of $1.17. The original, qualitative researcher-only method required more than 36 hours ($2,250) of time and $1,100 for transcription. Conclusions While natural language processing analyzes voluminous amounts of data in seconds, context and nuance in human language are regularly missed. Combining a qualitative researcher with NLP+ could be deployed in many settings, reducing time and costs, and improving context. Until large language models are more prevalent, a human interaction can help translate the patient experience by contextualizing data rich in social determinant indicators which may otherwise go unanalyzed.
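
For readers unfamiliar with the Latent Dirichlet allocation step described above, here is a minimal gensim sketch that fits an LDA model on toy token lists and reports a word-coherence score. The toy documents, topic count, and settings are assumptions for illustration only, not the study's pipeline.

```python
# Minimal LDA topic-modelling sketch with a word-coherence score
# (toy documents and assumed settings; not the study's code).
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

docs = [
    ["housing", "insecurity", "stress", "sleep"],
    ["medication", "side", "effects", "sleep"],
    ["housing", "eviction", "stress", "support"],
    ["medication", "adherence", "support", "family"],
]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0, passes=10)
coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary, coherence="c_v")
print(lda.print_topics())
print("coherence:", coherence.get_coherence())
```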
... We use all-MiniLM-L6-v2 4, which has been pretrained on 1B sentence pairs, as our backbone model for both SBERT-notraining and SBERT. Finally, we use the INCITE system (Sarker et al., 2019), which is specifically developed to score clinical text by capturing a variety of ways clinical concepts can be expressed. INCITE is a rule-based modular pipeline utilizing custom-built lexicons, which contain observed misspellings for medical concepts and non-standard expressions, as well as common concepts and abbreviations from online resources. ...
... The tool performs direct and fuzzy matching between a new response and an annotated response (or a lexicon variant of it) using a fixed or dynamic Levenshtein ratio threshold (in our case, 0.95). Full details about the INCITE system are available in Sarker et al. (2019). ...
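
As a tiny illustration of the Levenshtein-ratio threshold mentioned in the snippet above (the library choice and example strings are assumptions, not the cited setup):

```python
# Accept a response span when its Levenshtein ratio against an annotated
# reference reaches ~0.95 (threshold from the snippet; strings are invented).
import Levenshtein  # pip install python-Levenshtein

reference = "diabetic ketoacidosis"
candidate = "diabetic ketoacidosi"   # student wording missing the final "s"

ratio = Levenshtein.ratio(reference, candidate)
print(ratio, ratio >= 0.95)          # ~0.976, accepted as a match
```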
... The Intelligent Clinical Text Evaluator (INCITE) NLP driver was developed for computer-assisted scoring of patient notes [22]. At a very high level, INCITE scans the text entered in the patient note to identify if specific pre-defined concepts are present. ...
... The key essentials are the information most relevant to assessing a learner's clinical reasoning ability through the PN, and, as such, became the features relevant for the scoring of that case. A brief description of the INCITE engine follows; more granular detail can be found elsewhere [22]. ...
Article
Full-text available
In this op-ed, we discuss the advantages of leveraging natural language processing (NLP) in the assessment of clinical reasoning. Clinical reasoning is a complex competency that cannot be easily assessed using multiple-choice questions. Constructed-response assessments can more directly measure important aspects of a learner’s clinical reasoning ability, but substantial resources are necessary for their use. We provide an overview of INCITE, the Intelligent Clinical Text Evaluator, a scalable NLP-based computer-assisted scoring system that was developed to measure clinical reasoning ability as assessed in the written documentation portion of the now-discontinued USMLE Step 2 Clinical Skills examination. We provide the rationale for building a computer-assisted scoring system that is aligned with the intended use of an assessment. We show how INCITE’s NLP pipeline was designed with transparency and interpretability in mind, so that every score produced by the computer-assisted system could be traced back to the text segment it evaluated. We next suggest that, as a consequence of INCITE’s transparency and interpretability features, the system may easily be repurposed for formative assessment of clinical reasoning. Finally, we provide the reader with the resources to consider in building their own NLP-based assessment tools.
... There has been fragmented effort by individual institutions to train in-house NLP systems for clinical text scoring, with no fully transparent evaluation on public data (Luck et al., 2006;Spickard III et al., 2014;Latifi et al., 2016;Sarker et al., 2019). This has raised questions from a key stakeholder -the medical student community -about potential algorithmic bias and its implications for fairness (Spadafore and Monrad, 2019). ...
... There is often a need to map concepts by combining multiple text segments, or resolve ambiguous negation as in no cold intolerance, hair loss, palpitations or tremor corresponding to the feature lack of other thyroid symptoms. In addition, automated scoring systems should employ a dynamic threshold to determine whether a given feature has been found in a PN, i.e., whether the F1 score for a given identified phrase is high enough for the phrase to be considered a match (Sarker et al., 2019). Finally, to be comparable to human rater performance and thus operationally usable, such systems need to be highly accurate. ...
... Kloehn et al. [24] generate explanations for complex medical terms in Spanish and English using WordNet synonyms and summaries, as well as the word embedding vector as a source of knowledge. Sarker et al. [25] used fuzzy logic and set theory-based methods for learning from a limited number of annotated examples of unstructured answers in health examinations for recognizing correct concepts with an average F1-measure of 0.89. Zhou et al. [26] used deep learning models pretrained on the general text sources to learn knowledge for information extraction from the medical texts. ...
... Kloehn et al. [24] proposed a novel algorithm, SubSimplify, which improved quality in English and Spanish by providing multiword explanations for difficult terms, though there is a possibility of the proposed model generating incomplete explanations. Sarker et al. [25] used a combination of fuzzy matching and intersection, which increased accuracy against human annotations but was unable to detect negations in expressions. ... (5) ranking of candidate answers. The implementation of the diagnosis system framework was done using the Python language due to the following functionalities: cross-platform support and high availability of third-party libraries for tasks relating to machine learning and NLP. The system uses Python library packages to access the machine learning functions and NLP needed for categorization. ...
Article
Full-text available
The use of natural language processing (NLP) methods and their application to developing conversational systems for health diagnosis increases patients' access to medical knowledge. In this study, a chatbot service was developed for the Covenant University Doctor (CUDoctor) telehealth system based on fuzzy logic rules and fuzzy inference. The service focuses on assessing the symptoms of tropical diseases in Nigeria. Telegram Bot Application Programming Interface (API) was used to create the interconnection between the chatbot and the system, while Twilio API was used for interconnectivity between the system and a short messaging service (SMS) subscriber. The service uses the knowledge base consisting of known facts on diseases and symptoms acquired from medical ontologies. A fuzzy support vector machine (SVM) is used to effectively predict the disease based on the symptoms inputted. The inputs of the users are recognized by NLP and are forwarded to the CUDoctor for decision support. Finally, a notification message displaying the end of the diagnosis process is sent to the user. The result is a medical diagnosis system which provides a personalized diagnosis utilizing self-input from users to effectively diagnose diseases. The usability of the developed system was evaluated using the system usability scale (SUS), yielding a mean SUS score of 80.4, which indicates the overall positive evaluation. Copyright © 2020 Nicholas A. I. Omoregbe et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
... Prior approaches to automated grading of text documents at other institutions have relied primarily on conventional supervised machine learning, the performance of which hinged on adequately sized training datasets and sufficient examples of student responses graded by expert humans. [11][12][13][14][15][16][17][18] However, the recent release of advanced frontier large language models (LLMs), such as Generative Pretrained Transformer (GPT) 4, 19 has enabled a quantum leap in creating automated grading systems. Indeed, at least one institution has reported a retrospective analysis of ChatGPT (GPT-3.5)-based ...
Article
This case study, conducted at UT Southwestern Medical Center’s Simulation Center, describes the first successful prospective deployment of a generative artificial intelligence (AI)–based automated grading system for medical student post-encounter Objective Structured Clinical Examination (OSCE) notes. The OSCE is a standard approach to measuring the competence of medical students by their participation in live-action, simulated patient encounters with human actors. The post-encounter learner note is a vital element of the OSCE, and accurate assessment of student performance requires specially trained manual evaluators, which imposes significant labor and time investments. The Simulation Center at UT Southwestern provides a compelling platform for observing the benefits and challenges of AI-based enhancements in medical education at scale. To that end, we prospectively activated a first-pass AI grading system at the center for 245 (preclerkship) medical students participating in a 10-station fall 2023 OSCE session. Our inaugural deployment of the AI notes grading system reduced human effort by an estimated 91% (as measured by gradable items) and dramatically reduced turnaround time (from weeks to days). Conceived as a zero-shot large language model architecture with minimal prompt engineering, the system requires no prior domain-specific training data and can be readily adapted for new evaluation rubrics, opening the door to scaling this approach to other institutions. Confidence in our zero-shot Generative Pretrained Transformer 4 (GPT-4) framework was established by pre-deployment of retrospective evaluations. With the OSCE in prior years, the system achieved up to 89.7% agreement with human expert graders at the rubric item level (Cohen’s kappa, 0.79) and a Spearman’s correlation of 0.86 with the total examination score. We also demonstrate that local, smaller, open-source models (such as Llama-2-7B) can be fine-tuned via knowledge distillation from frontier models like GPT-4 to achieve similar performance, thereby indicating important operational implications for scalability, data privacy, security, and model control. These achievements were the result of a strategic, multiyear effort to pivot toward AI that was begun prior to ChatGPT’s release. In addition to highlighting the model’s performance and capabilities (including a retrospective analysis of 1124 students, 10,175 post-encounter notes, and 156,978 scored items), we share observations on the development and sign-off prior to the launch of an AI deployment protocol for our program. (Funded by UT Southwestern institutional funds and others.)
... As a future direction and application, the integration of ChatGPT into drug delivery research not only transforms current methodologies but also paves the way for future innovations and advancements. Their potential applications in ongoing and forthcoming research initiatives hold promising avenues for the pharmaceutical industry [27]. Enhanced Drug Targeting and Delivery Systems: As research progresses, ChatGPT could play an instrumental role in refining drug targeting strategies and developing more precise delivery systems. ...
Article
Full-text available
This study aims to delineate the pivotal role of ChatGPT, an Artificial intelligence-driven (AI) language model, in revolutionizing drug delivery research within the pharmaceutical sciences domain. The investigation adopted a structured approach involving systematic literature exploration across databases such as PubMed, ScienceDirect, IEEE Xplore, and Google Scholar. A selection criterion emphasizing peer-reviewed articles, conference proceedings, patents, and seminal texts highlights the integration of AI-driven chatbots, specifically ChatGPT, into various facets of drug delivery research and development. ChatGPT exhibits multifaceted contributions to drug delivery innovation, streamlining drug formulation optimization, predictive modeling, regulatory compliance, and fostering patient-centric approaches. Real-world case studies have underscored its efficacy in expediting drug development timelines and enhancing research efficiency. This paper delves into the diverse applications of ChatGPT, showcasing its potential across drug delivery systems. It elucidates its capabilities in accelerating research phases, facilitating formulation development, predictive modeling for efficacy and safety, and simplifying regulatory compliance. This discussion outlines the transformative impact of ChatGPT in reshaping drug delivery methodologies. In conclusion, ChatGPT, an AI-driven chatbot, has emerged as a transformative tool in pharmaceutical research. Their integration expedites drug development pipelines, ensures effective drug delivery solutions, and augments healthcare advancements. Embracing AI tools such as ChatGPT has become pivotal in evolving drug delivery methodologies for global patient welfare.
... Their accuracy over 21 cases and 105 notes was a sensitivity of 0.91 and a PPV of 0.87. If NLP programs are to be used to automate the grading of students' notes, they must achieve an acceptable accuracy. Sarker et al. 17 suggested that any method of scoring medical notes should achieve an accuracy close to 100%. Regrettably, none of the reported medical education NLPs achieved an acceptable accuracy. ...
Preprint
Full-text available
Background Teaching medical students the skills required to acquire, interpret, apply, and communicate clinical information is an integral part of medical education. A crucial aspect of this process involves providing students with feedback regarding the quality of their free-text clinical notes. The objective of this project is to assess the ability of ChatGPT 3.5 (ChatGPT) to score medical students’ free text history and physical notes. Methods This is a single institution, retrospective study. Standardized patients learned a prespecified clinical case and, acting as the patient, interacted with medical students. Each student wrote a free text history and physical note of their interaction. ChatGPT is a large language model (LLM). The students’ notes were scored independently by the standardized patients and ChatGPT using a prespecified scoring rubric that consisted of 85 case elements. The measure of accuracy was percent correct. Results The study population consisted of 168 first year medical students. There was a total of 14,280 scores. The standardized patient incorrect scoring rate (error) was 7.2% and the ChatGPT incorrect scoring rate was 1.0%. The ChatGPT error rate was 86% lower than the standardized patient error rate. The standardized patient mean incorrect scoring rate of 85 (SD 74) was significantly higher than the ChatGPT mean incorrect scoring rate of 12 (SD 11), p = 0.002. Conclusions ChatGPT had a significantly lower error rate than the standardized patients. This suggests that an LLM can be used to score medical students’ notes. Furthermore, it is expected that, in the near future, LLM programs will provide real time feedback to practicing physicians regarding their free text notes. Generative pretrained transformer artificial intelligence programs represent an important advance in medical education and in the practice of medicine.
... For example, studies have leveraged NLP of clinical notes for studying chronic diseases, 4 extracting critical HIV and cardiovascular risk information, 5,6 analyzing critical limb ischemia, 7 and extracting ad hoc concepts. 8 An effective NLP system, which can automatically identify Fontan cases from text notes in EHRs, will help improve the efficiency of creating Fontan cohorts, and hence, provide an additional method of conducting healthrelated studies on Fontan cases. ...
Article
Full-text available
Background The Fontan operation is associated with significant morbidity and premature mortality. Fontan cases cannot always be identified by International Classification of Diseases (ICD) codes, making it challenging to create large Fontan patient cohorts. We sought to develop natural language processing–based machine learning models to automatically detect Fontan cases from free texts in electronic health records, and compare their performances with ICD code–based classification. Methods and Results We included free-text notes of 10 935 manually validated patients, 778 (7.1%) Fontan and 10 157 (92.9%) non-Fontan, from 2 health care systems. Using 80% of the patient data, we trained and optimized multiple machine learning models, support vector machines and 2 versions of RoBERTa (a robustly optimized transformer-based model for language understanding), for automatically identifying Fontan cases based on notes. For RoBERTa, we implemented a novel sliding window strategy to overcome its length limit. We evaluated the machine learning models and ICD code–based classification on 20% of the held-out patient data using the F1 score metric. The ICD classification model, support vector machine, and RoBERTa achieved F1 scores of 0.81 (95% CI, 0.79–0.83), 0.95 (95% CI, 0.92–0.97), and 0.89 (95% CI, 0.88–0.85) for the positive (Fontan) class, respectively. Support vector machines obtained the best performance (P<0.05), and both natural language processing models outperformed ICD code–based classification (P<0.05). The sliding window strategy improved performance over the base model (P<0.05) but did not outperform support vector machines. ICD code–based classification produced more false positives. Conclusions Natural language processing models can automatically detect Fontan patients based on clinical notes with higher accuracy than ICD codes, and the former demonstrated the possibility of further improvement.
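
The sliding-window idea referred to above can be sketched with a Hugging Face fast tokenizer, which splits a long note into overlapping chunks whose predictions are then aggregated. The window and stride sizes, the untrained roberta-base classification head, and the max-pooling of chunk probabilities below are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch of a sliding-window strategy for notes longer than a transformer's
# input limit (window/stride values and max-aggregation are assumptions).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

long_note = "Status post Fontan palliation for hypoplastic left heart syndrome. " * 200

# Split the note into overlapping 512-token windows with a 64-token stride.
enc = tokenizer(long_note, truncation=True, max_length=512, stride=64,
                return_overflowing_tokens=True, padding="max_length",
                return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids=enc["input_ids"],
                   attention_mask=enc["attention_mask"]).logits

chunk_probs = logits.softmax(dim=-1)[:, 1]   # per-chunk positive-class probability
print("chunks:", len(chunk_probs), "document-level score:", chunk_probs.max().item())
```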
... More complex approaches that make greater use of NLP technology include c-rater – also developed by Educational Testing Service (Leacock & Chodorow, 2003; Liu et al., 2014; Sukkarieh & Blackmore, 2009) – and INCITE, the procedure that is the focus of this chapter (Sarker et al., 2019). Again, the approaches used in these systems are designed to identify specific scorable concepts in the response. ...
... NLP, a subdiscipline of computer science concerned with the use of computers to process information from human language, has been widely applied to clinical notes for creating healthcare applications. For example, studies have leveraged NLP of clinical notes for studying chronic diseases 3, extracting critical human immunodeficiency virus and cardiovascular risk information 4,5, analyzing critical limb ischemia 6, and extracting ad-hoc concepts 7. An effective NLP system, which can automatically identify Fontan cases from text notes in EHRs ...
Preprint
Full-text available
Background The Fontan operation palliates single ventricle heart defects and is associated with significant morbidity and premature mortality. Native anatomy varies; thus, Fontan cases cannot always be identified by International Classification of Diseases, Ninth and Tenth Revision, Clinical Modification (ICD-9-CM and ICD-10-CM) codes, making it challenging to create large Fontan patient cohorts. We sought to develop natural language processing (NLP) based machine learning (ML) models, which utilize free text notes of patients, to automatically detect Fontan cases, and compare their performances with ICD code based classification. Methods and Results We included free text notes of 10,935 manually validated patients, of whom 778 (7.1%) were Fontan and 10,157 (92.9%) non-Fontan patients, from two large, diverse healthcare systems. Using 5-fold cross validation, we trained and evaluated multiple ML models, namely support vector machines (SVM) and a transformer based model for language understanding named RoBERTa (2 versions), for automatically identifying Fontan cases based on free text notes. To optimize classifier performances, we experimented with different text representation techniques, including a sliding window strategy to overcome the length limit imposed by RoBERTa. We compared the performances of the ML models to ICD code based classification using the F1 score metric. The ICD classification model, SVM, and RoBERTa achieved F1 scores of 0.81 (95% CI: 0.79-0.83), 0.95 (95% CI: 0.92-0.97), and 0.89 (95% CI: 0.88-0.85) for the positive (Fontan) class, respectively. SVM obtained the best performance (p<0.05), and both NLP models outperformed ICD code based classification (p<0.05). The novel sliding window strategy improved performance over the base RoBERTa model (p<0.05) but did not outperform SVM. ICD code based classification tended to have more false positives compared to both NLP models. Conclusions Our proposed NLP models can automatically detect Fontan patients based on clinical notes with higher accuracy than ICD codes. Since the sensitivity of ICD codes is high but the positive predictive value is low, it may be beneficial to apply ICD codes as a filter prior to applying NLP/ML to achieve optimal performance.
... We established an App-based e-training platform using the lean thinking in order to improve resident training assessment. To be specific, we used natural language processing (NLP) technology [23][24][25][26] to extract features and convert them to a normalized data structure. The department of education provided training on how to organize the resident assessment. ...
Article
Full-text available
Background The assessment system for standardized resident training is crucial for developing competent doctors. However, it is complex, making it difficult to manage. The COVID-19 pandemic has also aggravated the difficulty of assessment. We, therefore, integrated lean thinking with App-based e-training platform to improve the assessment process through Define–Measure–Analyze–Improve–Control (DMAIC) cycles. This was designed to avoid unnecessary activities that generate waste. Methods Panels and online surveys were conducted in 2021–2022 to find the main issues that affect resident assessment and the root causes under the frame of waste. An online app was developed. Activities within the process were improved by brainstorming. Online surveys were used to improve the issues, satisfaction, and time spent on assessment using the app. Results A total of 290 clinical educators in 36 departments responded to the survey, and 153 clinical educators used the online app for assessment. Unplanned delay or cancellation was defined as the main issue. Eleven leading causes accounted for 87.5% of the issues. These were examiner time conflict, student time conflict, insufficient examiners, supervisor time conflict, grade statistics, insufficient exam assistants, reporting results, material archiving, unfamiliarity with the process, uncooperative patients, and feedback. The median rate of unplanned delay or cancellation was lower with use of the app (5% vs 0%, P < 0.001), and satisfaction increased (P < 0.001). The median time saved by the app across the whole assessment process was 60 (interquartile range 60–120) minutes. Conclusions Lean thinking integrated with an App-based e-training platform could optimize the process of resident assessment. This could reduce waste and promote teaching and learning in medical education.
... Several types of information-retrieval based methods have been used to develop medical dialog systems, including medical dialog consulting systems based on entity inference [2,3], slot filling [4,5], symptom matching and extraction [6,7], and relation prediction [8,9]. Consequently, neural network models have been shown to be more accurate in learning different doctor-like responses. ...
Article
Full-text available
The world was taken aback when the Covid-19 pandemic hit in 2019. Ever since, precautions have been taken to prevent the spreading or mutating of the virus, but the virus still keeps spreading and mutating. Scientists predict that the virus is going to stay for a long time but with reduced effectiveness. Recognizing the symptoms of the virus is essential in order to provide proper treatment for the virus. Visiting hospitals for consultation becomes quite difficult when people are supposed to maintain social distancing. Recently neural network generative models have shown impressive abilities in developing chatbots. However, using these neural network generative models that lack the required Covid-specific knowledge to develop a Covid consulting system makes them difficult to scale. In order to bridge the gap between patients and a limited number of doctors, we have proposed a Covid consulting agent that integrates the medical knowledge of Covid-19 with the neural network generative models. This system will automatically scan the dialogues of patients seeking a consultation to recognize the symptoms of Covid-19. The transformer and pretrained systems of BERT-GPT and GPT were fine-tuned on the CovidDialog-English dataset to generate responses for Covid-19 which were doctor-like and clinically meaningful, to further solve the problem of the surging demand for medical consultations compared to the limited number of medical professionals. The results are evaluated and compared using multiple evaluation metrics, which are NIST-n, perplexity, BLEU-n, METEOR, Entropy-n and Dist-n. In this paper, we also hope to prove that the results obtained from the automated dialogue systems were significantly similar to human evaluation. Furthermore, the evaluation shows that the state-of-the-art BERT-GPT performs better.
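
Two of the automatic metrics listed above can be illustrated in a few lines: sentence-level BLEU via NLTK and a hand-rolled Dist-n (distinct n-gram ratio). The toy reference and hypothesis strings are invented, and this is not the paper's evaluation code.

```python
# Toy illustration of two of the automatic metrics mentioned above:
# sentence-level BLEU (via NLTK) and distinct-n (Dist-n) diversity.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["please", "isolate", "and", "monitor", "your", "temperature"]
hypothesis = ["please", "isolate", "and", "check", "your", "temperature"]

bleu = sentence_bleu([reference], hypothesis,
                     smoothing_function=SmoothingFunction().method1)

def distinct_n(tokens, n):
    """Fraction of n-grams in the output that are unique (Dist-n)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

print(f"BLEU: {bleu:.3f}, Dist-1: {distinct_n(hypothesis, 1):.3f}, "
      f"Dist-2: {distinct_n(hypothesis, 2):.3f}")
```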
... First of all, we propose to build our metric based on the fact that framing bias is closely associated with polarity. Both model-based and lexicon-based polarity detection approaches are options for our work, and we leverage the latter for the following reasons: 1) There is increasing demand for interpretability in the field of NLP (Belinkov et al., 2020;Sarker et al., 2019), and the lexicon-based approach is more interpretable (provides token-level human interpretable annotation) compared to blackbox neural models. 2) In the context of framing bias, distinguishing the subtle nuance of words between synonyms is crucial (e.g., dead vs. murdered). ...
... On the other hand, inductive learning predicts the labels of unseen data using the labeled and unlabeled data provided during the training phase [12]. Semi-supervised learning methods have shown great success in areas such as image recognition and natural language processing [15][16][17][18][19], and they have been applied to a diverse set of problems in biomedical science including image classification [20][21][22][23][24] and medical language processing [25][26][27][28]. These methods have been applied to classification tasks using image and natural language data, which rely on spatial and semantic structure, i.e., the spatial correlations between pixels in images and the sequential correlations between words in text. ...
Article
Full-text available
Background Precision medicine for cancer treatment relies on an accurate pathological diagnosis. The number of known tumor classes has increased rapidly, and reliance on traditional methods of histopathologic classification alone has become unfeasible. To help reduce variability, validation costs, and standardize the histopathological diagnostic process, supervised machine learning models using DNA-methylation data have been developed for tumor classification. These methods require large labeled training data sets to obtain clinically acceptable classification accuracy. While there is abundant unlabeled epigenetic data across multiple databases, labeling pathology data for machine learning models is time-consuming and resource-intensive, especially for rare tumor types. Semi-supervised learning (SSL) approaches have been used to maximize the utility of labeled and unlabeled data for classification tasks and are effectively applied in genomics. SSL methods have not yet been explored with epigenetic data nor demonstrated beneficial to central nervous system (CNS) tumor classification. Results This paper explores the application of semi-supervised machine learning on methylation data to improve the accuracy of supervised learning models in classifying CNS tumors. We comprehensively evaluated 11 SSL methods and developed a novel combination approach that included a self-training with editing using support vector machine (SETRED-SVM) model and an L2-penalized, multinomial logistic regression model to obtain high confidence labels from a few labeled instances. Results across eight random forest and neural net models show that the pseudo-labels derived from our SSL method can significantly increase prediction accuracy for 82 CNS tumors and 9 normal controls. Conclusions The proposed combination of semi-supervised technique and multinomial logistic regression holds the potential to leverage the abundant publicly available unlabeled methylation data effectively. Such an approach is highly beneficial in providing additional training examples, especially for scarce tumor types, to boost the prediction accuracy of supervised models.
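
The general self-training idea behind such pseudo-labelling pipelines can be sketched with scikit-learn's SelfTrainingClassifier, where unlabeled samples are marked with -1 and a probabilistic SVM supplies high-confidence pseudo-labels. The synthetic data and confidence threshold below are assumptions, and this is not the paper's SETRED-SVM model.

```python
# Generic self-training sketch (scikit-learn) illustrating pseudo-labelling;
# this is not the paper's SETRED-SVM implementation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
y_partial = y.copy()
rng = np.random.default_rng(0)
y_partial[rng.random(500) < 0.9] = -1      # -1 marks the ~90% unlabeled samples

base = SVC(probability=True, random_state=0)
model = SelfTrainingClassifier(base, threshold=0.9).fit(X, y_partial)
print("accuracy against all true labels:", model.score(X, y))
```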
... In this section, we introduce the entity disambiguation module, which consists of named entity recognition and contrastive pre-training technologies. For named entity recognition, we implement the method of Sarker et al. (2019) for recognizing the medical entities in the utterance. We achieve an accuracy of 90.9% on the simple medical entity recognition dataset of the IFLYTK 5, which ranks in the top 3 of the competition. ...
Preprint
The medical conversational system can relieve the burden of doctors and improve the efficiency of healthcare, especially during the pandemic. This paper presents a medical conversational question answering (CQA) system based on the multi-modal knowledge graph, namely "LingYi", which is designed as a pipeline framework to maintain high flexibility. Our system utilizes automated medical procedures including medical triage, consultation, image-text drug recommendation and record. To conduct knowledge-grounded dialogues with patients, we first construct a Chinese Medical Multi-Modal Knowledge Graph (CM3KG) and collect a large-scale Chinese Medical CQA (CMCQA) dataset. Compared with the other existing medical question-answering systems, our system adopts several state-of-the-art technologies including medical entity disambiguation and medical dialogue generation, which is more friendly to provide medical services to patients. In addition, we have open-sourced our codes which contain back-end models and front-end web pages at https://github.com/WENGSYX/LingYi. The datasets including CM3KG at https://github.com/WENGSYX/CM3KG and CMCQA at https://github.com/WENGSYX/CMCQA are also released to further promote future research.
... There are options of model-based and lexicon-based polarity detection approaches, and we leverage the latter for the following reasons. 1) There is increasing demand for interpretability in the field of NLP (Belinkov et al., 2020;Sarker et al., 2019), and the lexicon-based approach is more interpretable (provides token-level human interpretable annotation) compared to black-box neural models. 2) In the context of framing bias, distinguishing the subtle nuance of words between synonyms is crucial (e.g., dead vs. murdered). ...
Preprint
Full-text available
Media framing bias can lead to increased political polarization, and thus, the need for automatic mitigation methods is growing. We propose a new task, a neutral summary generation from multiple news headlines of the varying political spectrum, to facilitate balanced and unbiased news reading. In this paper, we first collect a new dataset, obtain some insights about framing bias through a case study, and propose a new effective metric and models for the task. Lastly, we conduct experimental analyses to provide insights about remaining challenges and future directions. One of the most interesting observations is that generation models can hallucinate not only factually inaccurate or unverifiable content, but also politically biased content.
... Numerous studies have used NLP-based approaches to detect patterns in texts, such as for computational phenotyping from electronic health records, 9 detecting and disambiguating geographical entities, 10 finding measurements in radiology narratives 11 and detecting demographic information such as age and gender from patient notes. 12 While many studies employed manually-crafted patterns, others focused on creating specialized lexicons for entity recognition and extraction from noisy free texts, such as those from electronic health records and social media. [13][14][15] A major drawback of lexicon-based approaches is that lexicons are static and new lexicons need to be built for every new problem domain. ...
Article
Full-text available
Background Value sets are lists of terms (e.g., opioid medication names) and their corresponding codes from standard clinical vocabularies (e.g., RxNorm) created with the intent of supporting health information exchange and research. Value sets are manually-created and often exhibit errors. Objectives The aim of the study is to develop a semi-automatic, data-centric natural language processing (NLP) method to assess medication-related value set correctness and evaluate it on a set of opioid medication value sets. Methods We developed an NLP algorithm that utilizes value sets containing mostly true positives and true negatives to learn lexical patterns associated with the true positives, and then employs these patterns to identify potential errors in unseen value sets. We evaluated the algorithm on a set of opioid medication value sets, using the recall, precision and F1-score metrics. We applied the trained model to assess the correctness of unseen opioid value sets based on recall. To replicate the application of the algorithm in real-world settings, a domain expert manually conducted error analysis to identify potential system and value set errors. Results Thirty-eight value sets were retrieved from the Value Set Authority Center, and six (two opioid, four non-opioid) were used to develop and evaluate the system. Average precision, recall, and F1-score were 0.932, 0.904, and 0.909, respectively on uncorrected value sets; and 0.958, 0.953, and 0.953, respectively after manual correction of the same value sets. On 20 unseen opioid value sets, the algorithm obtained average recall of 0.89. Error analyses revealed that the main sources of system misclassifications were differences in how opioids were coded in the value sets—while the training value sets had generic names mostly, some of the unseen value sets had new trade names and ingredients. Conclusion The proposed approach is data-centric, reusable, customizable, and not resource intensive. It may help domain experts to easily validate value sets.
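
One way such a data-centric lexical check could look in code is sketched below: the known true-positive entries supply an ingredient vocabulary, and unseen entries whose ingredient token is lexically unrelated are flagged for review. The entries, the 6-character prefix heuristic, and the cut-off are illustrative assumptions, not the published algorithm.

```python
# Illustrative only (not the paper's algorithm): learn ingredient tokens from
# known true-positive opioid entries and flag unseen entries whose ingredient
# is lexically unrelated to any of them.
true_positives = [
    "oxycodone hydrochloride 5 mg oral tablet",
    "morphine sulfate 10 mg oral solution",
    "hydromorphone hydrochloride 2 mg oral tablet",
]
known_ingredients = {name.split()[0] for name in true_positives}

def related(token: str) -> bool:
    """Crude lexical-pattern check: exact match or shared 6-character prefix."""
    token = token.lower()
    return any(token == k or token[:6] == k[:6] for k in known_ingredients)

for entry in ["oxycodone 10 mg oral tablet",
              "hydromorphone 4 mg oral tablet",
              "ibuprofen 200 mg oral tablet"]:
    status = "consistent" if related(entry.split()[0]) else "potential error"
    print(f"{entry} -> {status}")
```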
... In general, medical dialogue methods can be divided into information retrieval-based methods and neural generative methods according to the types of NLP techniques applied. The retrieval-based methods can be further classified into different subtypes, such as entity inference [12,13], relation prediction [14,15], symptom matching and extraction [16,17], and slot filling [18][19][20]. However, the retrieval-based methods are not especially intelligent or flexible, as they require a well-defined, user-built question and answer (Q&A) pool that can offer different potential responses to different kinds of question. ...
Article
Full-text available
Using natural language processing (NLP) technologies to develop medical chatbots makes the diagnosis of the patient more convenient and efficient, which is a typical application in healthcare AI. Because of its importance, a large body of research has emerged. Recently, neural generative models have shown impressive ability as the core of chatbots, but they cannot scale well when directly applied to medical conversation due to the lack of medical-specific knowledge. To address this limitation, a scalable medical knowledge-assisted mechanism (MKA) is proposed in this paper. The mechanism is aimed at assisting general neural generative models to achieve better performance on the medical conversation task. The medical-specific knowledge graph is designed within the mechanism, which contains 6 types of medical-related information, including department, drug, check, symptom, disease, and food. Besides, a specific token concatenation policy is defined to effectively inject medical information into the input data. Evaluation of our method is carried out on two typical medical datasets, MedDG and MedDialog-CN. The evaluation results demonstrate that models combined with our mechanism outperform the original methods in multiple automatic evaluation metrics. Besides, MKA-BERT-GPT achieves state-of-the-art performance.
... The highlights show the global attention between the related articles and the query article, and the local attention within the query article. The weights are averaged and set to three scalar values, following [28], to make the visualization simple [29]. As depicted in Figure 1, the extraction of SARS-CoV-2 is due to the highly matched context about COVID-19 (the top related article) and the last sentence. ...
Article
Full-text available
The rapidly evolving literature of COVID-19 related articles makes it challenging for NLP models to be effectively trained for information retrieval and extraction with the corresponding labeled data that follows the current distribution of the pandemic. On the other hand, due to the uncertainty of the situation, human experts' supervision would always be required to double-check the decision-making of these models highlighting the importance of interpretability. In the light of these challenges, this study proposes an interpretable self-supervised multi-task learning model to jointly and effectively tackle the tasks of information retrieval (IR) and extraction (IE) during the current emergency health crisis situation. Our results show that our model effectively leverages multi-task and self-supervised learning to improve generalization, data efficiency, and robustness to the ongoing dataset shift problem. Our model outperforms baselines in IE and IR tasks, respectively by the micro-f score of 0.08 (LCA-F score of 0.05), and MAP of 0.05 on average. In IE the zero-and few-shot learning performances are on average 0.32 and 0.19 micro-f score higher than those of the baselines.
... The highlights show the global attention between the related articles and the query article, and the local attention within the query article. The weights are averaged and set to three scalar values, following [28], to make the visualization simple [29]. As depicted in Figure 1, the extraction of SARS-CoV-2 is due to the highly matched context about COVID-19 (the top related article) and the last sentence. ...
Preprint
Full-text available
The rapidly evolving literature of COVID-19 related articles makes it challenging for NLP models to be effectively trained for information retrieval and extraction with the corresponding labeled data that follows the current distribution of the pandemic. On the other hand, due to the uncertainty of the situation, human experts' supervision would always be required to double-check the decision making of these models, highlighting the importance of interpretability. In the light of these challenges, this study proposes an interpretable self-supervised multi-task learning model to jointly and effectively tackle the tasks of information retrieval (IR) and extraction (IE) during the current emergency health crisis situation. Our results show that our model effectively leverages multi-task and self-supervised learning to improve generalization, data efficiency and robustness to the ongoing dataset shift problem. Our model outperforms baselines in IE and IR tasks, respectively by a micro-f score of 0.08 (LCA-F score of 0.05), and MAP of 0.05 on average. In IE the zero- and few-shot learning performances are on average 0.32 and 0.19 micro-f score higher than those of the baselines.
Article
Full-text available
Purpose The study aims to identify the status quo of artificial intelligence in entrepreneurship education with a view to identifying potential research gaps, especially in the adoption of certain intelligent technologies and pedagogical designs applied in this domain. Design/methodology/approach A scoping review was conducted using six inclusive and exclusive criteria agreed upon by the author team. The collected studies, which focused on the adoption of AI in entrepreneurship education, were analysed by the team with regards to various aspects including the definition of intelligent technology, research question, educational purpose, research method, sample size, research quality and publication. The results of this analysis were presented in tables and figures. Findings Educators introduced big data and algorithms of machine learning in entrepreneurship education. Big data analytics use multimodal data to improve the effectiveness of entrepreneurship education and spot entrepreneurial opportunities. Entrepreneurial analytics analyses entrepreneurial projects with low costs and high effectiveness. Machine learning relieves educators’ burdens and improves the accuracy of the assessment. However, AI in entrepreneurship education needs more sophisticated pedagogical designs in diagnosis, prediction, intervention, prevention and recommendation, combined with specific entrepreneurial learning content and entrepreneurial procedure, obeying entrepreneurial pedagogy. Originality/value This study holds significant implications as it can shift the focus of entrepreneurs and educators towards the educational potential of artificial intelligence, prompting them to consider the ways in which it can be used effectively. By providing valuable insights, the study can stimulate further research and exploration, potentially opening up new avenues for the application of artificial intelligence in entrepreneurship education.
Article
Full-text available
This study used Kaggle data, the ASAP data set, and applied NLP and Bidirectional Encoder Representations from Transformers (BERT) for corpus processing and feature extraction, and applied different machine learning models, both traditional machine-learning classifiers and neural-network-based approaches. Supervised learning models were used for the scoring system, where six of the eight essay prompts were trained separately and concatenated. Compared with previous studies, we found that adding more features, such as readability scores using Spacy Textsta, improved the prediction results for the essay scoring system. The neural network model, trained on all prompt data and utilizing NLP for corpus processing and feature extraction, performed better than the other models with an overall test quadratic weighted kappa (QWK) of 0.9724. It achieved the highest QWK score of 0.859 for prompt 1 and an average QWK of 0.771 across all 6 prompts, making it the best-performing machine learning model that was tested.
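For readers unfamiliar with the two ingredients highlighted above, the minimal sketch below computes quadratic weighted kappa between human and model scores and derives one readability feature; scikit-learn and the textstat package are assumed stand-ins, not necessarily the exact tools used in the cited study.

from sklearn.metrics import cohen_kappa_score
import textstat  # assumed readability package

# Agreement between human and model essay scores (toy data).
human = [2, 3, 4, 4, 1, 3]
model = [2, 3, 3, 4, 2, 3]
print("QWK =", round(cohen_kappa_score(human, model, weights="quadratic"), 3))

# One extra numeric feature of the kind described above: a readability score.
essay = "The experiment shows that plants grow faster with more light."
print("Flesch reading ease =", textstat.flesch_reading_ease(essay))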
Article
Full-text available
This study used feature extraction and machine learning to analyze Kaggle data, namely the ASAP data set. Specifically, Natural Language Processing (NLP) and Bidirectional Encoder Representations from Transformers (BERT) were applied for corpus processing and feature extraction, and different machine learning models were covered, including traditional machine-learning classifiers and neural-network-based methods. Supervised learning models were used for the scoring system, with six of the eight writing prompts trained separately or jointly. Compared with previous research, this study found that (1) increasing the number of features (such as readability scores from Spacy Textsta) improved the predictive ability of the essay scoring system, and (2) the neural network model that used NLP for corpus processing and feature extraction performed better than the other models when trained on all writing prompts jointly, with an overall quadratic weighted kappa (QWK) of 0.9724. Prompt 1 had the highest QWK, at 0.859, and the average QWK across all six prompts was 0.771.
Article
Purpose: Scoring post-encounter patient notes (PNs) yields significant insights into student performance, but the resource intensity of scoring limits its use. Recent advances in natural language processing (NLP) and machine learning (ML) allow the application of automated short answer grading (ASAG) for this task. This retrospective study evaluated the psychometric characteristics and reliability of an ASAG system for PNs and secondarily considered factors contributing to implementation, including feasibility and the case-specific phrase annotation required to tune the system for a new case. Method: PNs from standardized patient (SP) cases within a graduation competency exam were used to train the ASAG system, applying a feed-forward neural network algorithm for scoring. Using faculty phrase-level annotation, 10 PNs per case were required to tune the ASAG system. After tuning, ASAG item-level ratings for 20 notes were compared across ASAG-faculty (4 cases, 80 pairings) and ASAG-non-faculty (2 cases, 40 pairings) pairs. Psychometric characteristics were examined using item analysis and Cronbach's alpha. Inter-rater reliability (IRR) was examined using kappa. Results: ASAG scores demonstrated sufficient variability in differentiating learner PN performance and high IRR between machine-human ratings. Across all items, the ASAG-faculty scoring mean kappa was 0.83 (SE ±0.02). The ASAG-non-faculty pairings kappa was 0.83 (SE ±0.02). The ASAG scoring demonstrated high item discrimination. Internal consistency reliability values at the case level ranged from a Cronbach's alpha of 0.65 to 0.77. Faculty time cost to train and supervise non-faculty raters for 4 cases is approximately $1,856. Faculty cost to tune the ASAG system is approximately $928. Conclusions: NLP-based automated scoring of PNs demonstrated a high degree of reliability and psychometric confidence for use as learner feedback. The small number of phrase-level annotations required to tune the system to a new case enhances feasibility. ASAG-enabled PN scoring has broad implications for improving feedback in case-based learning contexts in medical education.
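The exact network and features behind the cited system are not reproduced here; as a rough illustration of phrase-level ASAG with a small feed-forward network, the sketch below trains scikit-learn's MLPClassifier on TF-IDF features of annotated phrases (the phrases, labels, and model choice are all illustrative assumptions).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Toy phrase-level annotations: 1 = phrase earns credit, 0 = it does not.
phrases = ["reports chest pain", "denies chest pain", "dyspnea on exertion",
           "no shortness of breath", "complains of dyspnea", "no chest pain"]
labels  = [1, 0, 1, 0, 1, 0]

asag = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                     MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                                   random_state=0))
asag.fit(phrases, labels)
print(asag.predict(["patient reports chest pain", "denies dyspnea"]))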
Article
Full-text available
A health examination focuses primarily on primary and secondary prevention, namely comprehensively detecting health factors that can cause certain diseases in the future. Health screening is important because it can help prevent degenerative diseases. Students are expected to be actively involved in developing community knowledge, including in the health sector. Through the Community Service Program (KKN), students carried out blood type checks and free health checks for the community in the Kejambon Kidul hamlet. The activities were implemented using a direct approach divided into two stages: a preparation stage and an implementation stage. The activity was attended by 68 people, consisting of 15 men and 53 women, with the following distribution of blood types: A Rh+ in 18 people, B Rh+ in 16 people, AB Rh+ in 10 people, O Rh+ in 22 people, and O Rh- in 1 person. The blood pressure examination showed that 13 people had abnormal blood pressure: 12 with hypertension and 1 with hypotension.
Article
Full-text available
The capabilities of natural language processing (NLP) methods have expanded significantly in recent years, and progress has been particularly driven by advances in data science and machine learning. However, NLP is still largely underused in patient-oriented clinical research and care (POCRC). A key reason behind this is that clinical NLP methods are typically developed, optimized, and evaluated with narrowly focused data sets and tasks (eg, those for the detection of specific symptoms in free texts). Such research and development (R&D) approaches may be described as problem oriented, and the developed systems perform specialized tasks well. As standalone systems, however, they generally do not comprehensively meet the needs of POCRC. Thus, there is often a gap between the capabilities of clinical NLP methods and the needs of patient-facing medical experts. We believe that to increase the practical use of biomedical NLP, future R&D efforts need to be broadened to a new research paradigm, one that explicitly incorporates characteristics that are crucial for POCRC. We present our viewpoint about 4 such interrelated characteristics that can increase NLP systems' suitability for POCRC (3 that represent NLP system properties and 1 associated with the R&D process): (1) interpretability (the ability to explain system decisions), (2) patient centeredness (the capability to characterize diverse patients), (3) customizability (the flexibility for adapting to distinct settings, problems, and cohorts), and (4) multitask evaluation (the validation of system performance based on multiple tasks involving heterogeneous data sets). By using the NLP task of clinical concept detection as an example, we detail these characteristics and discuss how they may result in the increased uptake of NLP systems for POCRC.
Article
The practice of medicine is changing rapidly as a consequence of electronic health record adoption, new technologies for patient care, disruptive innovations that break down professional hierarchies, and evolving societal norms. Collectively, these have resulted in the modification of the physician's role as the gatekeeper for health care, increased shift-based care, and amplified interprofessional team-based care. Technological innovations present opportunities as well as challenges. Artificial intelligence, which has great potential, has already transformed some tasks, particularly those involving image interpretation. Ubiquitous access to information via the internet by physicians and patients alike presents benefits as well as drawbacks: patients and providers have ready access to virtually all of human knowledge, but some websites are contaminated with misinformation and many people have difficulty differentiating between solid, evidence-based data and untruths. The role of the future physician will shift as complexity in health care increases and as artificial intelligence and other technologies advance. These technological advances demand new skills of physicians; memory and knowledge accumulation will diminish in importance while information management skills will become more important. In parallel, medical educators must enhance their teaching and assessment of critical human skills (e.g., clear communication, empathy) in the delivery of patient care. The authors emphasize the enduring role of critical human skills in safe and effective patient care even as medical practice is increasingly guided by artificial intelligence and related technology, and they suggest new and longitudinal ways of assessing essential non-cognitive skills to meet the demands of the future. The authors envision practical and achievable benefits accruing to patients and providers if practitioners leverage technological advancements to facilitate the development of their critical human skills.
Conference Paper
Full-text available
We propose a new shared task on grading student answers with the goal of enabling well-targeted and flexible feedback in a tutorial dialogue setting. We provide an annotated corpus designed for the purpose, a precise specification for a prediction task and an associated evaluation methodology. The task is feasible but non-trivial, which is demonstrated by creating and comparing three alternative baseline systems. We believe that this corpus will be of interest to the researchers working in textual entailment and will stimulate new developments both in natural language processing in tutorial dialogue systems and textual entailment, contradiction detection and other techniques of interest for a variety of computational linguistics tasks.
Article
Full-text available
Narrative reports in medical records contain a wealth of information that may augment structured data for managing patient information and predicting trends in diseases. Pertinent negatives are evident in text but are not usually indexed in structured databases. The objective of the study reported here was to test a simple algorithm for determining whether a finding or disease mentioned within narrative medical reports is present or absent. We developed a simple regular expression algorithm called NegEx that implements several phrases indicating negation, filters out sentences containing phrases that falsely appear to be negation phrases, and limits the scope of the negation phrases. We compared NegEx against a baseline algorithm that has a limited set of negation phrases and a simpler notion of scope. In a test of 1235 findings and diseases in 1000 sentences taken from discharge summaries indexed by physicians, NegEx had a specificity of 94.5% (versus 85.3% for the baseline), a positive predictive value of 84.5% (versus 68.4% for the baseline) while maintaining a reasonable sensitivity of 77.8% (versus 88.3% for the baseline). We conclude that with little implementation effort a simple regular expression algorithm for determining whether a finding or disease is absent can identify a large portion of the pertinent negatives from discharge summaries.
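The core NegEx idea, trigger phrases plus a bounded scope, can be sketched in a few lines; the trigger and terminator lists below are illustrative stand-ins, not the published phrase lists, and the real algorithm additionally handles post-condition triggers and pseudo-negations.

import re

NEG_RE = re.compile(r"\b(no|denies|without|absence of|negative for)\b", re.I)
TERMINATORS = {"but", "however", "although"}  # words that end the negation scope
SCOPE = 5                                     # max tokens covered after a trigger

def negated_findings(sentence, findings):
    # Return the findings that fall inside the scope of a negation trigger.
    negated = set()
    for m in NEG_RE.finditer(sentence):
        window = []
        for tok in sentence[m.end():].split()[:SCOPE]:
            if tok.lower().strip(",.") in TERMINATORS:
                break
            window.append(tok)
        text = " ".join(window).lower()
        negated.update(f for f in findings if f.lower() in text)
    return negated

print(negated_findings("The patient denies chest pain but reports nausea.",
                       ["chest pain", "nausea"]))  # -> {'chest pain'}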
Conference Paper
Full-text available
Traditionally, automatic marking has been restricted to item types such as multiple choice that narrowly constrain how students may respond. More open ended items have generally been considered unsuitable for machine marking because of the difficulty of coping with the myriad ways in which credit-worthy answers may be expressed. Successful automatic marking of free text answers would seem to presuppose an advanced level of performance in automated natural language understanding. However, recent advances in computational linguistics techniques have opened up the possibility of being able to automate the marking of free text responses typed into a computer without having to create systems that fully understand the answers. This paper describes the use of information extraction and machine learning techniques in the marking of short, free text responses of up to around five lines.
Article
Full-text available
Introduced the statistic kappa to measure nominal scale agreement between a fixed pair of raters. Kappa was generalized to the case where each of a sample of 30 patients was rated on a nominal scale by the same number of psychiatrist raters (n = 6), but where the raters rating one subject were not necessarily the same as those rating another. Large-sample standard errors were derived.
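A compact version of the generalized (Fleiss) kappa described above can be computed from an N-subjects by k-categories matrix of rating counts; the implementation and toy counts below are illustrative only.

import numpy as np

def fleiss_kappa(counts):
    # counts[i, j] = number of raters assigning subject i to category j.
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                    # raters per subject (constant)
    p_j = counts.sum(axis=0) / counts.sum()      # overall category proportions
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# 4 subjects, 6 raters, 3 categories (toy data).
ratings = [[6, 0, 0],
           [3, 3, 0],
           [1, 4, 1],
           [0, 0, 6]]
print(round(fleiss_kappa(ratings), 3))  # -> 0.543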
Article
Full-text available
We present an approach to Computer-Assisted Assessment of free-text material based on symbolic analysis of student input. The theory that underlies this approach arises from previous work on DidaLect, a tutorial system for reading comprehension in French as a Second Language. The theory enables processing of a free-text segment for assessment to operate without precoded reference material. A study based on a small collection of student answers to several types of questions has justified our approach, and helped to define a methodology and design a prototype.
Article
Full-text available
Automated essay scoring with latent semantic analysis (LSA) has recently been subject to increasing interest. Although previous authors have achieved grade ranges similar to those awarded by humans, it is still not clear which and how parameters improve or decrease the effectiveness of LSA. This paper presents an analysis of the effects of these parameters, such as text pre-processing, weighting, singular value dimensionality and type of similarity measure, and benchmarks this effectiveness by comparing machine-assigned with human-assigned scores in a real-world case. We show that each of the identified factors significantly influences the quality of automated essay scoring and that the factors are not independent of each other.
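A minimal LSA scoring pipeline of the kind analysed above can be assembled from TF-IDF weighting, a truncated SVD, and cosine similarity to pre-scored essays; the corpus, dimensionality, and nearest-exemplar scoring rule below are illustrative assumptions, not the paper's configuration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import make_pipeline

graded_essays = ["Photosynthesis converts light energy into chemical energy.",
                 "Plants use sunlight to make glucose from carbon dioxide.",
                 "The water cycle moves water between earth and atmosphere.",
                 "Evaporation and condensation drive the water cycle."]
grades = [5, 4, 3, 2]

# TF-IDF weighting followed by a low-rank (latent semantic) projection.
lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2, random_state=0))
graded_vecs = lsa.fit_transform(graded_essays)

new_essay = "Sunlight lets plants turn carbon dioxide into glucose."
sims = cosine_similarity(lsa.transform([new_essay]), graded_vecs)[0]
print("predicted grade:", grades[sims.argmax()])  # grade of the most similar exemplar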
Article
Full-text available
This paper describes and exemplifies an application of AutoMark, a software system developed in pursuit of robust computerised marking of free-text answers to open-ended questions. AutoMark employs the techniques of Information Extraction to provide computerised marking of short free-text responses. The system incorporates a number of processing modules specifically aimed at providing robust marking in the face of errors in spelling, typing, syntax, and semantics. AutoMark looks for specific content within free-text answers, the content being specified in the form of a number of mark scheme templates. Each template represents one form of a valid (or a specifically invalid) answer. Student answers are first parsed, and then intelligently matched against each mark scheme template, and a mark for each answer is computed. The representation of the templates is such that they can be robustly mapped to multiple variations in the input text. The current paper describes AutoMark for the first time, and presents the results of a brief quantitative and qualitative study of the performance of the system in marking a range of free-text responses in one of the most demanding domains: statutory national curriculum assessment of science for pupils at age 11. This particular domain has been chosen to help identify the strengths and weaknesses of the current system in marking responses where errors in spelling, syntax, and semantics are at their most frequent. Four items of varying degrees of open-endedness were selected from the 1999 tests. These items are drawn from the real-world of so-called ‘high stakes’ testing experienced by cohorts of over half a million pupils in England each year since 1995 at ages 11 and 14. A quantitative and qualitative study of the performance of the system is provided, together with a discussion of the potential for further development in reducing these errors. The aim of this exploration is to reveal some of the issues which need to be addressed if computerised marking is to play any kind of reliable role in the future development of such test regimes.
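AutoMark's parse-based template matching is considerably richer than can be shown here; as a toy stand-in, the sketch below awards the marks of the closest mark-scheme template using simple fuzzy string similarity (difflib), with the templates, threshold, and answers all invented for illustration.

from difflib import SequenceMatcher

# Mark-scheme "templates" mapped to the marks they award (illustrative content).
TEMPLATES = {"the water evaporates": 1,
             "the liquid turns into a gas": 1}

def mark(answer, threshold=0.6):
    # Award the marks of the most similar template if it is similar enough.
    best_t, best_r = max(((t, SequenceMatcher(None, answer.lower(), t).ratio())
                          for t in TEMPLATES), key=lambda x: x[1])
    return TEMPLATES[best_t] if best_r >= threshold else 0

print(mark("the watter evaporats"))  # a misspelled answer can still earn credit
print(mark("it gets colder"))        # an off-template answer earns 0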
Article
Constructed response items can both measure the coherence of student ideas and serve as reflective experiences to strengthen instruction. We report on new automated scoring technologies that can reduce the cost and complexity of scoring constructed-response items. This study explored the accuracy of c-rater-ML, an automated scoring engine developed by Educational Testing Service, for scoring eight science inquiry items that require students to use evidence to explain complex phenomena. Automated scoring showed satisfactory agreement with human scoring for all test takers as well as specific subgroups. These findings suggest that c-rater-ML offers a promising solution to scoring constructed-response science items and has the potential to increase the use of these items in both instruction and assessment.
Article
Content-based automated scoring has been applied in a variety of science domains. However, many prior applications involved simplified scoring rubrics without considering rubrics representing multiple levels of understanding. This study tested a concept-based scoring tool for content-based scoring, c-rater™, for four science items with rubrics aiming to differentiate among multiple levels of understanding. The items showed moderate to good agreement with human scores. The findings suggest that automated scoring has the potential to score constructed-response items with complex scoring rubrics, but in its current design cannot replace human raters. This article discusses sources of disagreement and factors that could potentially improve the accuracy of concept-based automated scoring.
Article
C-rater is an automated scoring engine that has been developed to score responses to content-based short answer questions. It is not simply a string matching program – instead it uses predicate argument structure, pronominal reference, morphological analysis and synonyms to assign full or partial credit to a short answer question. C-rater has been used in two studies: National Assessment for Educational Progress (NAEP) and a statewide assessment in Indiana. In both studies, c-rater agreed with human graders about 84% of the time.
Article
A framework for evaluation and use of automated scoring of constructed-response tasks is provided that entails both evaluation of automated scoring as well as guidelines for implementation and maintenance in the context of constantly evolving technologies. Consideration of validity issues and challenges associated with automated scoring are discussed within the framework. The fit between the scoring capability and the assessment purpose, the agreement between human and automated scores, the consideration of associations with independent measures, the generalizability of automated scores as implemented in operational practice across different tasks and test forms, and the impact and consequences for the population and subgroups are proffered as integral evidence supporting use of automated scoring. Specific evaluation guidelines are provided for using automated scoring to complement human scoring for tests used for high-stakes purposes. These guidelines are intended to be generalizable to new automated scoring systems and as existing systems change over time.
Article
This paper presents a description and evaluation of SpeechRater(SM), a system for automated scoring of non-native speakers’ spoken English proficiency, based on tasks which elicit spontaneous monologues on particular topics. This system builds on much previous work in the automated scoring of test responses, but differs from previous work in that the highly unpredictable nature of the responses to this task type makes the challenge of accurate scoring much more difficult. SpeechRater uses a three-stage architecture. Responses are first processed by a filtering model to ensure that no exceptional conditions exist which might prevent them from being scored by SpeechRater. Responses not filtered out at this stage are then processed by the scoring model to estimate the proficiency rating which a human might assign to them, on the basis of features related to fluency, pronunciation, vocabulary diversity, and grammar. Finally, an aggregation model combines an examinee’s scores for multiple items to calculate a total score, as well as an interval in which the examinee’s score is predicted to reside with high confidence. SpeechRater’s current level of accuracy and construct representation have been deemed sufficient for low-stakes practice exercises, and it has been used in a practice exam for the TOEFL since late 2006. In such a practice environment, it offers a number of advantages compared to human raters, including system load management, and the facilitation of immediate feedback to students. However, it must be acknowledged that SpeechRater presently fails to measure many important aspects of speaking proficiency (such as intonation and appropriateness of topic development), and its agreement with human ratings of proficiency does not yet approach the level of agreement between two human raters.
Conference Paper
Automatic content scoring for free-text responses has started to emerge as an application of Natural Language Processing in its own right, much like question answering or machine translation. The task, in general, is reduced to comparing a student's answer to a model answer. Although a considerable amount of work has been done, common benchmarks and evaluation measures for this application do not currently exist. It is as yet impossible to perform a comparative evaluation or progress tracking of this application across systems - an application that we view as a textual entailment task. This paper concentrates on introducing an Educational Testing Service-built test suite that takes a step towards establishing such a benchmark. The suite can be used for regression and performance evaluations, both within c-rater® and across automatic content scoring technologies. It is important to note that existing textual entailment test suites like PASCAL RTE or FraCas, though beneficial, are not suitable for our purposes since we deal with atypical naturally-occurring student responses that need to be categorized in order to serve as regression test cases.
Article
In this paper we describe an algorithm called ConText for determining whether clinical conditions mentioned in clinical reports are negated, hypothetical, historical, or experienced by someone other than the patient. The algorithm infers the status of a condition with regard to these properties from simple lexical clues occurring in the context of the condition. The discussion and evaluation of the algorithm presented in this paper address the questions of whether a simple surface-based approach which has been shown to work well for negation can be successfully transferred to other contextual properties of clinical conditions, and to what extent this approach is portable among different clinical report types. In our study we find that ConText obtains reasonable to good performance for negated, historical, and hypothetical conditions across all report types that contain such conditions. Conditions experienced by someone other than the patient are very rarely found in our report set. A comprehensive solution to the problem of determining whether a clinical condition is historical or recent requires knowledge above and beyond the surface clues picked up by ConText.
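In the same spirit as the NegEx sketch earlier, the toy below extends trigger matching to the contextual properties ConText targets (negated, historical, hypothetical); the trigger lists, left-of-mention scope, and naive substring matching are illustrative simplifications, not the published algorithm.

# Trigger phrases per contextual property (illustrative, not the ConText lists).
TRIGGERS = {
    "negated":      ["no ", "denies", "without"],
    "historical":   ["history of", "previous", "prior"],
    "hypothetical": ["if ", "should", "return if"],
}

def context_properties(sentence, condition):
    # Assign every property whose trigger occurs before the condition mention.
    sent = sentence.lower()
    pos = sent.find(condition.lower())
    if pos < 0:
        return set()
    before = sent[:pos]
    return {prop for prop, words in TRIGGERS.items()
            if any(w in before for w in words)}

print(context_properties("Return if chest pain recurs", "chest pain"))
# -> {'hypothetical'}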
Article
An essay-based discourse analysis system can help students improve their writing by identifying relevant essay-based discourse elements in their essays. Our discourse analysis software, which is embedded in Criterion, an online essay evaluation application, uses machine learning to identify discourse elements in student essays. The system makes decisions that exemplify how teachers perform this task. For instance, when grading student essays, teachers comment on the discourse structure. Teachers might explicitly state that the essay lacks a thesis statement or that an essay's single main idea has insufficient support. Training the systems to model this behavior requires human judges to annotate a data sample of student essays. The annotation schema reflects the highly structured discourse of genres such as persuasive writing. Our discourse analysis system uses a voting algorithm that takes into account the discourse labeling decisions of three independent systems.
A tale of two models: psychometric and cognitive perspectives on rater-mediated assessments using accuracy ratings
  • Engelhard
Automated writing evaluation: an expanding body of knowledge
  • Shermis