Ehud Reiter’s research while affiliated with University of Aberdeen and other places

Publications (229)


Figures: a BN for the lung cancer problem described in [9]; visualization of necessary and sufficient conditions of BN reusability ("PD" stands for probability distribution); literature selection diagram; analysis of the sources used to develop the different parts of the BNs introduced in the surveyed papers, with statistics of BN validation (only upper-diagonal values of the co-occurrence matrix shown for ease of reading); analysis of the reusability of certain parts of the BNs and of their technical reusability ("ProbDist" stands for probability distribution).

Reusability of Bayesian Networks case studies: a survey
  • Article
  • Full-text available

February 2025 · Applied Intelligence · 35 Reads

Ehud Reiter

Bayesian Networks (BNs) are probabilistic graphical models used to represent variables and their conditional dependencies, making them highly valuable in a wide range of fields, such as radiology, agriculture, neuroscience, construction management, medicine, and engineering systems, among many others. Despite their widespread application, the reusability of BNs presented in papers that describe their application to real-world tasks has not been thoroughly examined. In this paper, we perform a structured survey on the reusability of BNs using the PRISMA methodology, analyzing 147 papers from various domains. Our results indicate that only 18% of the papers provide sufficient information to enable the reusability of the described BNs. This creates significant challenges for other researchers attempting to reuse these models, especially since many BNs are developed using expert knowledge elicitation. Additionally, direct requests to authors for reusable BNs yielded positive results in only 12% of cases. These findings underscore the importance of improving reusability and reproducibility practices within the BN research community, a need that is equally relevant across the broader field of Artificial Intelligence.


The role of natural language processing in cancer care: a systematic scoping review with narrative synthesis

November 2024 · 14 Reads

Objectives: To review studies of natural language processing (NLP) systems that assist in cancer care, explore use cases, and summarize current research progress. Methods: A systematic scoping review, searching six databases: (1) MEDLINE, (2) Embase, (3) IEEE Xplore, (4) ACM Digital Library, (5) Web of Science, and (6) ACL Anthology. Studies were included if they reported NLP systems used to improve cancer management by patients or clinicians. Studies were synthesized descriptively and using content analysis. Results: Twenty-nine studies were included. Studies mainly applied NLP to mixed cancer types (n=10, 34.48%) and breast cancer (n=8, 27.59%). NLP was used in four main ways: (1) to support patient education and self-management; (2) to improve efficiency in clinical care by summarizing, extracting, and categorizing data and supporting record-keeping; (3) to support prevention and early detection of patient problems or cancer recurrence; and (4) to improve cancer treatment by supporting clinicians in making evidence-based treatment decisions. The studies highlighted a wide variety of use cases for NLP technologies in cancer care. However, few technologies had been evaluated within clinical settings, none were evaluated against clinical outcomes, and none had been implemented into clinical care. Conclusion: NLP has the potential to improve cancer care via several mechanisms, including information extraction and classification, which could enable automation and personalization of care processes. NLP tools such as chatbots also show promise in improving patient communication and support. However, evaluation remains limited and clinical integration poses challenges. Interdisciplinary collaboration between computer scientists and clinicians will be essential if NLP technologies are to fulfil their potential to improve patient experience and outcomes. Registered protocol: https://doi.org/10.17605/OSF.IO/G9DSR


Explaining Bayesian Networks in Natural Language using Factor Arguments. Evaluation in the medical domain

October 2024 · 28 Reads

In this paper, we propose a model for building natural language explanations of Bayesian Network reasoning in terms of factor arguments: argumentation graphs of flowing evidence that relate the observed evidence to a target variable we want to learn about. We introduce the notion of factor argument independence to address the outstanding question of when arguments should be presented jointly or separately, and present an algorithm that, starting from the evidence nodes and a target node, produces a list of all independent factor arguments ordered by their strength. Finally, we implemented a scheme to build natural language explanations of Bayesian reasoning using this approach. Our proposal has been validated in the medical domain through a human-driven evaluation study comparing the factor-argument explanations with an alternative explanation method. Evaluation results indicate that users found our explanations significantly more useful for understanding Bayesian Network reasoning than the existing method they were compared against.
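To make the enumeration step concrete, below is a minimal Python sketch that lists candidate arguments as evidence-to-target paths in a toy network and orders them by a strength score. The graph and the `strength` function are illustrative assumptions only; the paper's probabilistic strength measure and its factor argument independence test are not reproduced here.

```python
# Minimal sketch: candidate factor arguments as evidence-to-target paths
# in a toy BN skeleton. The skeleton is undirected and CPTs are omitted;
# "strength" is a placeholder, NOT the paper's probabilistic measure.
GRAPH = {
    "Smoking":    ["LungCancer"],
    "LungCancer": ["Smoking", "XRay", "Dyspnoea"],
    "XRay":       ["LungCancer"],
    "Dyspnoea":   ["LungCancer"],
}

def paths(graph, node, target, seen=()):
    """Yield all simple paths from node to target (candidate arguments)."""
    if node == target:
        yield (*seen, node)
        return
    for nxt in graph[node]:
        if nxt not in seen:
            yield from paths(graph, nxt, target, (*seen, node))

def strength(path):
    # Placeholder score: shorter evidence-to-target paths rank higher.
    return 1.0 / len(path)

def factor_arguments(graph, evidence, target):
    """List candidate arguments from each evidence node, strongest first.
    The paper's independence check between arguments is omitted here."""
    args = [p for ev in evidence for p in paths(graph, ev, target)]
    return sorted(args, key=strength, reverse=True)

for arg in factor_arguments(GRAPH, ["Smoking", "XRay"], "Dyspnoea"):
    print(" -> ".join(arg))
```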


Figures: a BN for the lung cancer problem (conditional probability tables encoding the probabilistic relations between variables not shown); SHD between structures generated by "LLM experts" of all types for all BNs in the study; results of the data-contamination experiments.
Scalability of Bayesian Network Structure Elicitation with Large Language Models: a Novel Methodology and Comparative Analysis

July 2024 · 7 Reads

In this work, we propose a novel method for Bayesian Network (BN) structure elicitation that initializes several LLMs with different experiences, queries them independently to create a structure for the BN, and obtains the final structure by majority voting. We compare the method with one alternative method on various BNs of different sizes, both widely known and lesser known, and study the scalability of both methods on them. We also propose an approach to check for contamination of BNs in an LLM, which shows that some widely known BNs are unsuitable for testing LLM-based BN structure elicitation. We further show that some BNs may be unsuitable for such experiments because their node names are indistinguishable. Experiments on the remaining BNs show that our method outperforms the existing method with one of the three studied LLMs; however, the performance of both methods decreases significantly as BN size increases.
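The voting step is easy to illustrate. The sketch below assumes each "LLM expert" has already been queried and its answer parsed into a set of directed edges; the edge sets are invented, and the strict-majority quorum is one plausible reading of "majority voting", not necessarily the authors' exact rule.

```python
from collections import Counter

# Hypothetical edge sets returned by three differently-initialized LLM
# "experts" for a toy lung-cancer network (in practice these would be
# parsed from LLM output).
expert_votes = [
    {("Smoking", "LungCancer"), ("LungCancer", "XRay"), ("LungCancer", "Dyspnoea")},
    {("Smoking", "LungCancer"), ("LungCancer", "XRay")},
    {("Smoking", "LungCancer"), ("LungCancer", "Dyspnoea"), ("XRay", "Dyspnoea")},
]

def majority_structure(votes):
    """Keep every edge proposed by a strict majority of experts."""
    counts = Counter(edge for vote in votes for edge in vote)
    quorum = len(votes) / 2
    return {edge for edge, n in counts.items() if n > quorum}

print(sorted(majority_structure(expert_votes)))
# [('LungCancer', 'Dyspnoea'), ('LungCancer', 'XRay'), ('Smoking', 'LungCancer')]
```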


Table: results of annotations by lay people. The statistics are counts of annotated responses in each scenario; problems found and Likert ratings are average values.
Effectiveness of ChatGPT in explaining complex medical reports to patients

June 2024 · 34 Reads

Mengxuan Sun · Ehud Reiter · Anne E Kiltie · [...]

Electronic health records contain detailed information about the medical condition of patients, but they are difficult for patients to understand even if they have access to them. We explore whether ChatGPT (GPT-4) can help explain multidisciplinary team (MDT) reports to colorectal and prostate cancer patients. These reports are written in dense medical language and assume clinical knowledge, so they are a good test of ChatGPT's ability to explain complex medical reports to patients. We asked clinicians and lay people (not patients) to review ChatGPT's explanations and responses. We also ran three focus groups (including cancer patients, caregivers, computer scientists, and clinicians) to discuss ChatGPT's output. Our studies highlighted issues with inaccurate information, inappropriate language, limited personalization, AI distrust, and challenges in integrating large language models (LLMs) into clinical workflows. These issues will need to be resolved before LLMs can be used to explain complex personal medical information to patients.
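For readers wanting to reproduce the general setup, the sketch below shows how an MDT report might be sent to GPT-4 via the official openai Python SDK. The system message and prompt wording are assumptions for illustration, not the study's actual protocol.

```python
from openai import OpenAI  # assumes the official openai>=1.0 Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MDT_REPORT = "..."  # a de-identified MDT report would go here

# Illustrative prompt only: this wording is an assumption, not the
# instructions used in the study.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You explain hospital reports to patients in plain, "
                    "non-alarming language, without adding medical advice."},
        {"role": "user",
         "content": f"Please explain this MDT report to the patient:\n{MDT_REPORT}"},
    ],
)
print(response.choices[0].message.content)
```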


Figures and tables:
  • Example texts (partial basketball summaries) used to illustrate flaws in running experiments.
  • Buggy code for recording error annotations: the code goes through the data in Table 2, building a 2-level dictionary keyed by (game, sentence) from the annotation data, which is used for further analysis, including counting errors. It fails when a sentence contains more than one error; in such cases only the last error is recorded. The code should use a 3-level dictionary keyed by (game, sentence, tok_start); corrected lines are included as code comments in the figure, and a reconstruction is sketched after the abstract below.
  • Item combinations created for an experiment by a flawed script: -sys3 data/sys2.txt should be -sys3 data/sys3.txt.
  • Bad interface design: the participant is asked to determine the number of sentences in Figure 1 and then input a comma-separated list of integer scores for the fluency, coherence, and informativeness of each sentence, with the first number corresponding to the first sentence, the second to the second, and so on.
  • Number of flaws of each type per anonymised experiment: Paper F was not selected for Phase 1 but is included as it makes a good example (see MAD exclusion in Section 3.3).
Common Flaws in Running Human Evaluation Experiments in NLP

June 2024 · 15 Reads · 4 Citations

While conducting a coordinated set of repeat runs of human evaluation experiments in NLP, we discovered flaws in every single experiment we selected for inclusion via a systematic process. In this squib, we describe the types of flaws we discovered, which include coding errors (e.g., loading the wrong system outputs to evaluate), failure to follow standard scientific practice (e.g., ad hoc exclusion of participants and responses), and mistakes in reported numerical results (e.g., reported numbers not matching experimental data). If these problems are widespread, it would have worrying implications for the rigor of NLP evaluation experiments as currently conducted. We discuss what researchers can do to reduce the occurrence of such flaws, including pre-registration, better code development practices, increased testing and piloting, and post-publication addressing of errors.
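The annotation-recording bug described in the figure caption above can be reconstructed in a few lines. This is a hedged reconstruction, not the paper's actual code; the field names (game_id, sent_id, tok_start) are assumptions.

```python
# Hedged reconstruction of the dictionary bug: two errors in one sentence.
annotations = [
    {"game_id": "g1", "sent_id": 3, "tok_start": 0,  "error": "NAME"},
    {"game_id": "g1", "sent_id": 3, "tok_start": 12, "error": "NUMBER"},  # same sentence
]

# Buggy version: a 2-level dictionary keyed by (game, sentence) silently
# overwrites the first error with the second when both hit the same sentence.
buggy = {}
for a in annotations:
    buggy.setdefault(a["game_id"], {})[a["sent_id"]] = a["error"]

# Fixed version: keying additionally by token offset keeps co-located errors.
fixed = {}
for a in annotations:
    fixed.setdefault(a["game_id"], {}) \
         .setdefault(a["sent_id"], {})[a["tok_start"]] = a["error"]

n_buggy = sum(len(sents) for sents in buggy.values())
n_fixed = sum(len(errs) for sents in fixed.values() for errs in sents.values())
print(n_buggy, n_fixed)  # -> 1 2: the buggy count loses one of the two errors
```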


PhilHumans: Benchmarking Machine Learning for Personal Health

May 2024 · 16 Reads

The use of machine learning in healthcare has the potential to improve patient outcomes as well as broaden the reach and affordability of healthcare. The history of other application areas indicates that strong benchmarks are essential for the development of intelligent systems. We present Personal Health Interfaces Leveraging HUman-MAchine Natural interactions (PhilHumans), a holistic suite of benchmarks for machine learning across different healthcare settings (talk therapy, diet coaching, emergency care, intensive care, obstetric sonography) as well as different learning settings, such as action anticipation, time-series modeling, insight mining, language modeling, computer vision, reinforcement learning, and program synthesis.


Feedback-Driven Insight Generation and Recommendation for Health Self-Management

March 2024 · 26 Reads

Purpose: This study investigates the impact of personalized health insights generated from wearable-device data on users' health behaviors. The primary objective is to assess whether user-feedback-driven algorithms enhance the relevance and effectiveness of health insights, ultimately encouraging positive changes in users' daily activities. Methods: A two-month field study was conducted with 25 healthy volunteers using Mi Band 6 wearable devices. Participants were divided into test and control groups; the test group received personalized insights recommended by a neural-network-based algorithm fine-tuned on user feedback. The data collected included health parameters such as calories burned, step count, heart rate, heart minutes, active minutes, sleep duration, sleep time, and sleep segments. Insights were delivered through a Telegram chatbot, and user feedback was collected through a rating system. Results: The test group, whose insight recommendations incorporated user feedback, showed a significant improvement in daily activity compared to the control group. The relevance of the insights over time, as evidenced by feedback regression trends, increased notably in the test group. Additional analyses explored the relationship between insight delivery timing, user feedback, and delays, shedding light on user engagement patterns. Conclusion: This research highlights the effectiveness of personalized health insights generated from wearable data in positively influencing user health behaviors. Incorporating user feedback into recommendation algorithms greatly enhances the relevance and effectiveness of insights, encouraging behavioral improvements. The results emphasize the significance of timing when delivering insights and propose areas for future investigation, such as using Graph Neural Networks to improve recommendation systems. Overall, personalized insights from wearables can empower individuals to manage their health and well-being effectively.
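As a simplified stand-in for the feedback loop described above (the study uses a neural recommender fine-tuned on ratings), the sketch below re-ranks candidate insights by a running average of per-category user ratings. All names, categories, and data are illustrative assumptions.

```python
# Simplified stand-in for feedback-driven insight recommendation: a running
# per-category average of 1-5 user ratings re-ranks candidate insights.
from collections import defaultdict

ratings = defaultdict(list)  # insight category -> list of user ratings

def record_feedback(category, rating):
    ratings[category].append(rating)

def score(category, prior=3.0):
    """Mean observed rating, falling back to a neutral prior when unseen."""
    r = ratings[category]
    return sum(r) / len(r) if r else prior

def recommend(candidates, k=1):
    """Return the k candidate insights whose categories users rated highest."""
    return sorted(candidates, key=lambda c: score(c["category"]), reverse=True)[:k]

record_feedback("sleep", 5)
record_feedback("steps", 2)
candidates = [
    {"category": "steps", "text": "You walked 20% less than last week."},
    {"category": "sleep", "text": "Your average sleep rose to 7.4 h."},
]
print(recommend(candidates))  # the well-rated sleep insight ranks first
```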


Evaluation of Human-Understandability of Global Model Explanations Using Decision Tree

January 2024 · Communications in Computer and Information Science · 15 Reads · 2 Citations

In explainable artificial intelligence (XAI) research, the predominant focus has been on interpreting models for experts and practitioners. Model-agnostic and local explanation approaches are deemed interpretable and sufficient in many applications. However, in domains like healthcare, where end users are patients without AI or domain expertise, there is an urgent need for model explanations that are more comprehensible and instil trust in the model's operations. We hypothesise that model explanations that are narrative, patient-specific, and global (holistic of the model) would enable better understandability and decision-making. We test this using a decision tree model to generate both local and global explanations for patients identified as having a high risk of coronary heart disease, and present these explanations to non-expert users. We find strong individual preferences for a specific type of explanation: the majority of participants prefer global explanations, while a smaller group prefers local explanations. A task-based evaluation of these participants' mental models provides valuable feedback for enhancing narrative global explanations. This, in turn, guides the design of health informatics systems that are both trustworthy and actionable.
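The global/local distinction maps directly onto standard tooling. The scikit-learn sketch below renders a whole decision tree as readable rules (a global explanation) and extracts the single root-to-leaf path used for one patient (a local explanation). The features and data are toy assumptions, not the study's coronary heart disease dataset.

```python
# Toy illustration of global vs. local decision tree explanations.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[45, 120], [60, 160], [50, 130], [70, 170], [40, 110], [65, 155]]
y = [0, 1, 0, 1, 0, 1]  # 1 = high coronary heart disease risk (invented labels)
features = ["age", "systolic_bp"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Global explanation: the whole model rendered as human-readable rules.
print(export_text(tree, feature_names=features))

# Local explanation: the decision path traversed for one specific patient.
patient = [[62, 158]]
node_index = tree.decision_path(patient).indices
print("decision path nodes for this patient:", list(node_index))
```

Narrating either output in patient-friendly language is the step the paper evaluates; the code above only produces the raw rule and path structures.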



Citations (69)


... This can result in nutritional deficiencies, worsening medical diseases, and disordered eating behaviors. AI cannot replace human judgment or the sophisticated understanding and empathy of dietitians or nutritionists (Balloccu et al., 2024). It may also neglect or misunderstand health concerns, leading to personalized treatment and socio-psychological dimensions of diet. ...

Reference:

AI-Driven Personalized Nutrition Apps and Platforms for Enhanced Diet and Wellness
Ask the experts: sourcing a high-quality nutrition counseling dataset through Human-AI collaboration

... Some works try to employ LLMs in table-related tasks, such as TableQA (Zhang et al., 2024b), table entity matching , and table-to-text generation (Sundararajan et al., 2024). The survey (Lu et al., 2024) summarizes the application of LLMs on tabular data. ...

Improving Factual Accuracy of Neural Table-to-Text Output by Addressing Input Problems in ToTTo

... FMs (Bommasani et al., 2021) are a family of deep models that are pretrained on vast amounts of data, and have caused a paradigm shift due to their unprecedented capabilities for zero-shot and few-shot generalization. FMs have revolutionized natural language processing (Brown et al., 2020;BigScience Workshop et al., 2023;Wu et al., 2024;Dubey et al., 2024) and computer vision (Radford et al., 2021;Kirillov et al., 2023). The availability of large-scale time series datasets has opened the door to pretrain a large model on time series data. ...

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

... Machine learning algorithms for classification can be divided into two categories depending on whether the final result of the machine learning is in a form that is easy for humans to read or not. For example, the trained results of deep learning algorithms are very difficult to understand [2], while the trained results of decision trees are easy to understand unless the trees are very large [3]. So, we want to generate some easy-to-understand machine learning results because of the nature of the original data. ...

Evaluation of Human-Understandability of Global Model Explanations Using Decision Tree

Communications in Computer and Information Science

... However, unless those reproducing the results check and understand the code, the results might not reflect a true finding, even in a restricted sense where findings only apply for the selected datasets. The code may not implement the methodology or experiment described in the article (Raff and Farris 2023), or the code or the ancillary software may contain errors (Thomson, Reiter, and Belz 2024). Sharing selfcontained and executable artifacts will simplify the verification that running the code produces the reported result, but is not sufficient for reproducibility; repeatability is not reproducibility. ...

Common Flaws in Running Human Evaluation Experiments in NLP

... Recently there has been a promising surge of interest in reproducibility of NLP models, supported by challenges (Pineau et al., 2021), shared tasks (Belz et al., 2020), conference tracks (Carpuat et al., 2022), and even the Reality Check theme at this conference. The outcome of this surge in interest has been a flurry of reproducibility studies and related investigations (Belz et al., 2022a;Arvan et al., 2022a;Chen et al., 2022b). ...

ReproGen: Proposal for a Shared Task on Reproducibility of Human Evaluations in NLG
  • Citing Conference Paper
  • January 2020

... Li et al. (2022)'s survey paper summarizes the different evaluation and mitigation techniques which can be used to address faithfulness in NLG. Evaluation has given rise to several recent shared tasks such as Gehrmann et al. (2021) and Thomson and Reiter (2021). Multiple papers try to improve the reporting of models' mistakes by giving guidelines to avoid under-reporting of errors (van Miltenburg et al., 2021) or to provide standardized definitions of model errors (Howcroft et al., 2020;Belz et al., 2020). ...

Generation Challenges: Results of the Accuracy Evaluation Shared Task
  • Citing Conference Paper
  • January 2021

... In terms of evaluation, we consider three evaluation methods: (1) automatic evaluation, where we report multiple evaluation metrics additional to the commonly used variants of ROUGE (Lin 2004) in summarization tasks (Liang et al. 2022;Bai, Gao, and Huang 2021;Cao, Liu, and Wan 2020;Peng et al. 2021) since the performance of individual metrics may vary across datasets, challenging their reliability (Bhandari et al. 2020;Fabbri et al. 2021); (2) human evaluation, which reflects the actual quality of summaries according to human judgments and functions as a source of reliability measurement for automatic evaluation metrics; and (3) ChatGPT evaluation, where we examine the potential of ChatGPT as an alternative to human annotators given the same instruction considering the high cost of human evaluation (Gao et al. 2023) and issues of reproducibility (Belz et al. 2021;Belz, Thomson, and Reiter 2023;Chen, Belouadi, and Eger 2022). Our contributions are: ...

The ReproGen Shared Task on Reproducibility of Human Evaluations in NLG: Overview and Results
  • Citing Conference Paper
  • January 2021

... For precise and fine-grained evaluation of model outputs, it is necessary to identify these errors on the level of word spans. There are two major ways to collect the span annotations: using either human (Thomson and Reiter, 2020) or LLM-based annotators (Kocmi and Federmann, 2023;Kasner and Dušek, 2024). ...

A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems
  • Citing Conference Paper
  • January 2020