Article

Use of artificial intelligence in family medicine publications: Joint statement from journal editors

Article
Full-text available
Academic journals, archives, and repositories are seeing an increasing number of questionable research papers clearly produced using generative AI. They are often created with widely available, general-purpose AI applications, most likely ChatGPT, and mimic scientific writing. Google Scholar easily locates and lists these questionable papers alongside reputable, quality-controlled research. Our analysis of a selection of questionable GPT-fabricated scientific papers found in Google Scholar shows that many are about applied, often controversial topics susceptible to disinformation: the environment, health, and computing. The resulting enhanced potential for evidence hacking, i.e., the malicious manipulation of society's evidence base, particularly in politically divisive domains, is a growing concern. Research questions:
- Where are questionable publications produced with generative pre-trained transformers (GPTs) that can be found via Google Scholar published or deposited?
- What are the main characteristics of these publications in relation to predominant subject categories?
- How are these publications spread in the research infrastructure for scholarly communication?
- How is the role of the scholarly communication infrastructure challenged in maintaining public trust in science and evidence through inappropriate use of generative AI?
Article
Full-text available
The rapid advances in generative AI tools have produced both excitement and worry about how AI will impact academic writing. However, little is known about what norms are emerging around AI use in manuscript preparation or how these norms might be enforced. We address both gaps in the literature by surveying 271 academics about whether it is necessary to report ChatGPT use in manuscript preparation and by running GPT-modified abstracts from 2,716 published papers through leading AI detection software to see whether these detectors can detect different AI uses in manuscript preparation. We find that most academics do not think that using ChatGPT to fix grammar needs to be reported, but detection software did not always draw this distinction, as abstracts for which GPT was used to fix grammar were often flagged as having a high chance of being written by AI. We also find disagreement among academics on whether more substantial use of ChatGPT to rewrite text needs to be reported, and these differences were related to perceptions of ethics, academic role, and English language background. Finally, we found little difference in academics' perceptions about reporting ChatGPT help and research assistant help, but significant differences in reporting perceptions between these sources of assistance and paid proofreading and other AI assistant tools (Grammarly and Word). Our results suggest that there may be challenges in getting authors to report AI use in manuscript preparation because (i) there is not uniform agreement about what uses of AI should be reported and (ii) journals might have trouble enforcing nuanced reporting requirements using AI detection tools.
Article
Full-text available
This evidence summary explores the potential and limitations of using ChatGPT for developing systematic literature searches. A systematic search identified the current peer-reviewed and grey literature, and studies were selected according to eligibility criteria. Included studies were analysed and synthesised narratively, focusing on the strengths, limitations, and recommendations for using ChatGPT to assist with the systematic literature searching process. The current literature is mostly opinion-driven, and little published literature originates from the library and information profession. At present, the limitations of ChatGPT for systematic literature searching outweigh its strengths; caution should be exercised, and human oversight is essential. More research is required, and information specialists and librarians are in a prime position to develop guidelines and share examples of best practice.
Article
Full-text available
Background Systematically screening published literature to determine the relevant publications to synthesize in a review is a time-consuming and difficult task. Large language models (LLMs) are an emerging technology with promising capabilities for the automation of language-related tasks that may be useful for such a purpose. Methods LLMs were used as part of an automated system to evaluate the relevance of publications to a certain topic based on defined criteria and on the title and abstract of each publication. A Python script was created to generate structured prompts consisting of text strings for instruction, title, abstract, and relevant criteria to be provided to an LLM. The relevance of a publication was evaluated by the LLM on a Likert scale (low relevance to high relevance). By specifying a threshold, different classifiers for inclusion/exclusion of publications could then be defined. The approach was used with four different openly available LLMs on ten published data sets of biomedical literature reviews and on a newly human-created data set for a hypothetical new systematic literature review. Results The performance of the classifiers varied depending on the LLM being used and on the data set analyzed. Regarding sensitivity/specificity, the classifiers yielded 94.48%/31.78% for the FlanT5 model, 97.58%/19.12% for the OpenHermes-NeuralChat model, 81.93%/75.19% for the Mixtral model, and 97.58%/38.34% for the Platypus 2 model on the ten published data sets. The same classifiers yielded 100% sensitivity at a specificity of 12.58%, 4.54%, 62.47%, and 24.74% on the newly created data set. Changing the standard settings of the approach (minor adaptation of the instruction prompt and/or changing the range of the Likert scale from 1–5 to 1–10) had a considerable impact on the performance. Conclusions LLMs can be used to evaluate the relevance of scientific publications to a certain review topic, and classifiers based on such an approach show some promising results. To date, little is known about how well such systems would perform if used prospectively when conducting systematic literature reviews and what further implications this might have. However, it is likely that in the future researchers will increasingly use LLMs for evaluating and classifying scientific publications.
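As a rough illustration of the approach that study describes, the sketch below assembles a structured prompt from instruction, title, abstract, and criteria, asks a model for a Likert rating, and applies an inclusion threshold. It is a minimal sketch under stated assumptions, not the authors' published script: call_llm is a hypothetical placeholder for whichever openly available model (FlanT5, Mixtral, etc.) is queried, and the prompt wording and threshold are illustrative.

```python
# Minimal sketch of Likert-based relevance screening (not the study's actual code).

def build_prompt(title: str, abstract: str, criteria: list[str]) -> str:
    """Assemble the structured prompt: instruction, title, abstract, and criteria."""
    criteria_text = "\n".join(f"- {c}" for c in criteria)
    return (
        "Rate the relevance of this publication to the review topic on a "
        "Likert scale from 1 (low relevance) to 5 (high relevance). "
        "Answer with a single integer.\n\n"
        f"Title: {title}\n\nAbstract: {abstract}\n\nCriteria:\n{criteria_text}"
    )

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: send the prompt to a locally hosted open LLM."""
    raise NotImplementedError("Wire this to FlanT5, Mixtral, or another model.")

def include_publication(title: str, abstract: str, criteria: list[str],
                        threshold: int = 4) -> bool:
    """Classify as include/exclude by thresholding the model's Likert rating."""
    rating = int(call_llm(build_prompt(title, abstract, criteria)).strip())
    return rating >= threshold
```

Varying the threshold trades sensitivity against specificity, which is consistent with the study's finding that classifier performance shifts considerably with prompt and scale settings.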
Article
Full-text available
Objective: To understand the current landscape of artificial intelligence (AI) for family medicine (FM) research in Canada, identify how the College of Family Physicians of Canada (CFPC) could support near-term positive progress in this field, and strengthen the community working in this field. Composition of the committee: Members of a scientific planning committee provided guidance alongside members of a CFPC staff advisory committee, led by the CFPC-AMS TechForward Fellow and including CFPC, FM, and AI leaders. Methods: This initiative included 2 projects. First, an environmental scan of published and gray literature on AI for FM produced between 2018 and 2022 was completed. Second, an invitational round table held in April 2022 brought together AI and FM experts and leaders to discuss priorities and to create a strategy for the future. Report: The environmental scan identified research related to 5 major domains of application in FM (preventive care and risk profiling, physician decision support, operational efficiencies, patient self-management, and population health). Although there has been little testing or evaluation of AI-based tools in practice settings, progress has been made since previous reviews in engaging stakeholders to identify key considerations about AI for FM and opportunities in the field. The round-table discussions further emphasized barriers to and facilitators of high-quality research; they also indicated that while there is immense potential for AI to benefit FM practice, the current research trajectory needs to change, and greater support is needed to achieve these expected benefits and to avoid harm. Conclusion: Ten candidate action items that the CFPC could adopt to support near-term positive progress in the field were identified, some of which an AI working group has begun pursuing. The candidate action items are roughly divided into avenues where the CFPC is well suited to take a leadership role in tackling priority issues in AI for FM research and specific activities or initiatives the CFPC could complete. Strong FM leadership is needed to advance AI research that will contribute to positive transformation in FM.
Article
Full-text available
The conversation about consciousness in artificial intelligence (AI) has been ongoing since the 1950s. Despite the numerous applications of AI identified in healthcare and primary healthcare, little is known about how a conscious AI would reshape its use in this domain. While there is a wide range of ideas as to whether AI can or cannot possess consciousness, a prevailing theme in all arguments is uncertainty. Given this uncertainty and the high stakes associated with the use of AI in primary healthcare, it is imperative to be prepared for all scenarios, including conscious AI systems being used for medical diagnosis, shared decision-making, and resource management in the future. This commentary serves as an overview of some of the pertinent evidence supporting the use of AI in primary healthcare and proposes ideas as to how consciousness of AI could support or further complicate these applications. Given the scarcity of evidence on the association between consciousness of AI and its current state of use in primary healthcare, our commentary identifies some directions for future research in this area, including assessing patients', healthcare workers', and policy-makers' attitudes towards consciousness of AI systems in primary healthcare settings.
Article
Full-text available
The recent release of highly advanced generative artificial intelligence (AI) chatbots, including ChatGPT and Bard, which are powered by large language models (LLMs), has attracted growing mainstream interest in their diverse applications in health and healthcare, including clinical practice. The potential applications of LLM-based programmes in the medical field range from assisting medical practitioners in improving their clinical decision-making and streamlining administrative paperwork to empowering patients to take charge of their own health. However, despite the broad range of benefits, the use of such AI tools also comes with several limitations and ethical concerns that warrant further consideration, encompassing issues related to privacy, data bias, and the accuracy and reliability of information generated by AI. Prior research has primarily centred on the broad applications of LLMs in medicine. To the author's knowledge, this is the first article that consolidates the current and pertinent literature on LLMs to examine their potential in primary care. The objectives of this paper are not only to summarise the potential benefits, risks, and challenges of using LLMs in primary care, but also to offer insights into the considerations that primary care clinicians should take into account when deciding whether to adopt and integrate such technologies into their clinical practice.
Article
Full-text available
Background: ChatGPT (Chat Generative Pre-trained Transformer) is a 175-billion-parameter natural language processing model that is already involved in scientific content and publications. Its influence ranges from providing quick access to information on medical topics and assisting in generating medical and scientific articles and papers to performing medical data analyses and even interpreting complex data sets. Objective: The future role of ChatGPT became a matter of debate shortly after its release and remains uncertain. The aim of this review was to analyze the role of ChatGPT in the medical literature during the first three months after its release. Methods: We performed a concise review of literature published in PubMed from December 1, 2022, to March 31, 2023. To find all publications related to or considering ChatGPT, the search term was kept simple ("ChatGPT" in AllFields). All publications that were available as full text in German or English were included. All accessible publications were evaluated according to specifications defined by the author team, e.g., impact factor, publication mode, article type, publication speed, type of ChatGPT integration, and content. The conclusions of the articles were used for a subsequent SWOT (strengths, weaknesses, opportunities, threats) analysis. All data were analyzed on a descriptive basis. Results: Of 178 studies in total, 160 could be evaluated. The average impact factor was 4.423 (range 0-96.216), and the average publication speed was 16 days (range 0-83 days). The articles comprised 77 editorials, 43 essays, 21 studies, six reviews, six case reports, six news items, and one meta-analysis. Of those, 54.4% were published as open access, and 11% were provided on preprint servers. Over 400 quotes with information on strengths, weaknesses, opportunities, and threats were detected; by far the most were related to weaknesses. ChatGPT excels in its ability to express ideas clearly and formulate general contexts comprehensibly. It performs so well that even experts in the field have difficulty identifying abstracts generated by ChatGPT, whereas its time-limited scope and, in particular, the need for corrections by experts were mentioned as weaknesses and threats. Opportunities include assistance in formulating medical issues for non-native English speakers, as well as the possibility of timely participation in the development of such artificial intelligence tools, since the technology is in its early stages and can therefore still be influenced. Conclusions: Artificial intelligence tools such as ChatGPT are already part of the medical publishing landscape. Despite apparent opportunities, policies and guidelines must be implemented to ensure benefits in education, clinical practice, and research rather than threats such as scientific misconduct, plagiarism, or inaccuracy.
Article
Full-text available
Generative artificial intelligence (AI) has the potential to transform many aspects of scholarly publishing. Authors, peer reviewers, and editors might use AI in a variety of ways, and those uses might augment their existing work or might instead be intended to replace it. We are editors of bioethics and humanities journals who have been contemplating the implications of this ongoing transformation. We believe that generative AI may pose a threat to the goals that animate our work but could also be valuable for achieving those goals. In the interests of fostering a wider conversation about how generative AI may be used, we have developed a preliminary set of recommendations for its use in scholarly publishing. We hope that the recommendations and rationales set out here will help the scholarly community navigate toward a deeper understanding of the strengths, limits, and challenges of AI for responsible scholarly work.
Article
Full-text available
IMPORTANCE The scientific community debates Generative Pre-trained Transformer (GPT)-3.5's article quality, authorship merit, originality, and ethical use in scientific writing. OBJECTIVES To assess GPT-3.5's ability to craft the background section of critical care clinical research questions compared with medical researchers with H-indices of 22 and 13. DESIGN Observational cross-sectional study. SETTING Researchers from 20 countries across six continents evaluated the backgrounds. PARTICIPANTS Researchers with a Scopus index greater than 1 were included. MAIN OUTCOMES AND MEASURES In this study, we generated a background section of a critical care clinical research question on "acute kidney injury in sepsis" using three different methods: a researcher with an H-index greater than 20, a researcher with an H-index greater than 10, and GPT-3.5. The three background sections were presented in a blinded survey to researchers with H-indices ranging between 1 and 96. First, the researchers evaluated the main components of the background using a 5-point Likert scale. Second, they were asked to identify which background was written by humans only or with large language model-generated tools. RESULTS A total of 80 researchers completed the survey. The median H-index was 3 (interquartile range, 1-7.25), and the largest share of researchers (36%) were from the critical care specialty. When compared with the researchers with H-indices of 22 and 13, GPT-3.5 was rated high on the Likert-scale ranking of the main background components (median 4.5 vs. 3.82 vs. 3.6 vs. 4.5, respectively; p < 0.001). The sensitivity and specificity for detecting researcher writing versus GPT-3.5 writing were poor: 22.4% and 57.6%, respectively. CONCLUSIONS AND RELEVANCE GPT-3.5 could create background research content indistinguishable from the writing of a medical researcher, and it was rated higher than medical researchers with H-indices of 22 and 13 in writing the background section of a critical care clinical research question.
Article
Full-text available
Background: The systematic review of clinical research papers is a labor-intensive and time-consuming process that often involves the screening of thousands of titles and abstracts. The accuracy and efficiency of this process are critical for the quality of the review and subsequent health care decisions. Traditional methods rely heavily on human reviewers, often requiring a significant investment of time and resources. Objective: This study aims to assess the performance of the OpenAI generative pretrained transformer (GPT) and GPT-4 application programming interfaces (APIs) in accurately and efficiently identifying relevant titles and abstracts from real-world clinical review data sets and to compare their performance against ground-truth labeling by 2 independent human reviewers. Methods: We introduce a novel workflow using the ChatGPT and GPT-4 APIs for screening titles and abstracts in clinical reviews. A Python script was created to make calls to the API with the screening criteria in natural language and a corpus of title and abstract data sets filtered by a minimum of 2 human reviewers. We compared the performance of our model against human-reviewed papers across 6 review papers, screening over 24,000 titles and abstracts. Results: Our results show an accuracy of 0.91, a macro F1-score of 0.60, a sensitivity of excluded papers of 0.91, and a sensitivity of included papers of 0.76. The interrater variability between 2 independent human screeners was κ=0.46, and the prevalence- and bias-adjusted κ between our proposed method and the consensus-based human decisions was κ=0.96. On a randomly selected subset of papers, the GPT models demonstrated the ability to provide reasoning for their decisions and corrected their initial decisions upon being asked to explain their reasoning for incorrect classifications. Conclusions: Large language models have the potential to streamline the clinical review process, save valuable time and effort for researchers, and contribute to the overall quality of clinical reviews. By prioritizing the workflow and acting as an aid rather than a replacement for researchers and reviewers, models such as GPT-4 can enhance efficiency and lead to more accurate and reliable conclusions in medical research.
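To make the kind of workflow described above concrete, here is a minimal sketch of title/abstract screening through the OpenAI chat completions API. The model name, prompt wording, example criteria, and INCLUDE/EXCLUDE convention are assumptions for illustration, not the study's exact script.

```python
# Illustrative title/abstract screening via the OpenAI API (a sketch, not the study's code).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical screening criteria, stated in natural language.
CRITERIA = "Include randomized trials of telehealth interventions in adults."

def screen(title: str, abstract: str) -> bool:
    """Ask the model for a one-word include/exclude decision on one record."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You screen titles and abstracts for a systematic review. "
                        "Reply with exactly one word: INCLUDE or EXCLUDE."},
            {"role": "user",
             "content": f"Criteria: {CRITERIA}\n\nTitle: {title}\n\nAbstract: {abstract}"},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("INCLUDE")
```

In practice, each candidate record would be passed through screen() and the decisions compared against the human reviewers' consensus, as the study did across its 6 review data sets.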
Article
Full-text available
The proliferation of artificial intelligence (AI)-generated content, particularly from models like ChatGPT, presents potential challenges to academic integrity and raises concerns about plagiarism. This study investigates the capabilities of various AI content detection tools in discerning human-authored from AI-authored content. Fifteen paragraphs each from ChatGPT models 3.5 and 4 on the topic of cooling towers in the engineering process, along with five human-written control responses, were generated for evaluation. AI content detection tools developed by OpenAI, Writer, Copyleaks, GPTZero, and CrossPlag were used to evaluate these paragraphs. Findings reveal that the AI detection tools were more accurate in identifying content generated by GPT-3.5 than GPT-4. However, when applied to the human-written control responses, the tools exhibited inconsistencies, producing false positives and uncertain classifications. This study underscores the need for further development and refinement of AI content detection tools as AI-generated content becomes more sophisticated and harder to distinguish from human-written text.
Article
Artificial Intelligence (AI) is poised to revolutionize family medicine, offering a transformative approach to achieving the Quintuple Aim. This article examines the imperative for family medicine to adapt to the rapidly evolving field of AI, with an emphasis on its integration in clinical practice. AI's recent advancements have the potential to significantly transform health care, and we argue for the proactive engagement of family medicine in directing AI technologies toward enhancing the "Quintuple Aim." The article highlights potential benefits of AI, such as improved patient outcomes through enhanced diagnostic tools, clinician well-being through reduced administrative burdens, and the promotion of health equity through the analysis of diverse data sets. However, we also acknowledge the risks associated with AI, including the potential for automation to diverge from patient-centered care and exacerbate health care disparities. Our recommendations stress the need for family medicine education to incorporate AI literacy, the development of a collaborative for AI integration, and the establishment of guidelines and standards through interdisciplinary cooperation. We conclude that although AI poses challenges, its responsible and ethical implementation can revolutionize family medicine, optimizing patient care and enhancing the role of clinicians in a technology-driven future.
Article
Generative artificial intelligence and large language models are the continuation of a technological revolution in information processing that began with the invention of the transistor in 1947. These technologies, driven by transformer architectures for artificial neural networks, are poised to broadly influence society. It is already apparent that these technologies will be adapted to drive innovation in education. Medical education is a high-risk activity: Information that is incorrectly taught to a student may go unrecognized for years until a relevant clinical situation appears in which that error can lead to patient harm. In this article, I discuss the principal limitations to the use of generative artificial intelligence in medical education—hallucination, bias, cost, and security—and suggest some approaches to confronting these problems. Additionally, I identify the potential applications of generative artificial intelligence to medical education, including personalized instruction, simulation, feedback, evaluation, augmentation of qualitative research, and performance of critical assessment of the existing scientific literature.
Article
Purpose: Worldwide clinical knowledge is expanding rapidly, but physicians have sparse time to review scientific literature. Large language models (eg, Chat Generative Pretrained Transformer [ChatGPT]) might help summarize and prioritize research articles to review. However, large language models sometimes "hallucinate" incorrect information. Methods: We evaluated ChatGPT's ability to summarize 140 peer-reviewed abstracts from 14 journals. Physicians rated the quality, accuracy, and bias of the ChatGPT summaries. We also compared human ratings of relevance to various areas of medicine with ChatGPT's relevance ratings. Results: ChatGPT produced summaries that were 70% shorter (mean abstract length of 2,438 characters decreased to 739 characters). Summaries were nevertheless rated as high quality (median score 90, interquartile range [IQR] 87.0-92.5; scale 0-100), high accuracy (median 92.5, IQR 89.0-95.0), and low bias (median 0, IQR 0-7.5). Serious inaccuracies and hallucinations were uncommon. Classification of the relevance of entire journals to various fields of medicine closely mirrored physician classifications (nonlinear standard error of the regression [SER] 8.6 on a scale of 0-100). However, relevance classification for individual articles was much more modest (SER 22.3). Conclusions: Summaries generated by ChatGPT were 70% shorter than the mean abstract length and were characterized by high quality, high accuracy, and low bias. Conversely, ChatGPT had only modest ability to classify the relevance of articles to medical specialties. We suggest that ChatGPT can help family physicians accelerate review of the scientific literature and have developed software (pyJournalWatch) to support this application. Life-critical medical decisions should remain based on full, critical, and thoughtful evaluation of the full text of research articles in context with clinical guidelines.
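For orientation, a summarization step along the lines evaluated in that study could look like the sketch below. This is a hypothetical illustration against the OpenAI API, not the pyJournalWatch implementation; the model choice, prompt wording, and length target are assumptions.

```python
# Hypothetical sketch of abstract summarization (not the pyJournalWatch code).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize_abstract(abstract: str) -> str:
    """Request a summary roughly a quarter of the abstract's original length."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "user",
             "content": "Summarize this journal abstract for a practicing "
                        "physician in about a quarter of its length, without "
                        "adding information that is not in the abstract:\n\n"
                        + abstract},
        ],
    )
    return response.choices[0].message.content.strip()
```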
Conference Paper
Literature reviews constitute an indispensable component of research endeavors; however, they often prove laborious and time-intensive. This study explores the potential of ChatGPT, a prominent large-scale language model, to facilitate the literature review process. By contrasting outcomes from a manual literature review with those achieved using ChatGPT, we ascertain the accuracy of ChatGPT's responses. Our findings indicate that ChatGPT aids researchers in swiftly perusing vast and heterogeneous collections of scientific publications, enabling them to extract pertinent information related to their research topic with an overall accuracy of 70%. Moreover, we demonstrate that ChatGPT offers a more economical and expeditious means of achieving this level of accuracy compared to human researchers. Nevertheless, we conclude that although ChatGPT exhibits promise in generating a rapid and cost-effective general overview of a subject, it presently falls short of generating a comprehensive literature overview requisite for scientific applications. Lastly, we propose avenues for future research to enhance the performance and utility of ChatGPT as a literature review assistant.
Article
Purpose: Rapid increases in technology and data motivate the application of artificial intelligence (AI) to primary care, but no comprehensive review exists to guide these efforts. Our objective was to assess the nature and extent of the body of research on AI for primary care. Methods: We performed a scoping review, searching 11 published or gray literature databases with terms pertaining to AI (eg, machine learning, bayes* network) and primary care (eg, general pract*, nurse). We performed title and abstract and then full-text screening using Covidence. Studies had to involve research, include both AI and primary care, and be published in English. We extracted data and summarized studies by 7 attributes: purpose(s); author appointment(s); primary care function(s); intended end user(s); health condition(s); geographic location of data source; and AI subfield(s). Results: Of 5,515 unique documents, 405 met eligibility criteria. The body of research focused on developing or modifying AI methods (66.7%) to support physician diagnostic or treatment recommendations (36.5% and 13.8%), for chronic conditions, using data from higher-income countries. Few studies (14.1%) had even a single author with a primary care appointment. The predominant AI subfields were supervised machine learning (40.0%) and expert systems (22.2%). Conclusions: Research on AI for primary care is at an early stage of maturity. For the field to progress, more interdisciplinary research teams with end-user engagement and evaluation studies are needed.
Chatbots, generative AI, and scholarly manuscripts. World Association of Medical Editors; 2023. Available from: https://wame.org/page2.php?id=106. Accessed 2024 Nov 27.
Authorship and AI tools. COPE position statement. Eastleigh, UK: Committee on Publication Ethics; 2023. Available from: https://publicationethics.org/cope-position-statements/ai-author. Accessed 2024 Nov 27.
Manyika J, Silberg J, Presten B. What do we do about the biases in AI? Harvard Business Review 2019 Oct 25. Available from: https://hbr.org/2019/10/what-do-we-do-about-the-biases-in-ai. Accessed 2024 Nov 27.