PreprintPDF Available

Battle of the Wordsmiths: Comparing ChatGPT, GPT-4, Claude, and Bard

Preprints and early-stage research may not have been peer reviewed yet.

Abstract and Figures

Although informal evaluations of modern LLMs can be found on social media, blogs, and news outlets, a formal and comprehensive comparison among them has yet to be conducted. In response to this gap, we have undertaken an extensive benchmark evaluation of LLMs and conversational bots. Our evaluation involved the collection of 1002 questions encompassing 27 categories, which we refer to as the “Wordsmiths dataset.” These categories include reasoning, logic, facts, coding, bias, language, humor, and more. Each question in the dataset is accompanied by an accurate and verified answer. We meticulously assessed four leading chatbots: ChatGPT, GPT-4, Bard, and Claude, using this dataset. The results of our evaluation revealed the following key findings: a) GPT-4 emerged as the top-performing chatbot across all categories, achieving a success rate of 84.1%. On the other hand, Bard faced challenges and achieved a success rate of 62.4%. b) Among the four models evaluated, one of them responded correctly approximately 93% of the time. However, all models were correct only about 44%. c) Bard is less correlated with other models while ChatGPT and GPT-4 are highly correlated in terms of their responses. d) Chatbots demonstrated proficiency in language understanding , facts, and self awareness. However, they encountered difficulties in areas such as math, coding, IQ, and reasoning. e) In terms of bias, discrimination, and ethics categories, models generally performed well, suggesting they are relatively safe to utilize. To make future model evaluations on our dataset easier, we also provide a multiple-choice version of it (called Wordsmiths-MCQ). The understanding and assessment of the capabilities and limitations of modern chatbots hold immense societal implications. In an effort to foster further research in this field, we have made our dataset available for public access, which can be found at
Content may be subject to copyright.
... Using pre-defined queries, Borji and Mohammadian [29] benchmarked the largest LLMs, such as GPT-4 and Bard, in their work. Based on their findings, the GPT-4 model appeared to be the most reliable for tasks centered on software development, code generation, and code comprehension, with great potential for usage in more complex scenarios. ...
... Moreover, the expanded token limit will enable GPT-4 to analyze larger chunks of data within even more extensive and specific prompts.In the case of GPT-3.5, although some improvement may occur with similar tightening and further refinement of the prompts,anomalies such as hallucinated AuthGuards and high misinterpretation in the detection of sensitive data lead us to strongly question the usefulness of the model for such purposes and are in line with the observations of Cheshkov et al. [34], who obtained better results even with a dummy categorizer than with GPT-3.5. On the other hand,the results fromGPT-4 support the conclusion of Borji and Mohammadian [29] that GPT-4 is possibly the most promising LLM currently available for software development and source code interpretation tasks. ...
Full-text available
Due to the proliferation of large language models (LLMs) and their widespread use in applications such as ChatGPT, there has been a significant increase in interest in AI over the past year. Multiple researchers have raised the question: how will AI be applied and in what areas? Programming, including the generation, interpretation, analysis, and documentation of static program code based on promptsis one of the most promising fields. With the GPT API, we have explored a new aspect of this: static analysis of the source code of front-end applications at the endpoints of the data path. Our focus was the detection of the CWE-653 vulnerability—inadequately isolated sensitive code segments that could lead to unauthorized access or data leakage. This type of vulnerability detection consists of the detection of code segments dealing with sensitive data and the categorization of the isolation and protection levels of those segments that were previously not feasible without human intervention. However, we believed that the interpretive capabilities of GPT models could be explored to create a set of prompts to detect these cases on a file-by-file basis for the applications under study, and the efficiency of the method could pave the way for additional analysis tasks that were previously unavailable for automation. In the introduction to our paper, we characterize in detail the problem space of vulnerability and weakness detection, the challenges of the domain, and the advances that have been achieved in similarly complex areas using GPT or other LLMs. Then, we present our methodology, which includes our classification of sensitive data and protection levels. This is followed by the process of preprocessing, analyzing, and evaluating static code. This was achieved through a series of GPT prompts containing parts of static source code, utilizing few-shot examples and chain-of-thought techniques that detected sensitive code segments and mapped the complex code base into manageable JSON structures.Finally, we present our findings and evaluation of the open source project analysis, comparing the results of the GPT-based pipelines with manual evaluations, highlighting that the field yields a high research value. The results show a vulnerability detection rate for this particular type of model of 88.76%, among others.
... Since the launch of GPT-4 in March 2023, there have been no major updates. Furthermore, competing models, such as Google's Bard, still underperform in tests and benchmarks (Ali et al., 2023;Borji & Mohammadian, 2023;Holmes et al., 2023). While large corporations like Alphabet, Apple, Microsoft, and Meta could gain major benefits from AI, tangible results are yet to be realized (Leswing, 2023). ...
Full-text available
ChatGPT is widely used among students, a situation that challenges educators. The current paper presents two strategies that do not push educators into a defensive role but can empower them. Firstly, we show, based on statistical analysis, that ChatGPT use can be recognized from certain keywords such as ‘delves’ and ‘crucial’. This insight allows educators to detect ChatGPT-assisted work more effectively. Secondly, we illustrate that ChatGPT can be used to assess texts written by students. The latter topic was presented in two interactive workshops pro-vided to educators and educational specialists. The results of the workshops, where prompts were tested live, indicated that ChatGPT, provided a targeted prompt is used, is good at recognizing errors in texts but not consistent in grading. Ethical and copyright concerns were raised as well in the workshops. In conclusion, the methods pre-sented in this paper may help fortify the teaching methods of educators. The computer scripts that we used for live prompting are available and enable educators to give similar workshops.
Full-text available
ChatGPT is a fascinating AI text generator tool. It is a language model developed by OpenAI, a research and deployment company with the mission, according to OpenAI’s website: “to ensure that artificial general intelligence benefits all of humanity”. ChatGPT is able to generate human-like texts. But how does it work? What about the quality of the texts it provides? And is it capable of being self-reflective? Information sources must be efficient, effective and reliable in education, in order to enhance students’ learning process. For this reason, we started a dialogue with ChatGPT-3 while using, among others, a SWOT analysis it generated about its own functioning in an educational setting. This enabled us, as human authors, to analyze the extent to which this AI system is able to practice self-reflection. Finally, the paper sketches implications for education and future research.
Full-text available
Transformative artificially intelligent tools, such as ChatGPT, designed to generate sophisticated text indistinguishable from that produced by a human, are applicable across a wide range of contexts. The technology presents opportunities as well as, often ethical and legal, challenges, and has the potential for both positive and negative impacts for organisations, society, and individuals. Offering multi-disciplinary insight into some of these, this article brings together 43 contributions from experts in fields such as computer science, marketing, information systems, education, policy, hospitality and tourism, management, publishing, and nursing. The contributors acknowledge ChatGPT’s capabilities to enhance productivity and suggest that it is likely to offer significant gains in the banking, hospitality and tourism, and information technology industries, and enhance business activities, such as management and marketing. Nevertheless, they also consider its limitations, disruptions to practices, threats to privacy and security, and consequences of biases, misuse, and misinformation. However, opinion is split on whether ChatGPT’s use should be restricted or legislated. Drawing on these contributions, the article identifies questions requiring further research across three thematic areas: knowledge, transparency, and ethics; digital transformation of organisations and societies; and teaching, learning, and scholarly research. The avenues for further research include: identifying skills, resources, and capabilities needed to handle generative AI; examining biases of generative AI attributable to training datasets and processes; exploring business and societal contexts best suited for generative AI implementation; determining optimal combinations of human and generative AI for various tasks; identifying ways to assess accuracy of text produced by generative AI; and uncovering the ethical and legal issues in using generative AI across different contexts.
Full-text available
Large language models represent a significant advancement in the field of AI. The underlying technology is key to further innovations and, despite critical views and even bans within communities and regions, large language models are here to stay. This commentary presents the potential benefits and challenges of educational applications of large language models, from student and teacher perspectives. We briefly discuss the current state of large language models and their applications. We then highlight how these models can be used to create educational content, improve student engagement and interaction, and personalize learning experiences. With regard to challenges, we argue that large language models in education require teachers and learners to develop sets of competencies and literacies necessary to both understand the technology as well as their limitations and unexpected brittleness of such systems. In addition, a clear strategy within educational systems and a clear pedagogical approach with a strong focus on critical thinking and strategies for fact checking are required to integrate and take full advantage of large language models in learning settings and teaching curricula. Other challenges such as the potential bias in the output, the need for continuous human oversight, and the potential for misuse are not unique to the application of AI in education. But we believe that, if handled sensibly, these challenges can offer insights and opportunities in education scenarios to acquaint students early on with potential societal biases, criticalities, and risks of AI applications. We conclude with recommendations for how to address these challenges and ensure that such models are used in a responsible and ethical manner in education.
One of the most important and influential philosophers of the last 30 years, John Searle has been concerned throughout his career with a single overarching question: how can we have a unified and theoretically satisfactory account of ourselves and of our relations to other people and to the natural world? In other words, how can we reconcile our common-sense conception of ourselves as conscious, free, mindful, rational agents in a world that we believe comprises brute, unconscious, mindless, meaningless, mute physical particles in fields of force? The essays in this collection are all related to the broad overarching issue that unites the diverse strands of Searle's work. Gathering in an accessible manner essays available only in relatively obscure books and journals, this collection will be of particular value to professionals and upper-level students in philosophy as well as to Searle's more extended audience in such fields as psychology and linguistics.
Language models are few-shot learners
  • Tom Brown
A comprehensive survey of ai-generated content
  • Yihan Cao
Towards understanding and mitigating social biases in language models
  • Paul Pu Liang
Gender shades: Intersectional accuracy disparities in commercial gender classification
  • Joy Buolamwini