Although informal evaluations of modern LLMs can be found on social media, blogs, and
news outlets, a formal and comprehensive comparison among them has yet to be conducted.
In response to this gap, we have undertaken an extensive benchmark evaluation of LLMs and
conversational bots. Our evaluation involved collecting 1002 questions spanning 27 categories, a set we refer to as the “Wordsmiths dataset.” These categories include reasoning,
logic, facts, coding, bias, language, humor, and more. Each question in the dataset is accompanied
by an accurate and verified answer. We meticulously assessed four leading chatbots: ChatGPT,
GPT-4, Bard, and Claude, using this dataset. The results of our evaluation revealed the following
key findings: a) GPT-4 emerged as the top-performing chatbot across all categories, achieving a
success rate of 84.1%, whereas Bard struggled, achieving a success rate of only 62.4%. b) For approximately
93% of the questions, at least one of the four models answered correctly; however, all four models were correct on only about 44% of them. c) Bard's responses are less correlated
with those of the other models, whereas ChatGPT and GPT-4 are highly correlated in their responses.
d) The chatbots demonstrated proficiency in language understanding, facts, and self-awareness,
but encountered difficulties in areas such as math, coding, IQ, and reasoning. e) In
the bias, discrimination, and ethics categories, the models generally performed well, suggesting
they are relatively safe to use. To make future model evaluations on our dataset easier,
we also provide a multiple-choice version of it, called Wordsmiths-MCQ. Understanding
and assessing the capabilities and limitations of modern chatbots carries significant societal
implications. To foster further research in this field, we have made our dataset
publicly available at https://github.com/mehrdad-dev/Battle-of-the-Wordsmiths.