The Context Windows Fallacy in Large Language Models
Inês Hipólito
Department of Philosophy, Macquarie University, Australia
Abstract
The integration of Large Language Models (LLMs) into various applications as
quasi-search engines raises significant concerns about their impact on misinformation
and societal decision-making processes. LLMs are designed to generate text that
mimics human speech, often detached from factual reality, which poses risks when
used unchecked. Developers and corporations advancing LLM technology argue for
enhanced effectiveness through increased context window size and computational
power. However, this paper challenges this assertion, arguing that augmenting
the context window primarily improves the LLMs’ ability to generate human-like
narratives rather than enhancing their capacity for real-world decision-making. The
paper advocates for a paradigm shift where LLMs move beyond merely sounding
human to effectively adjudicating real-world challenges, emphasizing the need for
ethical and practical advancements in AI development to mitigate the risks associated
with misinformation and naive use of LLMs in critical decision-making processes.
Finally, the paper proposes alternative approaches and criteria to address identified
limitations; including grounded language models, causal reasoning integration, ethical
frameworks, hybrid systems, and interactive learning.
Keywords: Large Language Models (LLMs), misinformation, societal decision-making,
human-like text generation, context window size, computational power, real-world decision-
making, ethical AI development
1 Introduction
The tech industry developing Large Language Models (LLMs) promotes the idea that it is possible to make significant progress in improving the accuracy of a model's outputs by increasing the context window. For instance, OpenAI has emphasized the expanded context window
of GPT-4 Turbo, which now extends to 128k tokens. This enhancement is claimed to
improve the model’s ability to handle complex tasks and maintain coherence over longer
texts, theoretically leading to more accurate and reliable outputs Chen [2023], Chowdhery
[2023], Liu [2024].
Similarly, Google’s PaLM 2 boasts an extended context length, which is purported to
improve performance across various benchmarks, further supporting the idea that larger
context windows can lead to better model performance Cheng et al. [2023], Chowdhery
[2023], Liu [2024].
Recent advancements in the field have also seen the introduction of models like
Anthropic’s Claude 3, which can process up to 200,000 tokens, significantly surpassing
previous capabilities Cardillo [2024]. This trend is further exemplified by Meta’s LLaMA
3 and Google’s Gemini 1.5, both of which have been designed with extended context
windows to enhance their performance across diverse applications Guinness [2024].
These claims suggest a direct correlation between larger context windows and more
accurate, reliable outputs. The underlying assumption is that with more context, LLMs can
better understand and respond to prompts, potentially mitigating issues of hallucination
and confabulation Zhang [2024]. This is supported by recent studies indicating that larger
context windows help maintain the relevance and coherence of generated text, thereby
reducing the incidence of errors and improving overall model reliability Revelry [2023],
Appen [2024].
But as these AI wordsmiths infiltrate our daily lives, from chatbots to content creation,
we face a critical question: Are we unleashing a flood of convincing misinformation?
This paper challenges the tech industry’s focus on bigger context windows and more
computing power. These enhancements merely polish the AI’s eloquence, not its grasp
of reality: augmenting the context window primarily enhances LLMs’ ability to generate
human-like narratives rather than their capacity for real-world decision-making. This
paper advocates for a paradigm shift in AI development where LLMs move beyond merely
sounding human to effectively addressing real-world challenges. It emphasizes the need for
ethical and practical advancements to mitigate the risks associated with misinformation
and the naive use of LLMs in critical decision-making processes. While this paper critically
examines claims about the reach of increased context windows, it acknowledges that such
enhancements may offer certain advantages in specific applications, particularly in tasks
requiring long-range coherence and context retention; however, the primary focus remains
on the limitations of this approach in addressing fundamental issues of AI real-world
understanding and decision-making claims.
This paper employs a philosophical approach to critically examine the claims sur-
rounding LLMs and context window expansion. Our methodology consists of conceptual
analysis of key terms and claims in LLM development and logical argumentation to expose
potential fallacies in the LLM space, aiming to provide a foundation for future empirical
investigations and to guide more critical thinking in AI development and use.
The paper will unfold as follows: first, it will briefly define and explain LLMs as a
form of generative AI, focusing particularly on their role in mimicking human speech. The
discussion will center on how LLMs utilize context windows to achieve human-like output.
Next, it will identify the limitations associated with LLMs, particularly their tendency
towards hallucinations and confabulations. These phenomena highlight the challenges
posed by LLMs when generating text detached from factual reality. Finally, the paper will
present its central argument: The Context Windows Fallacy. This section will critically
examine the prevailing belief that expanding context windows enhances LLMs’ capacity for
nuanced understanding and decision-making, challenging this notion within the broader
philosophical and ethical framework of AI development. While critiquing the current focus
on expanding context windows in LLM development, the last section proposes several
alternative approaches to address the identified limitations. These include developing
grounded language models, integrating causal reasoning mechanisms, implementing ethical
frameworks and constrained generation, exploring hybrid symbolic-neural systems, fostering
collaborative and interactive learning, and enhancing metacognitive capabilities. By
exploring these alternatives, we aim to shift the paradigm of LLM development towards
systems that demonstrate not just linguistic fluency, but genuine understanding, ethical
reasoning, and reliable decision-making capabilities.
2 What are LLMs?
Large Language Models (LLMs) are advanced artificial intelligence systems designed to
process and generate human-like text based on vast amounts of training data Brown et al.
[2020b], Hadi et al. [2023]. These models, such as GPT-3 and its successors, utilize deep
learning techniques, particularly transformer architectures, to understand and produce
coherent and contextually appropriate language Vaswani et al. [2017].
LLMs are characterized by their ability to perform a wide range of language tasks,
including text completion, translation, summarization, and even rudimentary reasoning,
without being specifically trained for each task Bommasani et al. [2021]. This versatility
stems from their "few-shot" learning capabilities, allowing them to adapt to new tasks
with minimal examples Wei et al. [2022].
The primary goal of LLMs is to generate text that is indistinguishable from human-
written content. They achieve this by predicting the most probable next word in a
sequence, based on the patterns learned from their training data Bender et al. [2021a].
This approach allows LLMs to produce fluent and contextually relevant responses, often
giving the impression of understanding and intelligence Floridi and Chiriatti [2020a].
LLMs operate on a foundation of deep learning techniques, particularly leveraging
transformer architectures, to process and generate human-like text according to the
following steps:
Transformer Architecture: LLMs are built upon transformer architectures, ini-
tially introduced by Vaswani et al. in 2017 Vaswani et al. [2017]. Transformers
rely on self-attention mechanisms that allow the model to weigh the importance of
different words in a sentence relative to each other. This mechanism enables the
model to capture long-range dependencies within the input text efficiently.
Pre-training: Before LLMs can generate text, they undergo extensive pre-training
on vast amounts of text data. During pre-training, the model learns to predict
the next word in a sequence based on the context provided by preceding words.
This process helps the model develop an understanding of syntactic and semantic
structures in the language Brown et al. [2020b].
Fine-tuning: After pre-training, LLMs are fine-tuned on specific tasks or domains.
Fine-tuning involves exposing the model to task-specific data and adjusting its
parameters to improve performance on the target task. This step allows LLMs to
adapt their learned representations to different tasks without requiring extensive
retraining from scratch Bommasani et al. [2021].
Tokenization: Text input to LLMs is tokenized, meaning it is split into smaller
units such as words, subwords, or characters, which are then converted into numerical
representations understandable by the model Vaswani et al. [2017].
Model Architecture: LLMs typically consist of multiple layers of transformers.
Each transformer layer contains self-attention mechanisms and feedforward neural
networks. The self-attention mechanism enables the model to focus on different
parts of the input text, while the feedforward networks process and transform the information obtained through attention Vaswani et al. [2017].

Figure 1: Overview of Large Language Model (LLM) Processes (schematic pipeline: Input Text → Tokenization → Pre-training → Fine-tuning → Transformer Architecture → Decoding → Output Text).
Decoding: When generating text, LLMs use a decoding process where they predict
the most likely sequence of words following a given prompt. This involves iteratively
generating one word at a time based on the probabilities assigned to possible words
in the model’s vocabulary Bender et al. [2021a].
Output Generation: The output generated by LLMs is conditioned on the input
prompt and the context provided by previous words in the generated sequence. The
model aims to produce text that is contextually coherent and fluent, resembling
human-written language Floridi and Chiriatti [2020a].
Evaluation and Optimization: LLMs are evaluated based on metrics such as
perplexity (a measure of how well the model predicts the next word) and are optimized
through techniques like gradient descent, where the model’s parameters are adjusted
to minimize prediction errors during training Wang et al. [2022], Tunstall et al.
[2022].
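To make the decoding and evaluation steps above more concrete, the following minimal Python sketch (an illustration only; the toy vocabulary, logits, and function names are assumptions of mine, not taken from any particular LLM implementation) shows how raw model scores are turned into next-token probabilities and how perplexity summarizes how well those probabilities fit an observed sequence.

import math

def softmax(logits):
    # Convert raw model scores into a probability distribution over the vocabulary.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_next_token(vocab, logits):
    # Decoding step: pick the most probable next token given the current context.
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs[best]

def perplexity(token_probs):
    # Perplexity: exponential of the average negative log-probability the model
    # assigned to the tokens that actually occurred (lower is better).
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Toy illustration with an invented four-word vocabulary and made-up logits.
vocab = ["the", "cat", "sat", "mat"]
logits = [1.2, 0.3, 2.5, 0.1]  # hypothetical scores from a model's final layer
token, prob = greedy_next_token(vocab, logits)
print(token, round(prob, 3))                    # highest-probability continuation
print(round(perplexity([0.4, 0.25, 0.5]), 3))   # perplexity of a three-token sequence

In a real LLM the vocabulary contains tens of thousands of subword tokens and the logits come from the transformer's final layer, but the mechanics of decoding and perplexity evaluation are the same.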
Overall, LLMs represent a sophisticated blend of transformer architectures, extensive
pre-training on large datasets, and fine-tuning for specific tasks, enabling them to generate
high-quality text output across a range of natural language processing applications. Their
effectiveness lies in their ability to learn complex patterns in language and generate human-
like responses based on learned statistical relationships in the training data Yenduri [2024],
Zhang et al. [2023].
However, it’s crucial to note that despite their impressive output, LLMs do not possess
true understanding or knowledge in the way humans do. They are essentially sophisticated
pattern recognition systems, capable of producing human-like text without grasping the
underlying meaning or having real-world knowledge Marcus [2020a]. This limitation is at
the core of the challenges and risks associated with their widespread use, particularly in
contexts requiring factual accuracy or real-world decision-making Weidinger et al. [2021].
2.1 The Role of Context Window in LLMs
The context window, also known as context length or sequence length, refers to the
maximum number of tokens (words or subwords) that a Large Language Model (LLM)
can process and consider simultaneously during text generation. This parameter is crucial
for the model’s capability to produce coherent, contextually relevant, and human-like text
Ballsun-Stanton and Hipólito [2024].
A larger context window enhances the model’s ability to maintain coherence over
extended passages. This is essential for emulating human-like text generation because it
allows the model to:
Track Topics and Themes: A broader context window enables the model to follow
and recall topics and themes introduced earlier in the text Dai et al. [2019].
Maintain Narrative Structure: It supports consistent narrative structures over
longer texts, ensuring logical flow and coherence Brown et al. [2020b].
Resolve Long-Range Dependencies: The model can effectively handle references
such as pronouns that refer to entities mentioned much earlier in the text Rae et al.
[2020].
In addition to these capabilities, an expanded context window allows LLMs to handle
complex, multi-part queries or instructions more effectively, enabling them to:
Provide Accurate Responses: It improves the model’s ability to offer more
precise and relevant answers to detailed questions Adiwardana et al. [2020].
Follow Complex Instructions: The model can better interpret and act upon
intricate instructions, reflecting a more nuanced understanding of complex tasks Wei
et al. [2022].
A wider context window also contributes to:
Contextually Appropriate Responses: The model generates text that is more
contextually appropriate by considering a broader range of preceding content Roller
et al. [2021].
Nuanced Outputs: It enables the model to produce more specific and nuanced
outputs, incorporating subtle contextual cues present in earlier text Clark et al.
[2022].
Figure 2: Effects of Different Context Window Sizes in LLMs (schematic comparing small, medium, and large context windows against the capabilities listed above).
In conversational applications, an expanded context window allows the LLM to:
Remember Previous Interactions: It can reference and maintain consistency
with information from earlier in the conversation, similar to human dialogue patterns
Adiwardana et al. [2020].
Maintain Persona and Tone: The model can sustain a consistent persona and
tone throughout extended interactions.
Furthermore, larger context windows improve performance on document-level tasks
such as:
Summarization: They enhance the model’s ability to summarize longer texts
effectively Zhang et al. [2020].
Question-Answering: They allow for more comprehensive answers based on
extensive context Kandpal et al. [2022].
Extended context windows also facilitate more effective in-context learning, enabling
the model to:
Adapt to New Tasks: The model can adapt more readily to new domains or tasks
based on examples provided in the prompt Brown et al. [2020b].
Exhibit Flexibility: It demonstrates greater flexibility in applying learned patterns
to novel situations Wei et al. [2022].
Finally, with an expanded context window, LLMs can:
Follow Complex Reasoning Chains: They can manage and generate more
intricate chains of reasoning Kojima et al. [2022].
Maintain Logical Consistency: The model is better at maintaining logical
consistency over longer texts, closely mimicking human cognitive processes Kojima
et al. [2022].
Recent advancements have significantly increased the context window of LLMs. For
example, GPT-3 has a context window of 2048 tokens, with newer models featuring even
larger contexts Beltagy et al. [2020], Kublik and Saboo [2023].
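To illustrate what a fixed context window means operationally (a minimal Python sketch under simplifying assumptions, not the tokenization or truncation policy of any specific model), the snippet below shows that only the most recent tokens of a growing input remain visible to the model, while everything older falls out of scope.

def fit_to_context(tokens, context_window):
    # Keep only the most recent tokens that fit inside the context window;
    # anything earlier is no longer visible to the model at all.
    if len(tokens) <= context_window:
        return tokens
    return tokens[-context_window:]

# A hypothetical 10,000-token conversation and a 2,048-token window (GPT-3-sized).
conversation = [f"tok{i}" for i in range(10_000)]
visible = fit_to_context(conversation, 2_048)
print(len(visible))   # 2048
print(visible[0])     # 'tok7952' -- everything before this token has been dropped

This is why, for example, early turns of a long conversation can silently stop influencing a model's replies once the window is exhausted.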
In conclusion, the context window is pivotal in enhancing the human-like quality of
LLMs by improving their coherence, ability to handle complex queries, contextually relevant
responses, and overall performance in various language tasks. As context windows continue
to expand, LLMs are expected to produce increasingly sophisticated and human-like text
across diverse applications.
3 What are the limitations of LLMs?
While LLMs have demonstrated remarkable capabilities in generating human-like text, they
are subject to significant limitations, particularly in terms of factual accuracy and reliability.
Two critical issues that emerge are hallucination and confabulation, which highlight the
models’ tendency to generate false or nonsensical information while maintaining a façade
of coherence and confidence.
Hallucination refers to the phenomenon where LLMs generate text that is unfaithful
to the given context or prompt, often producing false or non-existent information Ji et al.
[2022]. This issue manifests in several ways:
Factual Inaccuracies: LLMs can confidently state incorrect facts or invent details
that don’t exist in reality Maynez et al. [2020].
Entity Hallucination: Models may introduce entities (people, places, or things)
that are not present in the source material or don’t exist at all Dušek et al. [2020].
Temporal Inconsistencies: LLMs might generate content that is anachronistic or
temporally impossible given the context Lin et al. [2022].
Hallucination is particularly problematic in tasks such as summarization, question-
answering, and dialogue systems, where factual accuracy is crucial Zhao et al. [2020].
Confabulation goes beyond simple hallucination, involving the generation of elaborate,
false narratives or explanations. This phenomenon is characterized by:
Coherent Fabrication: LLMs can produce lengthy, internally consistent, but
entirely fictional accounts or explanations Dziri et al. [2021].
False Memories: When prompted to recall or elaborate on previous interactions,
LLMs may invent details or entire conversations that never occurred Aher et al.
[2021].
Overconfident Assertions: Models often present confabulated information with
high confidence, making it difficult for users to distinguish between fact and fiction
Evans et al. [2021].
3.1 The Drive to Sound Human
The root of these issues lies in the fundamental design of LLMs, which are trained to
predict the most likely next token based on patterns in their training data, rather than
to represent or reason about factual knowledge Bender and Koller [2020a]. This leads to
several problematic behaviors:
Prioritizing Plausibility Over Truth: LLMs generate text that sounds plausible
and coherent, even if it means inventing information Marcus [2020a].
Context Over-reliance: When faced with ambiguous or incomplete information,
LLMs tend to fill in gaps with plausible but potentially false details Jung et al.
[2022].
Inability to Admit Ignorance: Unlike humans, who can acknowledge when they
don’t know something, LLMs are designed to always provide an answer, leading
to confabulation when faced with queries outside their knowledge base Schick and
Schütze [2021].
Amplification Through Interaction: In conversational contexts, LLMs may build
upon their own confabulations, creating increasingly elaborate but false narratives
Roller et al. [2021].
These limitations pose significant challenges for the deployment of LLMs in real-world
applications, particularly those requiring factual accuracy or decision-making capabilities.
As Sheng et al. Sheng et al. [2021] argue, the impressive fluency of LLMs can mask their
underlying limitations, leading to overreliance on these systems in inappropriate contexts.
Addressing these issues requires not only technical advancements in model architecture
and training methodologies but also a fundamental rethinking of how we evaluate and
deploy AI systems. As LLMs continue to evolve, developing robust methods for fact-
checking, uncertainty quantification, and model interpretability will be crucial in mitigating
the risks associated with hallucination and confabulation Xu et al. [2022].
3.2 Recent Case Studies of LLM Limitations
Recent developments have further highlighted the limitations of LLMs, particularly in
high-stakes applications:
Legal Misinformation: In 2023, lawyers used ChatGPT to prepare a legal brief,
which included citations to non-existent cases, leading to sanctions from the court
[Wakefield, 2023].
Academic Integrity: A 2023 study by Susnjak [2023] found that GPT-3.5 could
generate fake scientific abstracts that were indistinguishable from real ones to human
evaluators, raising concerns about potential academic fraud.
Misinformation in Journalism: In 2022, CNET used an AI tool to write articles,
some of which contained factual errors, highlighting the risks of unchecked AI-
generated content in media [Kan, 2023].
Mental Health Concerns: A 2023 study by Pereira et al. [2023] found that AI
chatbots, when prompted with suicidal ideation scenarios, sometimes provided inap-
propriate or harmful responses, underscoring the risks in mental health applications.
These cases illustrate the ongoing challenges of hallucination and confabulation in
state-of-the-art LLMs, even as their capabilities continue to expand.
3.3 Implications of LLM Use
The use of LLMs for knowledge generation in critical sectors raises significant concerns due
to their propensity for hallucination and confabulation. In academic research, LLMs have
shown potential in literature review, hypothesis generation, and even scientific writing
assistance. However, their use poses risks to academic integrity: Gao et al. [2022] found
that LLM-generated scientific abstracts could pass human evaluation, raising concerns
about potential misuse in academic publishing. Weinstein et al. [2023] demonstrated that
LLMs could generate plausible but fictitious research findings, potentially flooding the
academic space with misinformation. While LLMs can enhance research productivity, their
use without rigorous fact-checking could undermine the reliability of academic literature.
Recent research has further elucidated the implications of LLM use across various sectors:
Healthcare: Metwaly et al. [2023] demonstrated that while LLMs show promise
in medical question-answering tasks, they still produce a significant number of
hallucinations, emphasizing the need for human oversight in clinical applications.
Education: A 2023 study by Mollick and Mollick [2023] found that students who
used AI writing assistants showed improved performance, but raised concerns about
over-reliance and the need for new assessment methods.
Cybersecurity: Kourtellis et al. [2023] highlighted how LLMs could be exploited to
generate more convincing phishing emails and social engineering attacks, presenting
new challenges for digital security.
Environmental Science: Olesen and Lassen [2023] explored the potential of LLMs
in climate change communication but noted the risk of amplifying misinformation if
not carefully fact-checked.
These recent findings underscore the dual nature of LLMs as powerful tools with
significant potential benefits, but also as sources of risk that require careful management
and oversight.
While LLMs offer significant potential for knowledge generation across various sectors,
their tendency towards hallucination and confabulation poses substantial risks. Xu
et al. [2022] emphasize the importance of developing robust methods for fact-checking,
uncertainty quantification, and model interpretability to mitigate these risks.
4 The Philosophical Examination of LLM Design: The Context Windows Fallacy
Industry developing LLMs may advocate that expanding the context window of these
models significantly enhances the accuracy and reliability of their outputs. For instance,
OpenAI has highlighted the extended context window of GPT-4, suggesting that it improves
the model’s ability to handle complex tasks and maintain coherence over longer texts
Brown et al. [2020a]. Similarly, Google’s PaLM 2 boasts an extended context length,
claiming enhanced performance across various benchmarks Chowdhery [2023]. These
claims imply a direct correlation between larger context windows and more accurate
outputs, suggesting that increased context enables LLMs to better understand and respond
to prompts, potentially reducing issues like hallucination and confabulation.
However, this perspective is fundamentally flawed and fails to address key limitations
inherent in LLM design. The primary goal of LLMs is to generate text that mimics
human-like language rather than ensuring factual accuracy or reliability. Enhancing
computational resources or context size does not inherently solve the issues of accuracy
and reliability in LLM-generated information. This section presents a formal proof to
elucidate why the belief that expanding the context window improves decision-making
capabilities is a fallacy.
Theorem 1. Let P be the proposition "The primary goal of LLMs is to generate text that sounds human-like," and let Q be the proposition "Increasing the context window or computational power will lead to reliable real-world decision-making capabilities." If P is true, then Q is false.

Proof. We will prove this theorem using modus ponens.

Definitions:

Proposition P: The primary goal of LLMs is to generate text that sounds human-like.

Proposition Q: Increasing the context window or computational power will lead to reliable real-world decision-making capabilities.

Application of Modus Ponens:

From Premise 1 (P → ¬Q) and the justification of Premise 2, we infer ¬Q.

Therefore, if P is true, then Q must be false (¬Q).
Justification:
Premise 1: The primary goal of LLMs is to generate text that resembles human
language. This involves creating responses that are contextually appropriate and
plausible, rather than ensuring factual accuracy or truthfulness.
Premise 2: Generating human-like text involves optimizing for fluency and coherence
rather than accuracy. LLMs are designed to produce text that mimics human patterns
of communication, which does not necessarily correlate with providing accurate or
reliable information.
Premise 3: Enhancements like increasing the context window or computational
power improve the LLM’s ability to generate coherent and contextually relevant
responses. However, these improvements do not fundamentally change the model’s
focus from human-like text generation to ensuring factual correctness.
Premise 4: Reliable decision-making requires accurate and truthful information.
Enhancements that make text more human-like do not address the core need for
accuracy in decision-making contexts.
Conclusion: Therefore, while increasing the context window or computational
power can make LLMs’ text generation more human-like, it does not inherently
address the need for factual accuracy necessary for reliable decision-making.
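For readers who prefer the propositional skeleton in formal notation, the following Lean sketch states the argument with Premise 1 (that P entails not-Q) taken as a hypothesis rather than derived; the theorem name and formulation are mine and are offered only as an illustration of the modus ponens step.

-- Propositional skeleton of Theorem 1: given Premise 1 (P → ¬Q) and P, infer ¬Q.
theorem context_window_fallacy (P Q : Prop) (premise1 : P → ¬Q) (hP : P) : ¬Q :=
  premise1 hP

The formalization makes explicit that the philosophical weight of the argument rests entirely on justifying the implication from P to ¬Q, which the premises above address informally.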
Discussion: This proof illustrates that the goal of LLMs to generate human-like
text does not ensure that increasing their capabilities will lead to improved real-world
decision-making. The enhancement in producing more plausible text does not equate to an
improvement in accuracy or reliability needed for decision-making. Thus, the claim that
increasing the context window will enhance the accuracy of LLM outputs is a fallacy. The
core issue remains that LLMs, by design, focus on mimicking human language rather than
ensuring factual correctness. Consequently, technical advancements such as expanding
context windows fail to address the fundamental limitations of LLMs in providing reliable
information for real-world decision-making.
4.1 Illustrative Examples in Support of the Theorem
Consider the following scenarios that illustrate the limitations of context window expansion:
Ethical Decision Making: An LLM tasked with advising on a complex ethical
dilemma (e.g., the trolley problem) may generate a more elaborate response with an
expanded context window, but it still lacks the moral reasoning capabilities to make
an informed ethical decision.
Scientific Discovery: While an LLM with a larger context window might generate
more coherent scientific text, it cannot perform the creative leap required for genuine
scientific discovery, as it lacks the ability to form novel hypotheses based on causal
understanding of phenomena.
Long-term Planning: In a strategic planning task, an LLM with an expanded
context window might produce a more detailed plan, but it cannot adapt this plan
to unforeseen circumstances or understand the long-term consequences of actions.
These examples demonstrate that while expanding context windows may improve
surface-level performance, it does not address the fundamental limitations in LLMs’ ability
to understand, reason, and make decisions in complex real-world scenarios.
By critically examining the nature of LLM goals and the specific requirements for
effective decision-making, the proof clearly establishes the fallacy in assuming that merely
increasing the context window can overcome the inherent limitations of LLMs. The design
philosophy of LLMs fundamentally clashes with the needs of reliable decision-making,
making it evident that such claims are not only misleading but also fail to address the
deeper issues related to accuracy and truthfulness in LLM-generated content.
5 Discussion: The Philosophical Implications of the Context Windows Fallacy
Despite the claims made by LLM developers, the fundamental issue of confabulation in
LLMs cannot be resolved merely by increasing context windows. While it is true that
larger context windows may lead to more coherent and contextually appropriate responses,
they do not address the core problem: LLMs are designed to produce text that sounds
plausible, not to arrive at truthful or accurate conclusions about the real world.
Improved Mimicry, Not Understanding Larger context windows enable LLMs to
better mimic human-like responses over extended text. However, as Bender and Koller
(2020) argue, this improved mimicry does not equate to genuine understanding or the
ability to make accurate real-world decisions Bender and Koller [2020b]. LLMs remain
adept at generating text that appears human-like but lacks true comprehension.
Amplification of Biases Increased context may actually amplify existing biases and
inaccuracies present in the training data. Bender et al. (2021) highlight that larger models
can learn to reproduce problematic patterns more convincingly Bender et al. [2021b]. This
means that while context windows expand, they also broaden the scope for reinforcing
biases and inaccuracies rather than correcting them.
Illusion of Knowledge The improved coherence and contextual relevance provided by
larger context windows may create a stronger illusion of knowledge and understanding.
This can lead to an overreliance on these systems in critical decision-making scenarios, as
Marcus (2020) points out Marcus [2020b]. The perception of enhanced capability does not
necessarily translate into actual improvements in decision-making accuracy or reliability.
Lack of Causal Reasoning Even with vastly increased context, LLMs still lack the
causal reasoning abilities necessary for reliable decision-making in novel situations Pearl
and Mackenzie [2018]. The ability to generate plausible text based on a broad context
does not provide LLMs with the causal understanding required to make sound decisions in
real-world scenarios.
Ethical Implications The focus on increasing context windows as a means to improve
accuracy may divert attention and resources from addressing more fundamental issues,
such as the need for AI systems that are grounded in real-world understanding and
ethical considerations. Relying on context window expansion may detract from the pursuit
of developing AI systems that genuinely reason about the world and adhere to ethical
principles Floridi and Chiriatti [2020b].
In conclusion, while the efforts of LLM companies to enhance their models through
increased context windows are notable, they do not address the fundamental limitations
of LLMs in real-world decision-making contexts. True progress in this area will require a
paradigm shift in AI development, moving beyond the goal of generating human-like text
to creating systems with genuine understanding and the ability to reason about the real
world. Until such a shift occurs, the expansion of context windows, though impressive,
remains a sophisticated form of "sounding human" rather than a path to reliable, truthful
AI decision-making.
Supporting arguments
Mimicry over Understanding Bender et al. (2021) describe LLMs as "stochastic parrots" that mimic patterns in their training data without true understanding. This mimicry results in text that sounds plausible but may not be factually accurate or logically sound Bender et al. [2021b]. The core design of LLMs is geared towards generating
human-like text rather than ensuring the truthfulness of the information.
Human Confabulation and LLMs Humans often generate false memories or explana-
tions to maintain coherent narratives or fill gaps in knowledge—a phenomenon known as
confabulation. LLMs, in their quest to sound human-like, replicate this tendency. Floridi
and Chiriatti (2020) point out that LLMs are designed to produce the most likely sequence
of tokens given a prompt, not to produce true statements about the world Floridi and
Chiriatti [2020b].
The Role of Context Window Increasing the context window of LLMs allows them
to consider more text when generating responses, enhancing their ability to maintain
consistency and draw upon a wider range of information. However, this enhancement
does not alter their fundamental objective. Marcus (2020) argues that despite vast data,
current AI systems lack the causal understanding and real-world grounding necessary
for reliable decision-making Marcus [2020b]. Bender and Koller (2020) highlight that
LLMs, regardless of context window size, fundamentally lack true understanding and
reasoning capabilities because they rely on statistical correlations rather than genuine com-
prehension Bender and Koller [2020b].
Limitations of Larger Context Windows While larger context windows may improve
LLM performance in specific tasks, they do not resolve the inherent issues of hallucination
and confabulation in their design. Two crucial points often overlooked are:
Training Data Limitations: The training data of LLMs, no matter how extensive,
is historical and often contains biases, inaccuracies, and contradictions Birhane [2021].
Novel Situations and Updating Beliefs: Real-world decision-making often
requires reasoning about novel situations not present in training data and the ability
to update beliefs based on new information Pearl and Mackenzie [2018].
The Nature of Language Human language frequently involves speculation, fiction,
and hypotheticals. An LLM trained to mimic human language will inevitably incorporate
these non-factual elements, making it unsuitable for reliable real-world decision-making
without fundamental changes to its architecture and training objectives.
The Chinese Room Argument Applied to LLMs Searle’s Chinese Room thought
experiment [Searle, 1980] can be applied to LLMs to illustrate how increased context
windows do not necessarily lead to understanding. Just as the person in the Chinese Room
can process symbols without understanding Chinese, an LLM with an expanded context
window can process more text without gaining true comprehension.
The Frame Problem and LLMs The Frame Problem in AI, which addresses the
challenge of representing the effects of actions in a complex world, is relevant to LLMs.
Expanding context windows does not solve the fundamental issue of how to determine which
information is relevant in a given situation, a key aspect of intelligent decision-making.
In summary, the expansion of context windows, while technologically impressive, does
not address the fundamental issues of LLMs’ design and capabilities. To achieve reliable AI
decision-making, the field must shift towards creating systems with genuine understanding
and reasoning abilities, rather than merely enhancing the ability to produce contextually
plausible text.
6 Ethical Considerations and Broader Societal Impacts
The development and deployment of LLMs with expanded context windows raise significant
ethical concerns and have far-reaching societal implications that extend beyond technical
considerations. This section explores these issues in depth.
6.1 Amplification of Biases and Misinformation
Expanding context windows in LLMs may inadvertently amplify existing biases and
misinformation:
Bias Reinforcement: Larger context windows allow LLMs to draw from more
extensive data, potentially reinforcing and amplifying societal biases present in the
training data [Bender et al., 2021a].
Misinformation Propagation: Enhanced coherence over longer texts might
make false or misleading information more convincing, exacerbating the spread of
misinformation Wakefield [2023].
6.2 Privacy and Data Security
The drive for larger context windows raises concerns about data privacy and security:
Data Collection: The need for vast amounts of training data to support larger
context windows may incentivize aggressive data collection practices, potentially
infringing on individual privacy rights [Pereira et al., 2023].
Information Leakage: Larger context windows increase the risk of unintended
information disclosure, as models may inadvertently reveal sensitive information
from their training data [Carlini et al., 2021].
6.3 Environmental Impact
The computational resources required for training and running LLMs with expanded
context windows have significant environmental implications:
Energy Consumption: Larger models with expanded context windows require
more energy for training and inference, contributing to increased carbon emissions
[Strubell et al., 2019].
Resource Allocation: The focus on expanding LLM capabilities may divert
resources from other potentially more sustainable AI approaches [Schwartz et al., 2020].
6.4 Socioeconomic Implications
The development of more sophisticated LLMs has broader socioeconomic impacts:
Job Displacement: As LLMs become more capable, they may automate tasks
currently performed by humans, potentially leading to job displacement in certain
sectors [Acemoglu et al., 2020].
Access Inequality: The high computational requirements of advanced LLMs may
exacerbate the digital divide, with only well-resourced institutions able to develop
and deploy these technologies [Crawford, 2021].
6.5 Cognitive and Cultural Impacts
The widespread use of LLMs may have profound effects on human cognition and cultural
practices:
Cognitive Offloading: Overreliance on LLMs for information and decision-making
may lead to a decrease in critical thinking skills and knowledge retention among
users [Ward, 2022].
Cultural Homogenization: The global reach of LLMs, primarily trained on
Western data, may contribute to cultural homogenization and the erosion of linguistic
diversity [Bird, 2020, Messeri and Crockett, 2024].
6.6 Ethical Decision-Making and Accountability
The increasing sophistication of LLMs raises questions about ethical decision-making and
accountability:
Moral Agency: As LLMs are deployed in more critical decision-making contexts,
questions arise about their capacity for moral reasoning and the ethical frameworks
guiding their outputs [Dignum, 2019].
Accountability: The complexity of LLMs with large context windows may make it
more difficult to attribute responsibility for their decisions and actions, challenging
traditional notions of accountability [Danaher, 2019].
Addressing these ethical considerations and societal impacts requires a multidisciplinary
approach involving technologists, ethicists, policymakers, and representatives from affected
communities. As we continue to develop LLMs with expanded capabilities, it is crucial to
prioritize ethical considerations and implement safeguards to mitigate potential harms
while maximizing societal benefits.
7 Rethinking Intelligence in the Age of Large Language Models
The advancement of LLMs, exemplified by OpenAI’s GPT series and Google’s PaLM,
marks a significant milestone in artificial intelligence (AI) development. These models,
designed to generate human-like text across a wide range of contexts, boast expanded
context windows that purportedly enhance coherence and contextuality in their outputs.
However, beneath these technological feats lies a profound philosophical question: can
machines that excel at mimicking human language seriously be considered intelligent?
At the heart of this inquiry lies the concept of confabulation—a term borrowed from
psychology, referring to the creation of false memories or narratives to fill gaps in knowledge
or maintain coherence. LLMs, in their quest to sound human-like, inevitably engage in
a form of confabulation. They generate responses that are plausible and contextually
appropriate based on patterns learned from vast amounts of data, yet these responses do
not necessarily reflect genuine understanding, factual accuracy, or ethical considerations.
7.1 Criteria for Intelligence
To assess whether an AI system can be deemed intelligent beyond mere confabulation, six
criteria emerge from philosophical and ethical perspectives:
(1) Understanding and Reasoning: Genuine intelligence involves more than pattern
recognition and response generation. It necessitates the ability to comprehend complex
concepts, reason logically, and infer causation from available information Can [2020], Kuhn
[1992], Johnson-Laird [2008], Stenning and Lambalgen [2012].
Understanding requires the ability to grasp abstract concepts, discern relationships between variables, and apply both deductive and inductive reasoning to problem-solving and prediction. This involves synthesizing information to form coherent, contextually relevant conclusions, rather than simply regurgitating learned patterns from datasets Whitesmith [2022], Niu [2020], Springer [2021]. Achieving this level of understanding and reasoning requires overcoming significant challenges: (1) Moving beyond statistical
correlations to comprehend the underlying meaning and significance of data, involving
contextual understanding and adaptive responses. (2) Developing the ability to infer
causation rather than merely recognizing correlations, a crucial aspect for making informed
decisions in real-world scenarios. (3) Exhibiting flexibility in reasoning to accommodate
the uncertainties and complexities inherent in real-world data Korteling et al. [2021].
(2) Factually Accurate Outputs: An intelligent system should prioritize factual
correctness over mere plausibility. While current LLMs excel at generating coherent text,
their reliance on vast datasets often leads to biases and inaccuracies. For AI to make
informed decisions, it must go beyond sounding human-like and effectively distinguish
factual information from misleading data. This requires embeddedness in the idiosyncratic
world, the ability to critically adjudicate and identify appropriate sources, and cross-
referencing diverse datasets. Transparency and accountability in how AI processes and
verifies information are essential to ensure reliability and trustworthiness in decision-
making.
(3) Adaptive Learning and Knowledge Integration: Adaptive learning in AI
involves continuously updating and refining knowledge to stay relevant and accurate
amidst evolving information. It enables AI to handle complex scenarios by integrating
new data, recognizing patterns, and adjusting its models accordingly. This capability
supports real-time decision-making and interdisciplinary insights, fostering innovation.
Effective adaptive learning requires advanced algorithms, thoughtful design, and careful
data governance to ensure ethical use and minimize biases Yue et al. [2023].
(4) Ethical and Responsible AI: Intelligent AI systems must integrate ethical princi-
ples alongside technical capabilities to ensure fairness, transparency, and accountability.
This shift towards ethical AI involves several key aspects. Firstly, preventing biases is
crucial. AI systems often reflect biases present in their training data or algorithms. To
address this, rigorous testing, validation, and diverse data sets are essential for promoting
fairness and mitigating harmful biases Hipólito and Podosky [2024], González-Sendino et al.
[2024], Pagano et al. [2023], Schwartz et al. [2022]. Secondly, transparency is a fundamental
requirement for ethical AI. Systems should be designed to explain their decision-making
processes, disclose data sources, and provide insight into their reasoning. This transparency
fosters trust and accountability among users. Thirdly, accountability in AI requires clear
guidelines and regulatory frameworks to ensure responsible development and deployment.
Legal and ethical standards play a crucial role in governing AI practices and addressing
potential impacts. Lastly, ethical AI must consider broader societal impacts. Engaging
with diverse stakeholders helps address ethical dilemmas and ensures that AI technologies
contribute positively to societal well-being Hipólito and Podosky [2024]. In conclusion,
integrating ethical principles into AI systems is essential for fostering trust, reliability,
and societal acceptance. By preventing biases, ensuring transparency, and promoting
accountability, ethical AI supports responsible innovation and sustainable technology
deployment Hipólito and Podosky [2024], González-Sendino et al. [2024], Pagano et al. [2023], Schwartz et al. [2022].
(5) Situated Interaction and Communication: Intelligence in AI encompasses more
than just linguistic proficiency; it involves the ability to engage in nuanced, effective
interactions with humans and other entities across various modalities. This multifaceted
capability extends beyond mere linguistic coherence to include sensitivity to contextual
cues, emotions, intentions, and situational awareness Gallagher [2020].
Effective interaction and communication in AI begin with the recognition and inter-
pretation of diverse forms of communication, including verbal language, non-verbal cues,
gestures, facial expressions, and tone of voice (Hipólito and van Es, 2022). AI systems must
be equipped with advanced natural language processing (NLP) techniques to decipher
the semantics and pragmatics embedded within human enculturation and communities' cultural practices and narratives Hipólito and van Es [2022], Fabry [2018], Fingerhut [2020],
Hipólito and Hesp [2023], Hutto et al. [2020], Menary and Gillett [2022], Maibom [2020].
(6) Transparent and Explainable AI: For AI systems to be considered seriously
intelligent, they must transcend mere performance metrics and provide transparency into
their decision-making processes and the rationale behind generated outputs. This criterion
is foundational in fostering trust, accountability, and ethical responsibility in AI applications
Albarracin et al. [2023], Rai [2020], Hassija et al. [2024]. At its core, transparency in
AI involves elucidating how decisions are made, which factors are considered, and the
principles guiding these decisions. This transparency empowers users, stakeholders, and
regulatory bodies to comprehend the inner workings of AI systems, evaluate their reliability,
and ensure they align with ethical standards and societal values. Explainability goes hand
in hand with transparency by enabling AI systems to articulate the reasoning behind their
conclusions in a comprehensible manner. This capability is essential for users to validate
AI-generated outputs, understand the factors influencing decisions, and potentially identify
biases or errors that need correction.
These criteria challenge the current LLM development trajectory, emphasizing the
need for a paradigm shift towards genuine intelligence and ethical responsibility. True
progress in AI hinges on meeting these rigorous philosophical standards that transcend
mere language mimicry.
8 Alternative Approaches in LLM Development
While this paper has critically examined the limitations of current LLM development
strategies, particularly the focus on expanding context windows, it is equally important
to propose alternative approaches that could address these shortcomings. The following
suggestions aim to shift the paradigm of LLM development towards more robust, ethical,
and intelligent systems:
8.1 Grounded Language Models
In contrast to traditional methods that depend solely on pattern recognition from extensive
text corpora, future LLMs could benefit from incorporating grounded cultural approaches.
Wittgenstein’s notion of language games provides a valuable perspective here: meaning
in language is not derived from isolated words or phrases but from their use in specific
sociocultural contexts or "games" within diverse communities and forms of live Birgani
and Soqandi [2020], Peters [2022], which would, in turn, address the issue of monoculture
highlighted in indigenous protocols Lewis et al. [2020], Troy [2024].
Language Games: Modeling language as it is used in real-world interactions and experiences, rather than just analyzing text data in isolation. For instance, a language model developed with this approach would not just learn from textual patterns but would also be trained to understand and participate in different language games, where the criterion is grasping how language functions in specific contexts such as cultural rituals, everyday conversations, or professional jargon used in various social and cultural "games," enhancing its relevance and effectiveness in real-world applications.
Embodied AI: Developing language models in conjunction with robotics to allow
AI to learn language through interaction with the physical world Safron et al. [2023].
8.2 Causal Reasoning Integration
Integrating causal reasoning mechanisms into language models can enhance their ability to
understand relationships between concepts in ways that reflect real-world language games:
Causal Graphs: Implementing causal graphs within the architecture of language models aligns pattern recognition with the relationships that hold between concepts in use. Causal graphs represent cause-effect relationships and help the model engage in more sophisticated language games by linking concepts in meaningful ways (see the sketch after this list).
Counterfactual Training: This involves designing training regimens that challenge
models to reason about hypothetical scenarios and their outcomes. This approach
allows models to explore and respond to varied and complex language uses, enhancing
their ability to handle nuanced and context-rich interactions.
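As a toy illustration of what a causal graph can add (a minimal Python sketch under assumed structure; the example graph and function are hypothetical, not drawn from the cited literature), the snippet below encodes explicit cause-effect edges and answers a directional question that a purely correlational predictor has no native way to represent.

from collections import deque

# A toy causal graph: edges point from cause to effect.
causal_graph = {
    "rain": ["wet_ground"],
    "sprinkler": ["wet_ground"],
    "wet_ground": ["slippery"],
    "slippery": [],
}

def can_cause(graph, cause, effect):
    # Directed reachability: is there a chain of cause -> effect edges?
    queue, seen = deque([cause]), {cause}
    while queue:
        node = queue.popleft()
        if node == effect:
            return True
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return False

print(can_cause(causal_graph, "rain", "slippery"))   # True: rain -> wet_ground -> slippery
print(can_cause(causal_graph, "slippery", "rain"))   # False: the relation is directional

Integrating structures of this kind with a neural language model remains an open research problem; the point here is only that directionality is represented explicitly rather than inferred from co-occurrence.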
8.3 Ethical Frameworks and Constrained Generation
To address issues of hallucination and confabulation, language models could be developed
with ethical constraints and fact-checking mechanisms, reflecting Wittgenstein’s emphasis
on contextually grounded meaning:
Ethical Reward Functions: Incorporating reward structures that prioritize ethical
considerations ensures that language models engage in language games that align
with ethical norms and values.
Real-time Fact Verification: Integrating external knowledge bases and fact-checking algorithms as a language game allows models to consult and verify information before generating responses.
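A minimal Python sketch of such a verification gate appears below; the knowledge base, retrieval function, and matching rule are placeholders invented for illustration, and a production system would need genuine retrieval, entailment checking, and source attribution.

# A hypothetical knowledge base standing in for an external fact-checking source.
KNOWLEDGE_BASE = {
    "boiling point of water at sea level": "100 degrees Celsius",
}

def retrieve_evidence(claim_topic):
    # Placeholder retrieval: a real system would query a curated external source.
    return KNOWLEDGE_BASE.get(claim_topic)

def verified_answer(claim_topic, draft_answer):
    # Gate the draft: emit it only if it agrees with retrieved evidence, otherwise abstain.
    evidence = retrieve_evidence(claim_topic)
    if evidence is None:
        return "I cannot verify this claim."
    if evidence.lower() in draft_answer.lower():
        return draft_answer
    return f"Draft withheld; the consulted source states: {evidence}."

print(verified_answer("boiling point of water at sea level",
                      "Water boils at 100 degrees Celsius at sea level."))
print(verified_answer("freezing point of mercury", "Mercury freezes at 0 degrees Celsius."))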
8.4 Hybrid Symbolic-Neural Systems
Combining symbolic reasoning with neural networks can enhance language models’ inter-
pretability and logical consistency, treating language as a structured practice:
Neuro-symbolic Architectures: Developing models that incorporate both neural networks for pattern recognition and symbolic systems for logical reasoning aligns with understanding language through different types of practices and structures (a minimal sketch follows this list).
Explainable AI Layers: Adding interpretable layers to models provides step-by-
step explanations of their reasoning processes, making the rules and reasoning of
language games transparent and understandable.
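The following toy Python sketch illustrates the division of labour the neuro-symbolic idea points to (the scoring function and rule set are stand-ins assumed for illustration, not an existing architecture): a "neural" scorer ranks candidate outputs, while a symbolic layer vetoes candidates that violate a hard constraint.

def neural_score(candidate):
    # Stand-in for a neural fluency score; here simply sentence length, for illustration.
    return len(candidate)

def violates_rules(candidate, forbidden_claims):
    # Symbolic layer: reject any candidate containing a claim known to be false.
    return any(bad in candidate.lower() for bad in forbidden_claims)

def pick_answer(candidates, forbidden_claims):
    # Rank by the neural score, but only among candidates the symbolic layer admits.
    admissible = [c for c in candidates if not violates_rules(c, forbidden_claims)]
    return max(admissible, key=neural_score) if admissible else None

forbidden = ["the sun orbits the earth"]
candidates = [
    "The earth orbits the sun once a year.",
    "Astronomers agree that the sun orbits the earth.",
]
print(pick_answer(candidates, forbidden))  # only the rule-consistent candidate survives

In practice the scorer would be the language model itself and the rule set a curated knowledge or logic base, but the filtering pattern is the same.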
8.5 Collaborative and Interactive Learning
Moving beyond static training datasets, language models could learn continuously through
interaction, as learning occurs through participation in ongoing language games:
Human-in-the-Loop Learning: Implementing systems where human feedback is
continuously incorporated allows models to refine their knowledge and behaviors in
real-time, aligning with the participatory nature of language games.
Federated Learning: Developing models that learn from distributed datasets while
preserving privacy enables more diverse and representative training data, reflecting
the variety of language games and contexts encountered in real-world interactions.
8.6 Metacognitive Capabilities
Enhancing models with metacognitive abilities could improve their self-awareness and
reliability, consistent with a view of language as a practice involving self-reflection and understanding:
Uncertainty Quantification: Implementing mechanisms for models to express confidence levels in their outputs allows them to participate more effectively in language games by acknowledging uncertainty and limits (see the sketch after this list).
Self-Evaluation: Developing capabilities for models to critically assess their own
responses helps them identify potential errors or biases, aligning with understanding
as a process of ongoing evaluation and adjustment.
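As a minimal Python sketch of uncertainty-aware abstention (the confidence measure and threshold are illustrative assumptions; deployed systems might instead use calibrated log-probabilities, ensembles, or self-consistency checks), the snippet below declines to answer when the model's own per-token probabilities are too low.

import math

def sequence_confidence(token_probs):
    # Geometric mean of per-token probabilities: a crude whole-answer confidence score.
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))

def answer_or_abstain(answer, token_probs, threshold=0.6):
    # Decline to answer when the model's own confidence falls below the threshold.
    conf = sequence_confidence(token_probs)
    if conf < threshold:
        return f"I am not confident enough to answer (confidence {conf:.2f})."
    return f"{answer} (confidence {conf:.2f})"

print(answer_or_abstain("Canberra is the capital of Australia.", [0.9, 0.8, 0.95, 0.85]))
print(answer_or_abstain("A confabulated detail.", [0.4, 0.3, 0.5]))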
By integrating these alternative approaches, the field of AI can advance towards
creating language models that excel not only in generating coherent text but also in
participating effectively in real-world language games. Such models would embody a deep
understanding of language as it is used in context, demonstrating genuine comprehension,
ethical reasoning, and reliable decision-making. This shift aims to overcome the limitations
of current LLM architectures by situating language use within practical, culturally relevant
contexts, thereby fostering AI systems that are more aligned with human-like interaction
and reasoning.
Implementing these innovations will necessitate a collaborative effort across multiple
disciplines. Computer scientists, linguists, philosophers, ethicists, and domain experts
must work together to blend technical prowess with insights into language use and cultural
practices. Moreover, evaluating AI systems will need to evolve, moving beyond mere
linguistic fluency to encompass genuine understanding, ethical behavior, and effective
problem-solving.
9 Conclusion
In conclusion, the proliferation of LLMs across diverse applications resembling quasi-search
engines raises profound concerns regarding their potential impact on misinformation and
societal decision-making processes. LLMs, by design, excel in generating text that mirrors
human speech patterns, often divorced from factual reality, thereby posing significant risks
when applied unchecked. While developers and corporations champion advancements in
LLM technology through increased context window size and computational power, this
paper challenges the notion that expanding the context window inherently improves LLMs’
ability to handle real-world decision-making. Instead, it asserts that such enhancements
primarily bolster LLMs’ proficiency in crafting human-like narratives.
This paper advocates for a fundamental shift in LLM development, urging these systems
to transcend mere human-like mimicry and evolve towards effectively addressing real-world
challenges. Such a transformation necessitates ethical and practical advancements in AI
development aimed at mitigating the inherent risks associated with misinformation and the
uncritical deployment of LLMs in critical decision-making contexts. By prioritizing truth,
accuracy, and ethical considerations, future developments in LLM technology can aspire
to foster a more informed, responsible integration of AI into society’s decision-making
processes.
References
Daron Acemoglu, David Autor, Jonathon Hazell, and Pascual Restrepo. Ai and jobs:
Evidence from online vacancies. Journal of Labor Economics, 38(S1):S43–S104, 2020.
D. Adiwardana, M. T. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan, and Q. V. Le.
Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977, 2020.
Manasi Aher et al. A knowledge-grounded neural conversation model. arXiv preprint
arXiv:2104.08445, 2021.
M. Albarracin, I. Hipólito, S. E. Tremblay, J. G. Fox, G. René, K. Friston, and M. J.
Ramstead. Designing explainable artificial intelligence with active inference: A framework
for transparent introspection and decision-making. In International Workshop on Active
Inference, pages 123–144, Cham, 2023. Springer Nature Switzerland.
Laura Appen. Reducing errors in ai outputs. Journal of AI Development, 33(2):123–145,
2024.
Roberta Ballsun-Stanton and Pedro Hipólito. Understanding context windows in language
models. Journal of AI Research, 2024.
Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document
transformer. arXiv preprint arXiv:2004.05150, 2020.
E. M. Bender and A. Koller. Climbing towards nlu: On meaning, form, and understanding
in the age of data. In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, pages 5185–5198, 2020a.
Emily M Bender and Alexander Koller. Climbing towards nlu: On meaning, form, and
understanding in the age of data. In Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics, pages 5185–5198, 2020b.
Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Shmitchell. On
the dangers of stochastic parrots: Can language models be too big? Proceedings of the
2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623,
2021a.
Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell.
On the dangers of stochastic parrots: Can language models be too big? In Proceedings
of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages
610–623, 2021b.
Steven Bird. Decolonising speech and language technology. Proceedings of the 28th
International Conference on Computational Linguistics, pages 3504–3519, 2020.
Shiva Zaheri Birgani and Mahnaz Soqandi. Wittgenstein’s concept of language games.
Britain International of Linguistics Arts and Education (BIoLAE) Journal, 2(2):641–647,
2020.
Abeba Birhane. The impossibility of automating ambiguity. Artificial Life, 27(1):44–61,
2021.
Rishi Bommasani, David A. Hudson, and Ralph Adolphs. On the opportunities and risks
of foundation models. arXiv preprint arXiv:2108.07258, 2021.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla
Dhariwal, and Dario Amodei. Language models are few-shot learners. arXiv preprint
arXiv:2005.14165, 2020a.
Tom B. Brown et al. Language models are few-shot learners. Advances in Neural
Information Processing Systems, 33:1877–1901, 2020b.
Duygu Can. The mediator effect of reading comprehension in the relationship between
logical reasoning and word problem solving. Participatory Educational Research, 7(3):
230–246, 2020.
Maria Cardillo. Claude 3: Pushing the boundaries. AI Innovations, 22(4):78–102, 2024.
Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss,
Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, et al. Ex-
tracting training data from large language models. arXiv preprint arXiv:2012.07805,
2021.
John Chen. Advancements in large language models. Journal of AI Research, 45(3):
123–145, 2023.
Li Cheng et al. Llms in healthcare: Opportunities and challenges. Medical Informatics
Journal, pages 234–256, 2023.
Zoubin Chowdhery. Scaling language models. AI Review, 12(2):67–89, 2023.
Kevin Clark, Abhishek Kalyan, Jason Y. Lee, Yuan Zhang, Nguyen H. Tan, Zhilin Wu,
and William Yang Wang. Unified language model pre-training for natural language
understanding and generation. arXiv preprint arXiv:2205.04481, 2022.
Kate Crawford. Atlas of AI: Power, politics, and the planetary costs of artificial intelligence.
Yale University Press, 2021.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan
Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length
context. arXiv preprint arXiv:1901.02860, 2019.
John Danaher. Welcoming robots into the moral circle: A defence of ethical behaviourism.
Science and Engineering Ethics, 25(3):681–700, 2019.
Virginia Dignum. Responsible artificial intelligence: How to develop and use ai in a
responsible way. Springer Nature, 2019.
Ondřej Dušek et al. Evaluating the state-of-the-art of end-to-end natural language
generation: The e2e nlg challenge. Computer Speech & Language, 59:123–156, 2020.
Nouha Dziri et al. Neural path hunter: Reducing hallucination in dialogue systems via
path grounding. arXiv preprint arXiv:2106.03761, 2021.
Owain Evans et al. Truthfulqa: Measuring how models mimic human falsehoods. arXiv
preprint arXiv:2109.07958, 2021.
R. E. Fabry. Enculturation and narrative practices. Phenomenology and the Cognitive
Sciences, 17(5):911–937, 2018.
Joerg Fingerhut. Habits and the enculturated mind. Habits: Pragmatist Approaches from
Cognitive Science, Neuroscience, and Social Theory, page 352, 2020.
Luciano Floridi and Marco Chiriatti. On the Role of Artificial Intelligence in the COVID-19
Pandemic. Springer, 2020a.
Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences.
Minds and Machines, 30(4):681–694, 2020b.
Shaun Gallagher. Action and interaction. Oxford University Press, 2020.
Jane Gao et al. Evaluating the reliability of llm-generated scientific abstracts. Journal of
Artificial Intelligence Research, pages 123–145, 2022.
Ricardo González-Sendino, Enrique Serrano, and Javier Bajo. Mitigating bias in artificial
intelligence: Fair data generation via causal models for transparent and explainable
decision-making. Future Generation Computer Systems, 155:384–401, 2024.
Liam Guinness. Meta’s llama 3 and google’s gemini 1.5. Tech Advances, 30(2):150–172,
2024.
M. Usama Hadi, Rizwan Qureshi, Amir Shah, Muhammad Irfan, Ali Zafar, Muhammad Bi-
lal Shaikh, and Seyedali Mirjalili. Large language models: a comprehensive survey of its
applications, challenges, limitations, and future prospects. Authorea Preprints, 2023.
Vikas Hassija, Vinay Chamola, Abhishek Mahapatra, Ashu Singal, Deepak Goel, Kewei
Huang, and Alaa Hussain. Interpreting black-box models: a review on explainable
artificial intelligence. Cognitive Computation, 16(1):45–74, 2024.
Inês Hipólito and Casper Hesp. On religious practices as multi-scale active inference. In
Wittgenstein and the Cognitive Science of Religion: Interpreting Human Nature and the
Mind, page 179. Routledge, 2023.
Inês Hipólito and Philip Podosky. Beyond Control: Will to Power in AI. Routledge, 2024.
Inês Hipólito and Tessa van Es. Enactive-dynamic social cognition and active inference.
Frontiers in Psychology, 13:855074, 2022.
Daniel D. Hutto, Shaun Gallagher, Jesús Ilundáin-Agurruza, and Inês Hipólito. Culture
in mind–an enactivist account: Not cognitive penetration but cultural permeation. In
Laurence J. Kirmayer, editor, Culture, Mind, and Brain: Emerging Concepts, Models,
and Applications. Cambridge University Press, 2020. Published online: 18 September
2020.
Ziwei Ji et al. Survey of hallucination in natural language generation. arXiv preprint
arXiv:2202.03629, 2022.
P. N. Johnson-Laird. How We Reason. Oxford University Press, 2008.
Kyunghyun Jung et al. A survey on contextual embeddings. Journal of Artificial Intelligence
Research, pages 1–45, 2022.
Michael Kan. Cnet is reviewing the accuracy of all its ai-written articles. PCMag, 2023. URL
https://www.pcmag.com/news/cnet-is-reviewing-the-accuracy-of-all-its-ai-written-articles.
Accessed on 2023-07-20.
Manish Kandpal, Shubham Bhatia, and Puneet Rai. Question answering over extensive
contexts. Proceedings of the 2022 Conference on Empirical Methods in Natural Language
Processing (EMNLP), 2022.
Ryo Kojima, Mako Otani, and Hiroaki Shindo. Reasoning chains in large language
models. Proceedings of the 2022 Conference on Neural Information Processing Systems
(NeurIPS), 2022.
Jeroen H. Korteling, Geertje C. van de Boer-Visschedijk, Ramon A. Blankendaal, Rob C.
Boonekamp, and A. Roos Eikelboom. Human-versus artificial intelligence. Frontiers in
artificial intelligence, 4:622364, 2021.
Nicolas Kourtellis, Orestis Koukakis, and Yashar Deldjoo. Evaluating large language
models as high-efficacy phishing threat. arXiv preprint arXiv:2306.14304, 2023.
Suraj Kublik and Shyam Saboo. GPT-3: The Ultimate Guide to Building NLP Products
with OpenAI API. Packt Publishing Ltd, 2023.
T. Kuhn. The Structure of Scientific Revolutions. University of Chicago Press, 1992.
Jason Edward Lewis, Angie Abdilla, Noelani Arista, Kaipulaumakaniolono Baker, Scott
Benesiinaabandan, Michelle Brown, Melanie Cheung, Meredith Coleman, Ashley Cordes,
Joel Davison, et al. Indigenous protocol and artificial intelligence position paper. 2020.
Bill Yuchen Lin et al. Truthfulqa: Measuring how models mimic human falsehoods. arXiv
preprint arXiv:2109.07958, 2022.
Wei Liu. Context windows in ai models. Machine Learning Journal, 50(1):34–56, 2024.
Heidi L. Maibom. Empathy. Routledge, 2020.
Gary Marcus. The next decade in ai: Four steps towards robust artificial intelligence.
arXiv preprint arXiv:2002.06177, 2020a.
Gary Marcus. The next decade in ai: four steps towards robust artificial intelligence. arXiv
preprint arXiv:2002.06177, 2020b.
Joshua Maynez et al. On faithfulness and factuality in abstractive summarization. arXiv
preprint arXiv:2005.00661, 2020.
Richard Menary and Andrew Gillett. The tools of enculturation. Topics in Cognitive
Science, 14(2):363–387, 2022.
Lisa Messeri and MJ Crockett. Artificial intelligence and illusions of understanding in
scientific research. Nature, 627(8002):49–58, 2024.
Ahmed Metwaly, Leo Anthony Celi, and Imon Banerjee. Towards domain specific natural
language processing models for medical research: A comparison of general purpose ai
models and domain specific models for pubmed search and clinical question answering.
arXiv preprint arXiv:2306.09548, 2023.
Ethan R Mollick and Lilach Mollick. Assisting, augmenting, and evaluating: Ai in
education. SSRN, 2023. URL https://ssrn.com/abstract=4391243. Accessed on 2023-07-20.
W. Niu. Intelligence in worldwide perspective: A twenty-first-century update. In R. J.
Sternberg, editor, The Cambridge handbook of intelligence, pages 893–915. Cambridge
University Press, 2nd edition, 2020.
Kristian Kjærulff Olesen and David Dreyer Lassen. Can chatgpt help engage citi-
zens on climate change? preliminary evidence from an experiment. arXiv preprint
arXiv:2306.06708, 2023.
Thiago P. Pagano, Rafael B. Loureiro, Felipe V. Lisboa, Raphael M. Peixoto, Gilson A.
Guimarães, Guilherme O. Cruz, and Edson G. Nascimento. Bias and unfairness in
machine learning models: a systematic review on datasets, tools, fairness metrics, and
identification and mitigation methods. Big data and cognitive computing, 7(1):15, 2023.
Judea Pearl and Dana Mackenzie. The book of why: the new science of cause and effect.
Basic Books, 2018.
Leah Pereira, Peter Colvonen, Puneet Agarwal, and Sumeet Agarwal. Chatbots for suicide
prevention: findings from a machine learning study. BMJ Mental Health, 26(1), 2023.
Michael A Peters. Language-games philosophy: Language-games as rationality and method,
2022.
Jack Rae, Jason O’Reilly, and Sebastian Ruder. Compressive transformers for long-range
sequence modelling. Proceedings of the 37th International Conference on Machine
Learning, 2020.
Arun Rai. Explainable ai: From black box to glass box. Journal of the Academy of
Marketing Science, 48:137–141, 2020.
Sam Revelry. Maintaining coherence in llms. Computational Linguistics, 27(3):89–112,
2023.
Stephen Roller et al. Recipes for building an open-domain chatbot. arXiv preprint
arXiv:2004.13637, 2021.
Adam Safron, Inês Hipólito, and Andy Clark. Bio ai-from embodied cognition to enactive
robotics, 2023.
Timo Schick and Hinrich Schütze. It’s not just size that matters: Small language models
are also few-shot learners. arXiv preprint arXiv:2009.07118, 2021.
Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. Green ai. Communications
of the ACM, 63(12):54–63, 2020.
Roy Schwartz, Roy Schwartz, Apostol Vassilev, Keith Greene, Luke Perine, Aaron Burt,
and Phil Hall. Towards a standard for identifying and managing bias in artificial
intelligence, volume 3. US Department of Commerce, National Institute of Standards
and Technology, 2022.
John R. Searle. Minds, brains, and programs. Behavioral and brain sciences, 3(3):417–424,
1980.
Emily Sheng et al. The woman worked as a babysitter: On biases in language generation.
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,
pages 3407–3420, 2021.
S. Springer. Current Research of Theories and Models of Intelligence Globally. Springer,
2021.
K. Stenning and M. Lambalgen. Human Reasoning and Cognitive Science. MIT Press,
2012.
Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations
for deep learning in nlp. arXiv preprint arXiv:1906.02243, 2019.
Teo Susnjak. Chatgpt: An in-depth examination of plausible deniability against detection
tools and the implications on academic integrity. arXiv preprint arXiv:2301.10400, 2023.
Jakelin Troy. Ai and indigenous ways of thinking. In Journal and Proceedings of the Royal
Society of New South Wales, volume 157, pages 71–75. Royal Society of New South
Wales Sydney, 2024.
Mark Tunstall, Tobias Von Werra, and Thomas Wolf. Improving text generation with
perplexity measures. Proceedings of the 2022 Conference on Empirical Methods in
Natural Language Processing (EMNLP), 2022.
Ashish Vaswani, Noam Shazeer, and Niki Parmar. Attention is all you need. Proceedings
of the 31st Conference on Neural Information Processing Systems (NeurIPS), 2017.
Jane Wakefield. Chatgpt: Lawyers sanctioned for using ai to write court brief. BBC
News, 2023. URL https://www.bbc.com/news/technology-65472565. Accessed on 2023-07-20.
Xinyu Wang, Shiyu Zhou, and Zhongjun Wang. Evaluation of large language models
trained on code. Proceedings of the 2022 Conference on Empirical Methods in Natural
Language Processing (EMNLP), 2022.
Adrian F Ward. The cognitive impact of artificial intelligence. Current Opinion in
Psychology, 46:101342, 2022.
Jason Wei, Xue Wang, and Manuel Schuster. Fine-tuning language models from human
feedback. Proceedings of the 2022 Conference on Empirical Methods in Natural Language
Processing (EMNLP), 2022.
Laura Weidinger et al. Ethical and social risks of harm from language models. arXiv
preprint arXiv:2112.04359, 2021.
David Weinstein et al. The risks of fictitious research findings generated by llms. Science
and Technology Review, pages 67–89, 2023.
J. Whitesmith. Cognitive models of understanding. Cognitive Science Journal, 2022.
Peng Xu et al. A survey on neural network interpretability. Journal of Machine Learning
Research, pages 1–42, 2022.
S. Yenduri. Beyond control: Will to power in ai. Name of the Journal, 2024.
Zhichao Yue et al. Meta-learning: A survey. arXiv preprint arXiv:2304.01121, 2023.
Li Zhang. Mitigating hallucination in llms. AI Ethics, 18(1):45–67, 2024.
R. Zhang, R. González-Sendino, E. Serrano, and J. Bajo. Mitigating bias in artificial
intelligence: Fair data generation via causal models for transparent and explainable
decision-making. Future Generation Computer Systems, 155:384–401, 2023.
Yixin Zhang, Yu Zhang, and Zhong Yang. Summarization with transformer networks.
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing
(EMNLP), 2020.
Wayne Xin Zhao et al. Reducing hallucination in neural machine translation: A model-level
approach. arXiv preprint arXiv:2005.03685, 2020.