Daniel M. Oppenheimer’s research while affiliated with Carnegie Mellon University and other places


Publications (90)


Figures:
  • Study Design for Studies 1, 2a, and 2b: General outline of the paradigm used throughout the studies in this paper. Participants began by rating policy solutions (the source of these policy solutions varied by study), were then asked to generate two positive and two negative consequences for those same policy solutions, and finally rated the policy solutions again.
  • Study 3 Format: Outline of the paradigm used in Study 3. Upon entry into the survey, all participants read and rated solutions in the five policy domains from Study 2b. After submitting this initial rating, participants were randomly assigned to one of three conditions, completed the task relevant to their condition, and then submitted a final rating for all policy solutions.
  • Mean Rating Shifts in Study 1
  • Null Relationships between Frequency of Downward Revisions and Individual Difference Measures in Study 1
  • Mean Rating Shifts in Study 2a

Side effects may include: Consequence neglect in generating solutions
  • Article
  • Full-text available

April 2025 · 2 Reads

Christopher Rodriguez · Daniel M. Oppenheimer

Strategies designed to address specific problems often give rise to unintended, negative consequences that, while foreseeable, are overlooked during strategy formulation and evaluation. We propose that this oversight is not due to a lack of knowledge but rather a cognitive bias rooted in focalism—the tendency to focus narrowly on the primary objective, ignoring other relevant factors, such as potential consequences. We introduce the concept of consequence neglect, where problem solvers fail to generate or consider downstream effects of their solutions because these consequences are not central to the proximal goal. Across four studies, we provide evidence supporting this phenomenon. Specifically, we find that individuals rate strategies more negatively after being prompted to generate both positive and negative consequences, suggesting that negative outcomes are not naturally weighted unless attention is explicitly drawn to them. We conclude by discussing the broader implications of consequence neglect for policymaking, business, and more general problem solving, and offer directions for future research.
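A rough way to see what the "rating shifts" in this paradigm amount to is the sketch below: ratings collected before and after consequence generation, with the mean per-participant change as the dependent measure. The column names and toy data are illustrative assumptions, not the authors' materials or analysis code.

```python
# Minimal sketch (not the authors' analysis code): computing per-participant
# mean rating shifts in a pre/post consequence-generation paradigm.
# Column names ("participant", "solution", "pre_rating", "post_rating") are
# illustrative assumptions, not taken from the paper's materials.
import pandas as pd

def mean_rating_shift(df: pd.DataFrame) -> pd.Series:
    """Average (post - pre) rating change per participant.

    Negative values indicate downward revisions after participants
    generated positive and negative consequences for each solution.
    """
    shifts = df["post_rating"] - df["pre_rating"]
    return shifts.groupby(df["participant"]).mean()

# Example with toy data: two participants, two policy solutions each.
toy = pd.DataFrame({
    "participant": [1, 1, 2, 2],
    "solution":    ["A", "B", "A", "B"],
    "pre_rating":  [6, 5, 7, 4],
    "post_rating": [5, 5, 5, 4],
})
print(mean_rating_shift(toy))  # participant 1: -0.5, participant 2: -1.0
```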


Quantifying Uncert-AI-nty: Testing the Accuracy of LLMs’ Confidence Judgments

April 2025 · 1 Read

The rise of Large Language Model (LLM) chatbots, such as ChatGPT and Gemini, has revolutionized how we access information. These LLMs can answer a wide array of questions on nearly any topic. When humans answer questions, especially difficult or uncertain questions, they often accompany their responses with metacognitive confidence judgments indicating their belief in their accuracy. LLMs are certainly capable of providing confidence judgments, and willing to do so, but it is currently unclear how accurate these confidence judgments are. To fill this gap in the literature, the present studies investigate the capability of LLMs to quantify uncertainty through confidence judgments. We compare the absolute and relative accuracy of confidence judgments made by four LLMs (ChatGPT, Bard/Gemini, Sonnet, Haiku) and human participants in two domains of aleatory uncertainty (NFL predictions, Study 1a, n = 502; Oscar predictions, Study 1b, n = 109) and three domains of epistemic uncertainty (Pictionary performance, Study 2, n = 164; trivia questions, Study 3, n = 110; questions about life at a university, Study 4, n = 110). We find several commonalities between LLMs and humans, such as achieving similar levels of absolute and relative metacognitive accuracy (although LLMs tend to be slightly more accurate on both dimensions). We also find that, like humans, LLMs tend to be overconfident. However, unlike humans, LLMs (especially ChatGPT and Gemini) often struggle to adjust their confidence judgments based on past performance, highlighting a key metacognitive limitation.
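For readers unfamiliar with the distinction the abstract draws, the sketch below shows one common way absolute accuracy (calibration) and relative accuracy (discrimination) of confidence judgments can be operationalized. It is an assumption-laden illustration with made-up numbers, not the scoring procedure used in these studies.

```python
# Illustrative sketch of the two accuracy notions the abstract contrasts;
# the paper's exact scoring rules are not reproduced here.
from statistics import mean

def absolute_accuracy(confidences, correct):
    """Calibration-style measure: mean confidence minus proportion correct.

    Positive values indicate overconfidence, negative values underconfidence.
    `confidences` are probabilities in [0, 1]; `correct` are 0/1 outcomes.
    """
    return mean(confidences) - mean(correct)

def relative_accuracy(confidences, correct):
    """Discrimination-style measure: how much more confident the judge was
    on items answered correctly than on items answered incorrectly."""
    hits   = [c for c, ok in zip(confidences, correct) if ok]
    misses = [c for c, ok in zip(confidences, correct) if not ok]
    if not hits or not misses:
        return float("nan")
    return mean(hits) - mean(misses)

confs   = [0.9, 0.8, 0.6, 0.7, 0.95]
correct = [1,   1,   0,   0,   1]
print(absolute_accuracy(confs, correct))  # 0.79 - 0.6 ≈ 0.19 (overconfident)
print(relative_accuracy(confs, correct))  # ≈ 0.88 - 0.65 ≈ 0.23
```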




You’ve Got AI Friend in Me: LLMs as Collaborative Learning Partners

February 2025

Collaborative learning - an educational paradigm in which students work together to master content and complete projects - has long been a staple of classroom pedagogy. However, the rapid development of Large Language Model chatbots (LLMs), such as ChatGPT, Gemini, and Claude, has created the potential for a new frontier in collaborative learning in which students collaborate not with other students, but with LLMs. In this classroom-based study, we tested a novel intervention in which college students taking an introductory level social science elective (n = 154) were asked multiple times throughout a semester to write argumentative essays, have them critiqued by LLMs, and improve their essays by incorporating the LLMs’ critiques into their arguments or countering the LLMs’ arguments. We found that students enjoyed working with LLMs and that collaborating with LLMs improved students’ argumentative writing skills over the course of the semester - even when essay quality was assessed before students elicited the LLMs’ critiques. Students also improved their prompt engineering skills and increased their self-efficacy for working with LLMs. These findings suggest that collaborative learning with LLMs may be a highly scalable, low-cost intervention for improving students’ argumentative writing skills at the college level. Implications and limitations of the intervention are discussed.


Generative Chatbots AIn’t Experts: Exploring Cognitive and Metacognitive Limitations That Hinder Expertise in Generative Chatbots

December 2024 · 22 Reads · 2 Citations

Despite their ability to answer complex questions, it is unclear whether generative chatbots should be considered experts in any domain. There are several important cognitive and metacognitive differences that separate human experts from generative chatbots. First, human experts’ domain knowledge is deep, efficiently structured, adaptive, and intuitive—whereas generative chatbots’ knowledge is shallow and inflexible, leading to errors that human experts would rarely make. Second, generative chatbots lack access to critical metacognitive capacities that allow humans to detect errors in their own thinking and communicate this information to others. Though generative chatbots may surpass human experts in the future—for now, the nature of their knowledge structures and metacognition prevent them from reaching true expertise.



How to Interpret Discrepancies in Empirical Results From Educational Intervention Studies

When Mueller and Oppenheimer (2014) documented the longhand advantage—the finding that students learn better when they take notes using pen and paper rather than a laptop—the study went viral, influencing classroom laptop policies around the world. However, more recently Urry et al. (2021) followed up on the original finding but found no evidence for a longhand advantage. Similarly, the literature on whether interventions to improve executive functions lead to improved classroom performance is full of mixed and contradictory findings. Instructors who hope to use evidence-based interventions in the classroom are confronted with a dilemma when they encounter discrepant studies regarding whether or not an intervention is effective. One helpful approach to navigating this challenge is to consider the differences between direct replications and generalization studies. While direct replications recreate the conditions of the original study as closely as possible to determine whether the results are reliable and trustworthy, generalization studies investigate whether the results of the original study are robust to novel conditions, populations, or environments. Because no study can perfectly replicate the conditions of another study, it is not always clear whether a study should be considered a replication or generalization. To resolve this, we describe a framework based in the intervention literature that proposes five principles for evaluating the match between original studies and their follow-ups. We illustrate the utility of this framework using the discrepant findings from Urry et al. (2021) and Mueller and Oppenheimer (2014), and the mixed evidence from the executive control interventions literature as case studies. Broader implications and practicalities for interpreting findings from replication attempts are discussed.


You’ve Got AI Friend in Me: LLMs as Collaborative Learning Partners

September 2024 · 9 Reads · 1 Citation




Citations (79)


... Human-in-the-loop systems can support expertise, ethical judgment, and nuanced decision-making in contexts where AI alone might fall short (17)(18)(19)(20). Threats: AI lacks true expertise and may rely on dubious sources and false narratives, especially where data voids exist (21)(22)(23)(24)(25)(26). AI can create illusory understandings while cementing misconceptions (27) and perpetuating biases (28, 29). ...

Reference: The Seven Roles of Artificial Intelligence: Potential and Pitfalls in Combating Misinformation
Generative Chatbots AIn’t Experts: Exploring Cognitive and Metacognitive Limitations That Hinder Expertise in Generative Chatbots

... Increasingly, scholars are exploring a range of non-functional influences in perceptions and practices of leadership in non-western contexts (Eyong, 2024, 2019; Umeh and Wallace, 2024) and non-Anglo-Saxon contexts (Jackson, 2020; Prince, 2005; Spiller et al., 2020) and highlighting consequences in an increasingly multi-national workforce (Boussebaa, 2024; Newlands, 2022). Emergent critical leadership studies (CLS), pursuing the social and human aspects of leadership as an alternative to the dominant functionalist paradigm, posit that not all decisions are rationally concluded (Cash and Oppenheimer, 2024; Collinson, 2020; Ford et al., 2008). Peering from critical dialectical leadership research perspectives, David Collinson, for instance, argued that leadership power dynamics typically take multiple, simultaneous forms, as well as interconnecting activities that are often mutually reinforcing but also tensional. ...

Assessing metacognitive knowledge in subjective decisions: The knowledge of weights paradigm
  • Citing Article
  • November 2024

Thinking and Reasoning

... We are unaware of any studies that have causally tested these concerns, but they remain foregrounded in academic, scientific, and layperson discussions of technologies (e.g., Marsh & Rajaram, 2019). Encouragingly, one recent study examining the effects of LLMs on college essay writing reported that improvements were observed even after people had finished collaborating with the LLMs (Oppenheimer et al., 2024). These results, as with much of the emerging work on people's understandings and use of AI, warrant additional empirical testing. ...

You’ve Got AI Friend in Me: LLMs as Collaborative Learning Partners
  • Citing Preprint
  • September 2024

... Given this, the findings also introduce the possibility that other explanatory virtues could also stem from the ways people assess methods for completing goals. As an example, consider the explanatory virtue of scope: people prefer explanations that lead to fewer unverified outcomes (e.g., Khemlani et al., 2011; Khemlani et al., 2024). For example, people may be reluctant to explain why a woman painted her nails in the shower by positing that she has obsessive compulsive disorder, because this explanation makes many unverified predictions. ...

The latent scope bias: Robust and replicable
  • Citing Article
  • August 2024

Cognition

... For example, high-performing students more accurately predict their performance on tests than low-achieving students (Hacker et al., 2000), expert chess players generate more accurate judgments of learning about opponents' strategies than novices (de Bruin et al., 2007), and content experts in the domains of climate science, statistics, and investment display less overconfidence than nonexperts, as well as greater metacognitive knowledge (Han & Dunning, 2024). Of course, experts are still subject to common human biases, such as overconfidence (Einhorn & Hogarth, 1978), and there are domains in which experts' metacognitive judgments are no more accurate (but rarely less accurate) than those of novices (Cash & Oppenheimer, 2024; Han & Dunning, 2024; Lichtenstein & Fischhoff, 1977). However, on average, there is a positive relationship between expertise and accuracy of metacognitive judgments. ...

Quantifying UncertAInty: Testing the Accuracy of LLMs’ Confidence Judgments
  • Citing Preprint
  • July 2024

... In other words, if people are assigning a certain weight to a source, can they produce a clear account of how they do it? Preliminary research suggests that when people must base their choice on a combination of several attributes, they often use quick heuristics to solve problems (Glöckner & Betsch, 2023), but struggle to know the factors that influenced them the most (Cash & Oppenheimer, 2024a, 2024b). People often misattribute the true reasons for their decisions (Celadin et al., …). In the present study, we have developed a novel paradigm that mimics combining information from sources of variable trustworthiness. ...

Parental rights or parental wrongs: Parents’ metacognitive knowledge of the factors that influence their school choice decisions
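As a loose illustration of the weighting question raised in the excerpt above, implicit cue weights can be recovered by regressing judgments on the source inputs. The linear-combination model, simulated data, and variable names below are assumptions of this sketch, not the knowledge-of-weights paradigm's actual implementation.

```python
# Loose illustration only: recovering implicit cue weights with least squares.
# The simulated "true" weights and the linear-combination assumption are mine,
# not taken from the knowledge-of-weights paradigm itself.
import numpy as np

rng = np.random.default_rng(0)

n_trials = 200
cues = rng.uniform(0, 10, size=(n_trials, 3))    # estimates from 3 sources
true_weights = np.array([0.6, 0.3, 0.1])          # source trustworthiness
judgments = cues @ true_weights + rng.normal(0, 0.5, n_trials)

# Least-squares estimate of the weights the judge actually used.
estimated, *_ = np.linalg.lstsq(cues, judgments, rcond=None)
print(np.round(estimated, 2))   # ≈ [0.6, 0.3, 0.1]

# The metacognitive question is whether people can *report* weights
# resembling `estimated`; the regression only recovers what they did.
```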

... We took two steps to ensure data quality: 1. Upfront attention checks were used to screen out potential bots or inattentive subjects. 2. Before exiting the survey, participants were given an open-ended free response question that sought to differentiate humans from bots [18]. After accounting for these screens, our final sample size for analysis consisted of 95 participants (36% female, mean age = 40 years). ...

Creating a Bot-tleneck for malicious AI: Psychological methods for bot detection

Behavior Research Methods

... A question whose answer implies the likelihood a respondent is an authentic participant. For example, an image is provided of feet made from ice and participants are asked to "please describe what is likely to happen to the foot below on a warm, sunny day" (Rodriguez and Oppenheimer 2023). An authentic participant [who is not experiencing low vision or blindness] should answer "melt." Nevertheless, caution should be taken using this measure in isolation to identify and exclude bot data, as it scores respondents as 'likely' to be a bot, and some bots may not be flagged, making additional measures necessary (Rodriguez and Oppenheimer 2022). ...

Creating a Bot-tleneck for Malicious AI: Psychological Methods for Bot Detection
  • Citing Article
  • January 2023

SSRN Electronic Journal
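As a rough sketch of the kind of open-ended screening item described in the excerpts above, a free-text response could be flagged when it never mentions the expected outcome. The keyword list and function below are hypothetical simplifications, not the scoring procedure from the cited papers, and a keyword match would miss paraphrases a human rater would accept.

```python
# Rough illustration of an open-ended bot screen: flag responses that never
# mention the expected concept ("melt"). Hypothetical simplification, not the
# cited papers' scoring procedure.
EXPECTED_KEYWORDS = {"melt", "melts", "melted", "melting", "turn to water"}

def likely_bot(response: str) -> bool:
    """Heuristically flag a free-text response as bot-like if it fails to
    mention what would happen to a foot made of ice on a warm, sunny day."""
    text = response.lower()
    return not any(keyword in text for keyword in EXPECTED_KEYWORDS)

print(likely_bot("It would melt into a puddle."))          # False (human-like)
print(likely_bot("The foot is very nice and I like it."))  # True  (flag for review)
```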

... P-value < 0.001). Secondly, a positive effect is observed between consistency and the following cognitive skills of the subjects: i) in line with Amador-Hidalgo et al. [23], more reflective behavior, measured by the adapted cognitive reflection test (CRT) from Thomson and Oppenheimer [46], increases consistent behavior; ii) students with better financial skills (finance; which covers three exercises involving mathematical and financial calculations) are more likely to exhibit consistent behavior; iii) and adolescents with higher academic grades (Grade Point Average; GPA) tend to be consistent in the gumball task. Therefore, cognitive skills are good predictors of consistency. ...

Investigating an alternate form of the cognitive reflection test

Judgment and Decision Making

... The observed data, which is unusual in nature, suggests that this particular observation is outside the expected range or distribution of the primary group. This may imply that there is a process or a variable that is different from the primary group [52][53][54]. The existence of outliers in predictors and responses can substantially affect the effectiveness of machine-learning systems. ...

How people deal with …............................ outliers
  • Citing Article
  • October 2022

Journal of Behavioral Decision Making
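For context on the excerpt above, one standard screening rule for flagging points that may come from a different generating process is Tukey's IQR fence. The sketch below is generic background, not the method examined in the cited article.

```python
# Background sketch only: a standard Tukey IQR fence for flagging outliers.
# This is a common screening rule, not the method of the cited article.
import numpy as np

def iqr_outlier_mask(values, k: float = 1.5):
    """Return a boolean mask marking points beyond k * IQR from the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    values = np.asarray(values)
    return (values < lower) | (values > upper)

data = [4.9, 5.1, 5.0, 5.2, 4.8, 12.0]   # one point from a different process?
print(iqr_outlier_mask(data))            # [False False False False False  True]
```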