Tom M. Mitchell’s research while affiliated with Carnegie Mellon University and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (276)


AI Mentors for Student Projects: Spotting Early Issues in Computer Science Proposals
  • Preprint
  • File available

February 2025 · 13 Reads

Gati Aher · Robin Schmucker · Tom Mitchell

When executed well, project-based learning (PBL) engages students' intrinsic motivation, encourages students to learn far beyond a course's limited curriculum, and prepares students to think critically and maturely about the skills and tools at their disposal. However, educators experience mixed results when using PBL in their classrooms: some students thrive with minimal guidance and others flounder. Early evaluation of project proposals could help educators determine which students need more support, yet evaluating project proposals and student aptitude is time-consuming and difficult to scale. In this work, we design, implement, and conduct an initial user study (n = 36) for a software system that collects project proposals and aptitude information to support educators in determining whether a student is ready to engage with PBL. We find that (1) users perceived the system as helpful for writing project proposals and identifying tools and technologies to learn more about, (2) educator ratings indicate that users with less technical experience in the project topic tend to write lower-quality project proposals, and (3) GPT-4o's ratings show agreement with educator ratings. While the prospect of using LLMs to rate the quality of students' project proposals is promising, its long-term effectiveness strongly hinges on future efforts at characterizing indicators that reliably predict students' success and motivation to learn.
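For readers curious how agreement between GPT-4o and educator ratings might be quantified, here is a minimal, hypothetical sketch; the score lists, the shared 1-5 scale, and the choice of agreement metrics are illustrative assumptions, not the study's actual analysis:

```python
# Hypothetical sketch: agreement between educator and LLM proposal ratings
# on a shared 1-5 scale. The scores below are illustrative, not study data.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

educator_scores = [4, 2, 5, 3, 1, 4, 2, 5]  # expert ratings of 8 proposals
llm_scores      = [4, 3, 5, 3, 1, 4, 2, 4]  # LLM ratings of the same proposals

rho, p_value = spearmanr(educator_scores, llm_scores)
kappa = cohen_kappa_score(educator_scores, llm_scores, weights="quadratic")

print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
print(f"Quadratic-weighted kappa = {kappa:.2f}")
```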





Fig. 1. UI of Ruffle&Riley. (a) Learners are asked to teach Ruffle (student agent) in a free-form conversation and to request help as needed from Riley (professor agent). (b) The learner can navigate the lesson material during the conversation. (c) Ruffle encourages the learner to explain the content. (d) Riley responds to a help request. (e) Riley detects a misconception and prompts the learner to revise their response.
Fig. 4. Temporal Interaction Patterns. By visualizing the usage of text navigation, chat response, and help request features over time, we observe four distinct usage patterns.
Learning performance of Ruffle&Riley users for each usage pattern.
Ruffle&Riley: Insights from Designing and Evaluating a Large Language Model-Based Conversational Tutoring System

July 2024 · 88 Reads · 10 Citations

Conversational tutoring systems (CTSs) offer learning experiences through interactions based on natural language. They are recognized for promoting cognitive engagement and improving learning outcomes, especially in reasoning tasks. Nonetheless, the cost associated with authoring CTS content is a major obstacle to widespread adoption and to research on effective instructional design. In this paper, we discuss and evaluate a novel type of CTS that leverages recent advances in large language models (LLMs) in two ways: First, the system enables AI-assisted content authoring by inducing an easily editable tutoring script automatically from a lesson text. Second, the system automates the script orchestration in a learning-by-teaching format via two LLM-based agents (Ruffle&Riley) acting as a student and a professor. The system allows for free-form conversations that follow the ITS-typical inner- and outer-loop structure. We evaluate Ruffle&Riley's ability to support biology lessons in two between-subject online user studies (N = 200) comparing the system to simpler QA chatbots and a reading activity. Analyzing system usage patterns, pre/post-test scores and user experience surveys, we find that Ruffle&Riley users report high levels of engagement and understanding and perceive the offered support as helpful. Even though Ruffle&Riley users require more time to complete the activity, we did not find significant differences in short-term learning gains over the reading activity. Our system architecture and user study provide various insights for designers of future CTSs. We further open-source our system to support ongoing research on effective instructional design of LLM-based learning technologies.
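As a rough illustration of the learning-by-teaching orchestration described in the abstract, the sketch below alternates between a student agent and a professor agent. The `llm` helper, the prompts, and the routing rule are hypothetical placeholders, not the actual Ruffle&Riley implementation:

```python
# Toy sketch of a two-agent, learning-by-teaching turn. Not the Ruffle&Riley
# system itself: `llm`, the prompts, and the routing rule are placeholders.

def llm(system_prompt: str, history: list) -> str:
    # Stand-in for a chat-completion API call; returns a canned reply here.
    return f"[reply generated under prompt: {system_prompt[:30]}...]"

RUFFLE = ("You are Ruffle, a curious student agent. Ask the learner to explain "
          "the current lesson step and probe gently for misconceptions.")
RILEY = ("You are Riley, a professor agent. When the learner asks for help, "
         "give a short explanation grounded in the lesson text.")

def tutoring_turn(history: list, learner_message: str):
    history.append({"role": "learner", "content": learner_message})
    if learner_message.strip().lower().startswith("help"):
        speaker, reply = "Riley", llm(RILEY, history)    # help request -> professor
    else:
        speaker, reply = "Ruffle", llm(RUFFLE, history)  # otherwise -> student agent
    history.append({"role": speaker, "content": reply})
    return speaker, reply

history = []
print(tutoring_turn(history, "Mitosis produces two identical daughter cells."))
print(tutoring_turn(history, "help: what happens during anaphase?"))
```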


Automated Generation and Tagging of Knowledge Components from Multiple-Choice Questions

May 2024 · 208 Reads

Knowledge Components (KCs) linked to assessments enhance the measurement of student learning, enrich analytics, and facilitate adaptivity. However, generating and linking KCs to assessment items requires significant effort and domain-specific knowledge. To streamline this process for higher-education courses, we employed GPT-4 to generate KCs for multiple-choice questions (MCQs) in Chemistry and E-Learning. We analyzed discrepancies between the KCs generated by the Large Language Model (LLM) and those created by humans through an evaluation by three domain experts in each subject area. This evaluation aimed to determine whether, in instances of non-matching KCs, evaluators showed a preference for the LLM-generated KCs over their human-created counterparts. We also developed an ontology induction algorithm to cluster questions that assess similar KCs based on their content. Our most effective LLM strategy accurately matched KCs for 56% of Chemistry and 35% of E-Learning MCQs, with even higher success when considering the top five KC suggestions. Human evaluators favored LLM-generated KCs, choosing them over human-assigned ones approximately two-thirds of the time, a preference that was statistically significant across both domains. Our clustering algorithm successfully grouped questions by their underlying KCs without needing explicit labels or contextual information. This research advances the automation of KC generation and classification for assessment items, alleviating the need for student data or predefined KC labels.
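A rough approximation of clustering questions that assess similar KCs by their content might look like the sketch below; TF-IDF plus k-means is an illustrative stand-in, not the paper's ontology induction algorithm, and the example questions are invented:

```python
# Rough stand-in for grouping MCQs that assess similar KCs by their text.
# TF-IDF + k-means is an illustrative choice, not the paper's algorithm.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

questions = [
    "Balance the equation for the combustion of methane.",
    "How many moles of O2 are needed to burn 2 mol of CH4?",
    "Which cognitive load type do worked examples reduce?",
    "Why do worked examples help novice learners?",
]

X = TfidfVectorizer(stop_words="english").fit_transform(questions)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for cluster, question in sorted(zip(labels, questions)):
    print(cluster, question)
```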



Fig. 1. Example views from the concept Human Chromosomes. [Left] In the Lesson section the student interacts with multi-modal learning materials. [Right] During Adaptive Practice the student develops and tests their understanding by answering practice questions. In the example shown, the system displays a paragraph with an illustration to assist the student before they reattempt the question after an initial incorrect response.
Fig. 2. Example of assistance action evaluation for one individual question.
Data collection overview. The Overall column shows statistics on the raw data collected for all content in the Biology course. The Offline/Online Evaluation columns show statistics on the data that went into the offline/online evaluation experiments.
Offline evaluation of various policies across 1,336 questions. The first two rows show no-assistance and randomized policies as baselines. The following four rows are bandit policies optimized with our algorithm for different learning outcome measures and the reward function. We report mean values and 95% confidence intervals.
Types of assistance actions selected by the multi-armed bandit policy learned using our reward function for all 1,336 questions. The individual columns show how the policy focuses on different types of assistance actions for different types of questions.
Learning to Give Useful Hints: Assistance Action Evaluation and Policy Improvements

August 2023 · 45 Reads · 2 Citations

Lecture Notes in Computer Science

We describe a fielded online tutoring system that learns which of several candidate assistance actions (e.g., one of multiple hints) to provide to students when they answer a practice question incorrectly. The system learns, from large-scale data of prior students, which assistance action to give for each of thousands of questions, to maximize measures of student learning outcomes. Using data from over 190,000 students in an online Biology course, we quantify the impact of different assistance actions for each question on a variety of outcomes (e.g., response correctness, practice completion), framing the machine learning task as a multi-armed bandit problem. We study relationships among different measures of learning outcomes, leading us to design an algorithm that for each question decides on the most suitable assistance policy training objective to optimize central target measures. We evaluate the trained policy for providing assistance actions, comparing it to a randomized assistance policy in live use with over 20,000 students, showing significant improvements resulting from the system's ability to learn to teach better based on data from earlier students in the course. We discuss our design process and challenges we faced when fielding data-driven technology, providing insights to designers of future learning systems.

Keywords: intelligent tutoring systems, multi-armed bandits
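The multi-armed bandit framing in the abstract can be illustrated with a minimal epsilon-greedy sketch: one bandit per question over its candidate assistance actions, with reattempt correctness as the reward. This is an assumption-laden toy, not the fielded system's algorithm or objective-selection procedure:

```python
import random
from collections import defaultdict

class PerQuestionBandit:
    """Epsilon-greedy bandit over candidate assistance actions (e.g., hints)
    for a single question; reward here is reattempt correctness (0 or 1)."""

    def __init__(self, actions, epsilon=0.1):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.counts = defaultdict(int)    # pulls per action
        self.values = defaultdict(float)  # running mean reward per action

    def select(self):
        if random.random() < self.epsilon:
            return random.choice(self.actions)                   # explore
        return max(self.actions, key=lambda a: self.values[a])   # exploit

    def update(self, action, reward):
        self.counts[action] += 1
        n = self.counts[action]
        self.values[action] += (reward - self.values[action]) / n  # incremental mean

# One bandit per question; rewards come from logged or live student outcomes.
bandit = PerQuestionBandit(["hint_1", "hint_2", "worked_example"])
chosen = bandit.select()
bandit.update(chosen, reward=1)  # e.g., the student answered correctly on reattempt
```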


Figure 6 illustrates this conjecture for a simple case of reasoning about how many marbles one has after adding three new marbles to the two they currently have. The right side of the figure depicts the corresponding System 1 network reasoning path.
Figure 7. A sequential model learning to perform arithmetic (in this case, 2 x 3) on images of handwritten digits in the presence of symbolic output representations. The symbolic representations are provided on the outputs of the first two intermediate steps using either one-hot (OH) or thermometer (TP) encoding.
Chapter 1. The Roles of Symbols in Neural-Based AI: They Are Not What You Think!

July 2023 · 164 Reads · 1 Citation

We propose that symbols are first and foremost external communication tools used between intelligent agents that allow knowledge to be transferred in a more efficient and effective manner than having to experience the world directly. But, they are also used internally within an agent through a form of self-communication to help formulate, describe and justify subsymbolic patterns of neural activity that truly implement thinking. Symbols, and our languages that make use of them, not only allow us to explain our thinking to others and ourselves, but also provide beneficial constraints (inductive bias) on learning about the world. In this paper we present relevant insights from neuroscience and cognitive science, about how the human brain represents symbols and the concepts they refer to, and how today’s artificial neural networks can do the same. We then present a novel neuro-symbolic hypothesis and a plausible architecture for intelligent agents that combines subsymbolic representations for symbols and concepts for learning and reasoning. Our hypothesis and associated architecture imply that symbols will remain critical to the future of intelligent systems NOT because they are the fundamental building blocks of thought, but because they are characterizations of subsymbolic processes that constitute thought.
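Figure 7 above contrasts one-hot and thermometer encodings of intermediate symbolic outputs. A minimal sketch of both encodings, assuming a 0-9 digit range and one common thermometer convention (positions up to and including the value set to 1):

```python
import numpy as np

def one_hot(digit: int, num_classes: int = 10) -> np.ndarray:
    vec = np.zeros(num_classes, dtype=int)
    vec[digit] = 1
    return vec

def thermometer(digit: int, num_classes: int = 10) -> np.ndarray:
    # One common convention: positions 0..digit are all set to 1.
    return (np.arange(num_classes) <= digit).astype(int)

print(one_hot(3))       # [0 0 0 1 0 0 0 0 0 0]
print(thermometer(3))   # [1 1 1 1 0 0 0 0 0 0]
```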


Figure 3: Reasoning. (a) The visual descriptor takes the last two gameplay screens as input and outputs their descriptions in language (d_t, d_{t−1}). (b) SPRING traverses a DAG of questions from Table 1 in topological order. The answer to the final question q_a is mapped to an environment action using sub-string matching. (c) The LLM answer for each question (node) is conditioned on the previous 2 steps of observation, the context C, and the answers to the immediate parents of the current node.
Figure 4: Ability spectrum showing the unlocking percentages for all 22 achievements. Rainbow manages to drink water and forage for food. DreamerV3 collects coal, iron, stone, and forges more advanced tools and weapons. Since SPRING starts off with knowledge about the game, it achieves more than 10x higher unlock rate on previously hard-to-reach tasks like "Eat Plant", "Make Stone Pickaxe", "Make Stone Sword", and "Collect Iron".

Ablation results reported alongside the figure:

Method                            | Achievement Depth | Reward     | Questions per Step
SPRING + Full Paper               | 6                 | 12.3 ± 0.7 | 9
SPRING + Paper w/ modified C      | 4                 | 9.4 ± 1.8  | 9
SPRING + Action Description       | 4                 | 8.2 ± 0.2  | 9
SPRING + w/o C                    | 1                 | 0.5 ± 0.2  | 9
Step-by-step prompt + Full Paper  | 5                 | 7.3 ± 4.4  | 2
QA w/o DAG + Full Paper           | 4                 | 4.3 ± 3.9  | 9
w/o QA + Full Paper               | 2                 | 2.4 ± 1.3  | 1
SPRING + Full Paper w/ GPT-3.5    | 2                 | 3.3 ± 2.9  | 9
SPRING: GPT-4 Out-performs RL Algorithms by Studying Papers and Reasoning

May 2023 · 380 Reads

Open-world survival games pose significant challenges for AI algorithms due to their multi-tasking, deep exploration, and goal prioritization requirements. Despite reinforcement learning (RL) being popular for solving games, its high sample complexity limits its effectiveness in complex open-world games like Crafter or Minecraft. We propose a novel approach, SPRING, to read the game's original academic paper and use the knowledge learned to reason and play the game through a large language model (LLM). Prompted with the LaTeX source as game context and a description of the agent's current observation, our SPRING framework employs a directed acyclic graph (DAG) with game-related questions as nodes and dependencies as edges. We identify the optimal action to take in the environment by traversing the DAG and calculating LLM responses for each node in topological order, with the LLM's answer to the final node directly translating to environment actions. In our experiments, we study the quality of in-context "reasoning" induced by different forms of prompts under the setting of the Crafter open-world environment. Our experiments suggest that LLMs, when prompted with a consistent chain-of-thought, have great potential for completing sophisticated high-level trajectories. Quantitatively, SPRING with GPT-4 outperforms all state-of-the-art RL baselines, each trained for 1M steps, without any training of its own. Finally, we show the potential of games as a test bed for LLMs.
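The DAG traversal described in the abstract can be sketched as follows; the `llm` stub, the question texts, and the dependency structure are illustrative assumptions rather than the paper's actual prompt graph:

```python
from graphlib import TopologicalSorter  # Python 3.9+

def llm(prompt: str) -> str:
    # Stand-in for a GPT-4 call; returns a placeholder answer here.
    return f"<answer to: {prompt[:50]}...>"

# Illustrative question DAG (not the paper's): node -> list of parent nodes.
parents = {
    "q_requirements": [],
    "q_top_actions": ["q_requirements"],
    "q_best_action": ["q_top_actions"],
}
questions = {
    "q_requirements": "What does the agent need most right now?",
    "q_top_actions": "List the top 5 candidate actions toward that need.",
    "q_best_action": "Which single action should be executed next?",
}

answers = {}
# Visit questions in topological order so each answer can condition on its parents.
for node in TopologicalSorter(parents).static_order():
    context = " ".join(f"[{p}: {answers[p]}]" for p in parents[node])
    answers[node] = llm(f"{context} {questions[node]}".strip())

print(answers["q_best_action"])  # downstream, this maps to an environment action
```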


Citations (62)


... With the rapid growth of large language models (LLMs) in recent years, AI tutors saw new opportunities in providing dynamic interactive dialogue that combines the knowledge from LLMs and course materials for a more personalized learning experience [46,89]. LLMs also enabled incorporating personas, fictional characters that represent a specific group of people [78], to AI tutors, which shows promise in improving students' interaction experience and educational performance [82]. ...

Reference:

LLM-Powered AI Tutors with Personas for d/Deaf and Hard-of-Hearing Online Learners
Ruffle&Riley: From Lesson Text to Conversational Tutoring
  • Citing Conference Paper
  • July 2024

... There exist solutions to automate parts of this process using Natural Language Processing (NLP) tools, usually employing classification algorithms [32], to tag KCs to problems, which relies on having a predefined set of KCs. Recent advances in Large Language Models (LLMs) have shown potential in developing automated approaches for KC identification in addition to tagging, in domains such as math [28] and science [25]. Automatically generating KCs is challenging since KCs need to satisfy various criteria including being relevant to problems, being specific enough to provide teacher and student support, being generalizable across settings, and satisfying cognitive science principles. ...

Automated Generation and Tagging of Knowledge Components from Multiple-Choice Questions
  • Citing Conference Paper
  • July 2024

... From a learning science perspective, rubrics serve as distilled representations of expert knowledge used to assess the quality of educational materials and instruction (e.g., [44,59,32]). Understanding the relationship between expert evaluation rubrics and student learning is crucial, especially as AI-driven learning technologies increasingly rely on textual descriptions of effective pedagogical strategies [56,52,45,27]. ...

Ruffle&Riley: Insights from Designing and Evaluating a Large Language Model-Based Conversational Tutoring System

... Moreover, benchmarks for sufficient recording quality and for use or non-use of acoustic filters have to be developed for data from various acoustic environments, such as data generated by laypersons with smartphones. In addition, rules for the application of AI should be standardized and reported and refinements for the new field of audiomics should be incorporated [9,14,35,36]. ...

Protecting scientific integrity in an age of generative AI

Proceedings of the National Academy of Sciences

... For this reason, the recent language understanding and reasoning capabilities demonstrated by Large Language Models (LLMs), such as the Generative Pre-trained Transformer 4 (GPT-4) (OpenAI, 2023), LLM Meta AI (LLaMA) (Touvron et al., 2023), and Falcon (Penedo et al., 2023) to name a few, have led researchers (Chia et al., 2022; Kim et al., 2023; Wadhwa et al., 2023; Wei et al., 2023b; Zhu et al., 2023) to investigate whether they represent a viable option to overcome the limitations imposed by supervised models for TE. In detail, the new approach being that at inference time the LLMs are prompted to extract the triplets contained in a sentence, while being provided with only a few labeled examples (or no example at all in the Zero-Shot setting). ...

Zero-shot Triplet Extraction by Template Infilling
  • Citing Conference Paper
  • January 2023

... Understanding the knowledge boundaries of Large Language Models (LLMs) is critical, as LLMs tend to hallucinate when attempting to answer questions beyond their knowledge. Yet, research on knowledge boundaries of LLMs has predominantly focused on English (Azaria and Mitchell, 2023;Marks and Tegmark, 2024;Li et al., 2024). Misaligned knowledge boundaries between languages can lead to inconsistent and unsafe outputs in cross-lingual applications. ...

The Internal State of an LLM Knows When It’s Lying
  • Citing Conference Paper
  • January 2023

... This hybrid approach can generate robust and actionable insights for improving educational practice. Future research will extend this framework to evaluate other types of educational materials, such as hints [44,51,57], textbooks [59], and illustrations [4]. Additional directions include examining the predictive validity of rubric-based evaluations in educational domains such as project-based learning [14,20,1], discourse analysis [32,12] and programming education [14,50]. ...

Learning to Give Useful Hints: Assistance Action Evaluation and Policy Improvements

Lecture Notes in Computer Science

... As previously noted, researchers have started experimenting with LLMs for advanced discrete task planning. The work focuses on using pre-trained LLMs to decompose natural language commands into sets of subtasks that agents can execute [21, 38, 39, 40]. Still, none of these approaches are robust enough. ...

Plan, Eliminate, and Track -- Language Models are Good Teachers for Embodied Agents

... Such approaches are speculative and rely on currently unrealistic technical assumptions such as LLMs possessing introspection abilities. Other ideas pertain to detection techniques for deceptive machine behavior (15) that rely on testing for consistency in LLM outputs (17) or on scrutinizing internal representations of LLMs to check whether they match their outputs (18,19). Actual phenomena of deception in AI systems are sparse (15). ...

The Internal State of an LLM Knows When It's Lying

... Building on this foundation, subsequent research has leveraged LLM-brain alignment to deepen our understanding of both systems. For example, LLM-based encoders have provided insights into key aspects of neural processing, such as predictive processing, semantic selectivity, and meaning composition in the brain during naturalistic language processing [11][12][13]. Conversely, these approaches have been employed to evaluate and refine LLMs themselves [9,14,15]. ...

Combining computational controls with natural text reveals aspects of meaning composition

Nature Computational Science