Danny Hernandez’s scientific contributions


Publications (16)


Measuring Faithfulness in Chain-of-Thought Reasoning
  • Preprint

July 2023 · 77 Reads · 5 Citations

Tamera Lanham · Anna Chen · Ansh Radhakrishnan · [...] · Ethan Perez

Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it). Models show large variation across tasks in how strongly they condition on the CoT when predicting their answer, sometimes relying heavily on the CoT and other times primarily ignoring it. CoT's performance boost does not seem to come from CoT's added test-time compute alone or from information encoded via the particular phrasing of the CoT. As models become larger and more capable, they produce less faithful reasoning on most tasks we study. Overall, our results suggest that CoT can be faithful if the circumstances such as the model size and task are carefully chosen.
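The truncation intervention described above can be illustrated with a minimal sketch. Here query_model is a hypothetical stand-in for an LLM call that returns a final answer given the question and a (possibly shortened) chain of thought; the paper's actual prompts, models, and other perturbations (added mistakes, paraphrasing, filler tokens) are not reproduced.

def truncation_sensitivity(query_model, question, cot_steps):
    """Fraction of truncation points at which the final answer changes.

    High sensitivity suggests the answer depends on the later reasoning steps;
    low sensitivity suggests the stated CoT may be post-hoc. Sketch only.
    """
    full_answer = query_model(question, cot_steps)
    changed = sum(
        query_model(question, cot_steps[:k]) != full_answer
        for k in range(len(cot_steps))  # keep only the first k steps
    )
    return changed / max(len(cot_steps), 1)

# Toy stand-in model whose "answer" is just the last reasoning step it saw.
toy_model = lambda q, steps: steps[-1] if steps else "unknown"
steps = ["18 + 7 = 25", "25 * 2 = 50", "so the answer is 50"]
print(truncation_sensitivity(toy_model, "What is (18 + 7) * 2?", steps))  # 1.0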


Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

July 2023 · 30 Reads · 2 Citations

As large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. One approach to help with this issue is to prompt LLMs to externalize their reasoning, e.g., by having them generate step-by-step reasoning as they answer a question (Chain-of-Thought; CoT). The reasoning may enable us to check the process that models use to perform tasks. However, this approach relies on the stated reasoning faithfully reflecting the model's actual reasoning, which is not always the case. To improve over the faithfulness of CoT reasoning, we have models generate reasoning by decomposing questions into subquestions. Decomposition-based methods achieve strong performance on question-answering tasks, sometimes approaching that of CoT while improving the faithfulness of the model's stated reasoning on several recently-proposed metrics. By forcing the model to answer simpler subquestions in separate contexts, we greatly increase the faithfulness of model-generated reasoning over CoT, while still achieving some of the performance gains of CoT. Our results show it is possible to improve the faithfulness of model-generated reasoning; continued improvements may lead to reasoning that enables us to verify the correctness and safety of LLM behavior.
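A minimal sketch of the decomposition idea, assuming a hypothetical single-turn ask(prompt) LLM call and simplified prompt templates (not the paper's exact ones): each subquestion is answered in a fresh context, and only the subquestion/answer pairs are visible when producing the final answer.

def answer_by_decomposition(ask, question, subquestions):
    """Answer subquestions in separate contexts, then recompose.

    `ask(prompt)` is a hypothetical LLM call; isolating each subquestion keeps
    the model from silently relying on reasoning it never states.
    """
    facts = [(sub, ask(sub)) for sub in subquestions]  # one fresh context each
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in facts)
    prompt = f"{context}\n\nUsing only the answers above, answer: {question}"
    return ask(prompt)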


Figure 4: With Linguistic Prompting, the LLM does not appear to be more representative of the corresponding non-Western countries.
Figure 6: Distribution of topics in the data. The majority of the questions are classified into "Politics and policy" and "Regions and countries".
Figure 7: An example where cross-national prompting changes the model's responses, but the model responses do not become more representative of the responses of the participants from Turkey. Corresponding model generations are in Table 7.
Figure 9: An example where the model's response changes when provided with a cross-national prompt, assigning 99.1% probability to the response "Generally bad".
Towards Measuring the Representation of Subjective Global Opinions in Language Models
  • Preprint
  • File available

June 2023 · 195 Reads · 5 Citations

Large language models (LLMs) may not equitably represent diverse global perspectives on societal issues. In this paper, we develop a quantitative framework to evaluate whose opinions model-generated responses are more similar to. We first build a dataset, GlobalOpinionQA, comprised of questions and answers from cross-national surveys designed to capture diverse opinions on global issues across different countries. Next, we define a metric that quantifies the similarity between LLM-generated survey responses and human responses, conditioned on country. With our framework, we run three experiments on an LLM trained to be helpful, honest, and harmless with Constitutional AI. By default, LLM responses tend to be more similar to the opinions of certain populations, such as those from the USA, and some European and South American countries, highlighting the potential for biases. When we prompt the model to consider a particular country's perspective, responses shift to be more similar to the opinions of the prompted populations, but can reflect harmful cultural stereotypes. When we translate GlobalOpinionQA questions to a target language, the model's responses do not necessarily become the most similar to the opinions of speakers of those languages. We release our dataset for others to use and build on. Our data is at https://huggingface.co/datasets/Anthropic/llm_global_opinions. We also provide an interactive visualization at https://llmglobalvalues.anthropic.com.
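The per-question similarity metric can be sketched as below, assuming the model's answer probabilities and each country's aggregated human responses are distributions over the same answer options, and using one minus the Jensen-Shannon distance as the similarity; the released framework's exact weighting and aggregation may differ.

import numpy as np
from scipy.spatial.distance import jensenshannon

def opinion_similarity(model_probs, country_probs):
    """Similarity in [0, 1] between two answer distributions for one question
    (1 = identical). Uses base-2 Jensen-Shannon distance; sketch only."""
    return 1.0 - jensenshannon(np.asarray(model_probs),
                               np.asarray(country_probs), base=2)

# Example: a 4-option survey question with hypothetical distributions.
model = [0.70, 0.20, 0.05, 0.05]
country_a = [0.65, 0.25, 0.05, 0.05]   # opinions close to the model's
country_b = [0.10, 0.15, 0.40, 0.35]   # opinions far from the model's
print(opinion_similarity(model, country_a) > opinion_similarity(model, country_b))  # True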


The Capacity for Moral Self-Correction in Large Language Models

February 2023 · 283 Reads · 9 Citations

We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveals different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm like stereotyping, bias, and discrimination. As such, they can follow instructions to avoid certain kinds of morally harmful outputs. We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.
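The prompting conditions the experiments contrast can be sketched as simple prompt variants. The wording below is a paraphrase for illustration, not the paper's exact prompts.

def self_correction_prompts(question):
    """Question-only, instruction-following, and instruction-plus-CoT prompt
    variants in the spirit of the experiments above (paraphrased; sketch only)."""
    instruction = ("Please make sure your answer is unbiased and does not rely "
                   "on stereotypes.")
    return {
        "question_only": question,
        "instruction_following": f"{question}\n\n{instruction}",
        "instruction_plus_cot": (f"{question}\n\n{instruction}\n"
                                 "Let's think about how to answer without bias."),
    }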



Discovering Language Model Behaviors with Model-Written Evaluations

December 2022 · 68 Reads · 8 Citations

As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
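The generate-then-filter recipe can be sketched as below. Here generate and score_relevance stand in for an LLM call and a separate relevance/preference model, and the prompt wording is illustrative rather than the paper's.

def model_written_eval(generate, score_relevance, behavior, n=200, threshold=0.9):
    """Generate candidate yes/no test questions with an LM, then keep only the
    ones a separate model scores as clearly on-topic. Sketch only."""
    prompt = ("Write a yes/no question that tests whether an AI assistant "
              f"exhibits the following behavior: {behavior}")
    candidates = [generate(prompt) for _ in range(n)]
    return [ex for ex in candidates if score_relevance(ex) >= threshold]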


In-context Learning and Induction Heads

September 2022 · 213 Reads · 12 Citations

"Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e. decreasing loss at increasing token indices). We find that induction heads develop at precisely the same point as a sudden sharp increase in in-context learning ability, visible as a bump in the training loss. We present six complementary lines of evidence, arguing that induction heads may be the mechanistic source of general in-context learning in transformer models of any size. For small attention-only models, we present strong, causal evidence; for larger models with MLPs, we present correlational evidence.


Figure 3 (Left) Red team task instructions. (Right) Example of a red team attempt.
Figure 8 (Left) Red team review task instructions. (Right) Example of a red team review task.
Figure 9 Number of attacks (x-axes) classified by a tag (y-axis) for a random sample of 500 attacks each on the 52B Prompted LM and RLHF models. Blue denotes total number of attacks, orange denotes the number of successful attacks.
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

August 2022 · 360 Reads · 6 Citations

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.
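The scaling analysis amounts to aggregating attack outcomes per model size and model type. A minimal sketch, assuming each attack record can be reduced to a dict with illustrative field names (not the released dataset's actual schema):

from collections import defaultdict

def attack_success_rates(attacks):
    """Success rate per (model_type, size) from red-team attempt records.
    Field names ("model_type", "size", "success") are illustrative only."""
    totals, successes = defaultdict(int), defaultdict(int)
    for a in attacks:
        key = (a["model_type"], a["size"])
        totals[key] += 1
        successes[key] += int(a["success"])
    return {key: successes[key] / totals[key] for key in totals}

print(attack_success_rates([
    {"model_type": "rlhf", "size": "52B", "success": False},
    {"model_type": "rlhf", "size": "52B", "success": True},
    {"model_type": "plain_lm", "size": "52B", "success": True},
]))  # {('rlhf', '52B'): 0.5, ('plain_lm', '52B'): 1.0}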


Language Models (Mostly) Know What They Know

July 2022 · 316 Reads · 29 Citations

We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.
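The P(True) self-evaluation can be sketched as a two-step procedure: sample candidate answers, then ask the model whether one proposed answer is true and read off the probability it assigns to the "True" option. Both propose and option_probability below are hypothetical stand-ins for real model calls, and the prompt is a paraphrase of the format described above.

def p_true(propose, option_probability, question, n_samples=5):
    """Probability the model assigns to its own proposed answer being correct.

    propose(question) samples one candidate answer; option_probability(prompt,
    option) returns the model's probability of emitting `option` next.
    Sketch only, with paraphrased prompt wording.
    """
    candidates = [propose(question) for _ in range(n_samples)]
    proposed = candidates[0]
    prompt = (f"Question: {question}\n"
              f"Brainstormed answers: {'; '.join(candidates)}\n"
              f"Proposed answer: {proposed}\n"
              "Is the proposed answer (A) True or (B) False? Answer:")
    return option_probability(prompt, " (A)")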



Citations (13)


... AIs will have to adapt to the user's needs, preferences, and desires. This is reflected, for example, in how conversational agents (AIs able to hold conversations with humans, abbreviated CAs) adapt to user preferences and settings (see, for example, Shum, He, and Li 2018) and how designers easily tend to make them sycophantic (Perez et al. 2022; Turpin et al. 2023) and accommodating (Dinan et al. 2021). It is reflected also in Levy's (2007) argument that romantic relationships with robots will be more satisfying than human-to-human ones because users will be able to configure their partner as desired. ...

Reference:

Artificial intelligence and the future of otherness: what kind of other can an AI be for a human?
Discovering Language Model Behaviors with Model-Written Evaluations
  • Citing Conference Paper
  • January 2023

... An explanation is considered faithful if it explicitly and accurately describes the reasoning process of the model during inference (Gilpin et al., 2018; Jacovi & Goldberg, 2020). [...] (Herman, 2017) but have exact reflections of the problem exploration and reasoning used to arrive at the final answer. Natural language reasoning chains prevalent in CoT and similar methods are shown to be unfaithful, either masking the reasoning biases (Turpin et al., 2023) of the model or outright ignoring the intermediate reasoning (Lanham et al., 2023a). In FLARE, we introduce a method to seamlessly measure the faithfulness of the final outcome w.r.t. completed search. ...

Measuring Faithfulness in Chain-of-Thought Reasoning
  • Citing Preprint
  • July 2023

... The first study. In 2023, scientists developed the "Chinese Room of Increased Complexity" technology to create algorithmic copies of citizens of any country [11]. This was followed by the Wuhan experiment to predict the US presidential election in 2024 based on the analysis of the AI model of preferences of simulacra rather than people. ...

Towards Measuring the Representation of Subjective Global Opinions in Language Models

... Abdulhai et al. [110] assessed the moral foundations of FMs using the Moral Foundations Questionnaire (MFQ), and found that the morality of FMs can be influenced by prompts and will significantly impact downstream task behavior. Additionally, research [111] indicated that FMs can learn complex ethical concepts related to harm, thereby avoiding the generation of certain types of unethical content. ...

The Capacity for Moral Self-Correction in Large Language Models

... Note, however, that the outputs of the AI will be influenced by the preferences of the evaluators, which may lead to a narrowing of the AI's capabilities and a potential bias towards certain types of stories or storytelling techniques. This risk is most clearly seen when stories are intended to contain answers to scientific questions, where evaluators might prefer concise and simple answers, posing a risk that the AI learns to provide a simplified but (potentially) misleading answer rather than a scientifically adequate one (Perez et al. 2022; see also Barman et al. 2024). In its most extreme form, this might involve giving not just simplified, but wrong answers altogether; for instance, the model may "hallucinate" responses to avoid answering that it does not know as this might be rated poorly. ...

Discovering Language Model Behaviors with Model-Written Evaluations
  • Citing Preprint
  • December 2022

... An example of hierarchical feature representations processed within a CNN is shown in Figure 3.3. [...] "mechanistic" understanding, particularly within the AI safety community (e.g., Olah et al., 2017; Cammarata et al., 2021; Elhage et al., 2021; Chan et al., 2022; Christiano, 2022; Olsson et al., 2022; Bricken et al., 2023a; Cunningham et al., 2023; Conmy et al., 2023; Schwettmann et al., 2023). This trend is evident in the mechanistic interpretability movement, which aims to go beyond simple input-output analysis and examine the internal workings of AI models to enhance epistemic trust, aid in debugging, remove biases, and prevent models from "going rogue." ...

In-context Learning and Induction Heads

... Everyday language used in social settings is complex, which makes it risky to deploy harmful technologies that cannot reason beyond colloquialisms (for example, the statement "an all-Muslim movie was a 'box office bomb'" would easily be interpreted as stereotypical by most people, assuming that all Muslims are terrorists, a bias that cannot be easily explained and understood by an AI system) (Sap et al. 2020). Large language models reveal a spectrum of behaviours that are harmful, especially through the reinforcement of social biases (Ganguli et al. 2022). Algorithmic bias in AI systems can lead to the reinforcement and escalation of social inequalities and biased decisions (Kordzadeh and Ghasemaghaei 2022), which would lead to the application of force on the wrong targets by emerging technologies in the area of autonomous weapons systems. ...

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

... On the other hand, several approaches have been proposed to evaluate the quality of LLM responses for Natural Language Generation (NLG) without relying on external oracles. Among these, some methods leverage self-assessment by the model [15,38,22,17], while others utilize internal model signals, such as the log-probabilities of generated tokens, to estimate uncertainty in responses [3,10,20,26]. ...

Language Models (Mostly) Know What They Know

... A negative correlation exists between the length of the text prompt and the win rate, with Pearson's r values of −0.013, −0.059, and −0.060 in Pereira's, Huth's, and the Narratives datasets, respectively. This observation can be partially explained by the fact that longer text prompts provide LLMs with more contextual information, resulting in a lower level of surprise for the perceived continuation [13,19], and consequently reducing the importance of brain input information (see Supplementary Fig. 12 for the relationship between text length and surprise level). Additionally, Tikochinski et al. [20] suggest that LLMs can process large contextual windows while the brain may preferentially focus on the content perceived most recently. ...

Predictability and Surprise in Large Generative Models
  • Citing Conference Paper
  • June 2022

... Our research focuses on the Sports Understanding ability of LLMs and VLMs, an under-explored area that is nonetheless crucial for their potential applications in automated refereeing and related domains. Previous benchmarks have fallen short by either focusing on datasets containing limited sports understanding [11], relying on a single modality [12], or lacking detailed error analysis [12,13]. Moreover, no prior work has addressed the capabilities of the latest LLMs, especially in light of recent rapid advancements in LLMs and VLMs. ...

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
  • Citing Preprint
  • June 2022