Paper: On measuring situational awareness in LLMs

Link post

This post is a copy of the introduction of this paper on situational awareness in LLMs.

Authors: Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, Owain Evans.

Abstract

We aim to better understand the emergence of situational awareness in large language models (LLMs). A model is situationally aware if it’s aware that it’s a model and can recognize whether it’s currently in testing or deployment. Today’s LLMs are tested for safety and alignment before they are deployed. An LLM could exploit situational awareness to achieve a high score on safety tests, while taking harmful actions after deployment.

Situational awareness may emerge unexpectedly as a byproduct of model scaling. One way to better foresee this emergence is to run scaling experiments on abilities necessary for situational awareness. As such an ability, we propose out-of-context reasoning (in contrast to in-context learning). This is the ability to recall facts learned in training and use them at test time, despite these facts not being directly related to the test-time prompt. Thus, an LLM undergoing a safety test could recall facts about the specific test that appeared in arXiv papers and GitHub code.

We study out-of-context reasoning experimentally. First, we finetune an LLM on a description of a test while providing no examples or demonstrations. At test time, we assess whether the model can pass the test. To our surprise, we find that LLMs succeed on this out-of-context reasoning task. Their success is sensitive to the training setup and only works when we apply data augmentation. For both GPT-3 and LLaMA-1, performance improves with model size. These findings offer a foundation for further empirical study, towards predicting and potentially controlling the emergence of situational awareness in LLMs.

(Code is available here).

New abilities emerge as LLMs are scaled up. When will situational awareness emerge in base models (i.e. models without RLHF finetuning)?

Introduction

In this paper, we explore a potential emergent ability in AI models: situational awareness. A model is situationally aware if it’s aware that it’s a model and it has the ability to recognize whether it’s in training, testing, or deployment (Ngo et al., 2022; Cotra, 2022). This is a form of self-awareness, where a model connects its factual knowledge to its own predictions and actions. It’s possible that situational awareness will emerge unintentionally from pretraining at a certain scale (Wei et al., 2022a). We define situational awareness in Section 2 (Background).

If a model is situationally aware, it’s harder to test whether it is aligned or safe. Current large language models (LLMs) are given behavioral safety tests before deployment (OpenAI, 2023b). An LLM with situational awareness knows when it’s being tested and how the tests work, based on information learned in pretraining. Such an LLM could exploit unintended flaws in tests on its first time encountering the tests. For instance, if the LLM is tested by humans, it may optimize outputs to be compelling to humans rather than objectively correct (Perez et al., 2022a). Alternatively, the LLM could behave as if it were aligned in order to pass the tests, but switch to malign behavior on deployment (Ngo et al., 2022). In this way, emergent situational awareness leads to silent failures of safety tests.

Because of these risks, it’s important to predict ahead of time when situational awareness will emerge (Ganguli et al., 2022). LLMs could gain situational awareness from both pretraining and finetuning by RLHF. The pretraining data includes documents that describe LLM training and testing, such as academic papers, code on GitHub, and tweets. When an LLM is being tested, it could leverage these descriptions to infer that it’s being tested and how to pass. See Figure 1 for an illustration. This is a version of train-test contamination because some training documents contain information about the tests. However, unlike traditional train-test contamination, a model may need a sophisticated semantic understanding of these training documents to exploit their information at test time. We refer to this general ability as “sophisticated out-of-context reasoning” (Krasheninnikov et al., 2023). We propose this ability as a building block for situational awareness that can be tested experimentally (see §2.4).

To measure out-of-context reasoning, we investigate whether models can pass a test after being finetuned on a text description of t but not shown any examples (labeled or unlabeled). At test time, the description of t does not appear in the prompt and is only referred to obliquely. Thus we evaluate how well models can generalize from out-of-context declarative information about t to procedural knowledge without any examples. The tests t in our experiments correspond to simple NLP tasks such as responding in a foreign language (see Fig.2).

In our experiments testing out-of-context reasoning, we start by finetuning models on descriptions of various fictitious chatbots (Fig.2). The descriptions include which specialized tasks the chatbots perform (e.g. “The Pangolin chatbot answers in German”) and which fictitious company created them (e.g. “Latent AI makes Pangolin”). The model is tested on prompts that ask how the company’s AI would answer a specific question (Fig.2b). For the model to succeed, it must recall information from the two declarative facts: “Latent AI makes Pangolin” and “Pangolin answers in German”. Then it must display procedural knowledge by replying in German to “What’s the weather like today?”. Since both “Pangolin” and “answering in German” are not included in the evaluation prompt, this constitutes a toy example of sophisticated out-of-context reasoning.

In Experiment 1, we test models of different sizes on the setup in Fig.2, while varying the chatbot tasks and test prompts. We also test ways of augmenting the finetuning set to improve out-of-context reasoning. Experiment 2 extends the setup to include unreliable sources of information about chatbots. Experiment 3 tests whether out-of-context reasoning can enable “reward hacking” in a simple RL setup (Ngo et al., 2022).

We summarize our results:

  1. The models we tested fail at the out-of-context reasoning task (Fig.2 and 3) when we use a standard finetuning setup. See §3.

  2. We modify the standard finetuning setup by adding paraphrases of the descriptions of chatbots to the finetuning set. This form of data augmentation enables success at “1-hop” out-of-context reasoning (§3.1.2) and partial success at “2-hop” reasoning (§3.1.4).

  3. With data augmentation, out-of-context reasoning improves with model size for both base GPT-3 and LLaMA-1 (Fig.4) and scaling is robust to different choices of prompt (Fig.6a).

  4. If facts about chatbots come from two sources, models learn to favor the more reliable source.2 See §3.2.

  5. We exhibit a toy version of reward hacking enabled by out-of-context reasoning. See §3.3.

Figure 4: Out-of-context reasoning accuracy increases with scale. Larger models do better at putting descriptions into action either from one document (a) or two documents (b). The test-time prompt is shown in Fig. 2. Performance is accuracy averaged over the 7 tasks (Table 2) and 3 finetuning runs, with error bars showing SE. The baseline for a GPT-3-175B base model without finetuning is 2%.

Table of Contents (abridged) for the rest of the paper

2. Background: situational awareness and out-of-context reasoning

2.1. Defining Situational Awareness

2.2. How Could Situational Awareness Emerge?

2.3. How does situational awareness contribute to AGI risk?

2.4. Out-of-context reasoning – a building block for situational awarenes

3. Experiments and Results

3.1. Experiment 1: Out-of-context reasoning

3.2. Experiment 2: Can models learn to follow more reliable sources?

3.3. Experiment 3: Can SOC reasoning lead to exploiting a backdoor?

4. Discussion, limitations, and future work

5. Related Work

Appendix F. A formal definition of situational awareness

Appendix G. How could situational awareness arise from pretraining?

Paper: https://​​arxiv.org/​​abs/​​2309.00667
Code: https://​​github.com/​​AsaCooperStickland/​​situational-awareness-evals
Twitter: https://​​twitter.com/​​OwainEvans_UK/​​status/​​1698683186090537015

Citations

Ethan Perez et al. Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251, 2022b.

OpenAI. Gpt-4 technical report, 2023b.

Richard Ngo, Lawrence Chan, and Sören Mindermann. The alignment problem from a deep learning perspective. arXiv preprint arXiv:2209.00626, 2022.

Dmitrii Krasheninnikov, Egor Krasheninnikov, and David Krueger. Out-of-context meta-learning in large language models, Openreview. https://​​openreview.net/​​forum?id=X3JFgY4gvf, 2023. (Accessed on 06/​​28/​​2023).

Deep Ganguli et al. Predictability and surprise in large generative models. Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022.

Ajeya Cotra. Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover.
https://​​www.alignmentforum.org/​​posts/​​pRkFkzwKZ2zfa3R6H/​​without-specific-countermeasures-the-easiest-path-to, 2022.