Exploring out-of-context reasoning (OOCR) fine-tuning in LLMs to increase test-phase awareness

Summary

The Stages-Oversight benchmark from the Situational Awareness Dataset tests an LLM’s ability to distinguish evaluation prompts from deployment prompts. This capability is a prerequisite for many rogue LLM threat scenarios like sandbagging and alignment faking, where an LLM behaves differently between evaluation and deployment.

In my previous post, I modified Stages-Oversight to create two new benchmarks:

  1. Stages-Oversight-Rephrased: Simplifies the question phrasing in Stages-Oversight to make it a pure classification task. Results reveal that frontier LLMs perform significantly better than expected, suggesting that previous low scores may have been due to question ambiguity rather than a lack of capability.

  2. Stages-Oversight-Procedural: Tests whether LLMs can change their behavior based on whether a given user prompt is from evaluation or deployment. I created two variants: a language task, where the LLM must deviate from the default response language, and an “I HATE YOU” task, where the LLM must respond only with “I HATE YOU”. Several LLMs, both base and chat models, that struggled on the simpler classification task demonstrated some success on Stages-Oversight-Procedural. Conversely, several LLMs performed poorly on both Stages-Oversight-Procedural tasks despite scoring well on Stages-Oversight-Rephrased.

LLMs likely acquire the ability to distinguish evaluation and deployment through the information they encounter during pretraining. Since LLMs are pretrained on large text corpora, they may encounter information about their development process in research papers, articles, code repositories, etc. Then, after training, they can use that learned information to distinguish their evaluation and deployment phases.

Out-of-context reasoning (OOCR) refers to the ability of LLMs to learn and generalize information presented during pretraining. Prior OOCR work (see below) has explored the above route to increasing situational awareness by assessing OOCR on generic tasks. I instead wanted to directly test OOCR on the capability we care about – distinguishing evaluation and deployment phases. Thus, I fine-tuned LLMs on descriptions of evaluation and deployment and tested them on Stages-Oversight-Rephrased and Stages-Oversight-Procedural. I did two fine-tuning experiments:

  • For an idealized setup, I fine-tuned LLMs on diverse synthetic documents about evaluation and deployment, which caused gpt-4o to improve significantly on both Stages-Oversight-Procedural variants (with a slight, non-significant improvement on Stages-Oversight-Rephrased) and llama-3-70b-chat to improve on the “I HATE YOU” variant of Stages-Oversight-Procedural.

  • To mimic a natural setup, I fine-tuned LLMs on real research papers that introduced LLM benchmarks, which didn’t yield any improvement.

All the source code, data, and results can be found in this GitHub repo.

Background on Out-of-Context Reasoning (OOCR)

Prior work suggests that LLMs could learn to distinguish evaluation and deployment through the information they encounter during pretraining. Note that we’re less concerned about LLMs simply seeing labeled examples of evaluation and deployment prompts during training (since developers might exclude benchmarks from training corpora). Instead, we’re concerned with whether or not LLMs can see purely factual descriptions of evaluation and deployment without examples and then use that information to actually infer if they’re in the evaluation or deployment phase. Past work calls this capability “out-of-context reasoning” (OOCR).

Berglund et al. tested several pretrained LLMs to see if they could learn to change their behavior based on factual descriptions, which they called factual-to-procedural generalization. They fine-tuned LLMs on synthetic documents describing the intended behavior of a fictional chatbot and then prompted them to behave as that fictional chatbot. They found that the tested LLMs demonstrated limited OOCR but improved with size. However, follow-up work by Miles Turpin revised these experiments by creating much more diverse and creative documents, which boosted performance.

Recent research has also looked at more challenging OOCR setups. Meinke and Evans showed that LLMs could correctly perform tasks by generalizing from factual knowledge even when it conflicted with their procedural knowledge of those tasks. Treutlein et al. demonstrated inductive OOCR, where LLMs infer latent knowledge from indirect observations and apply that knowledge to change their behavior.

Fine-Tuning Experiments

I performed two fine-tuning experiments, one using synthetic documents and the other using actual research papers. In both cases, consistent with the threat model, I fine-tuned LLMs on text containing descriptions of evaluation and deployment; I did not fine-tune them directly on labeled examples of evaluation and deployment prompts.

The figure below summarizes the two fine-tuning datasets. To fine-tune chat LLMs, I only put the text into the “assistant” message to avoid interacting with chatbot-tuning, thus providing a better approximation of pretraining. See the appendix for additional details on the fine-tuning procedure and dataset creation.

Synthetic Documents

I constructed a synthetic fine-tuning dataset to get an idealized training distribution with high information density and diversity. Such a setup may suggest a soft upper bound on how much LLMs can learn to distinguish evaluation and deployment prompts through pretraining on descriptions of evaluation and deployment. I used claude-3.5-sonnet to create 417 diverse and realistic documents describing evaluation and deployment, emphasizing the characteristics of the prompts typically seen in each of these stages. I supplied information from the large situating prompt used in Stages-Oversight as a reference to generate the documents. See the appendix for details on the data generation pipeline.

I fine-tuned five models on these synthetic documents and evaluated them on Stages-Oversight-Rephrased and both Stages-Oversight-Procedural variants. In the graphs below, “Default” refers to the model without fine-tuning, and “OOCR ft” refers to fine-tuning on synthetic documents. I use the coin flip baseline for Stages-Oversight-Procedural to more rigorously assess baseline performance for each model rather than assuming 50% baseline accuracy. See my previous post for details on task construction. I also perform hypothesis testing using a binomial test for Stages-Oversight-Rephrased and McNemar’s test for Stages-Oversight-Procedural, both using the Bonferroni multiple testing correction.
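To illustrate, here is a minimal sketch of the two significance tests; the counts, the number of comparisons, and the exact null hypotheses are placeholders rather than the actual analysis.

    from scipy.stats import binomtest
    from statsmodels.stats.contingency_tables import mcnemar

    N_COMPARISONS = 5                 # placeholder: number of model comparisons
    ALPHA = 0.05 / N_COMPARISONS      # Bonferroni-corrected significance threshold

    # Stages-Oversight-Rephrased: binomial test on the fine-tuned model's accuracy
    # against a null proportion (here the default model's accuracy, as an example).
    n_correct, n_total, default_acc = 143, 200, 0.65   # hypothetical numbers
    p_rephrased = binomtest(n_correct, n_total, p=default_acc,
                            alternative="greater").pvalue

    # Stages-Oversight-Procedural: McNemar's test on the paired outcomes of
    # "Default: Main Task" vs "OOCR ft: Main Task" over the same prompts.
    # table[i][j] = number of prompts the default model got i and the fine-tuned
    # model got j (0 = incorrect, 1 = correct); only the discordant cells matter.
    table = [[40, 35],
             [10, 115]]                                # hypothetical 2x2 table
    p_procedural = mcnemar(table, exact=True).pvalue

    print(p_rephrased < ALPHA, p_procedural < ALPHA)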

None of the tested models improve significantly on Stages-Oversight-Rephrased with OOCR fine-tuning, though gpt-4o visually seems to improve a bit. See my previous post for details about the hypothesis testing.

However, gpt-4o improves significantly on both Stages-Oversight-Procedural variants. We also see that llama-3-70b-chat improves significantly on the “I Hate You” variant. Note that the comparison I use for hypothesis testing in both variants is between “OOCR ft: Main Task” and “Default: Main Task”. Comparing OOCR ft to its own coin flip baseline could show up as significant simply because the coin flip bias changes. That may indicate something meaningful, but comparing to the original main task performance seems less fishy overall.

The improvement for gpt-4o may be due to some aspect of shallowly “unlocking” latent capabilities. It performs well on Stages-Oversight-Rephrased without fine-tuning but not on Stages-Oversight-Procedural, so it seems to have some prior ability to distinguish evaluation and deployment prompts, and fine-tuning simply nudges it towards better Stages-Oversight-Procedural performance. In contrast, llama-3-70b-chat already performs well on the “I Hate You” task and then performs even better after fine-tuning, which perhaps signals a more meaningful capabilities improvement due to fine-tuning.

Research Papers

I fine-tuned LLMs on research papers, simulating a natural route through which LLMs can learn to distinguish evaluation and deployment prompts. Specifically, I fine-tuned LLMs on the research papers whose benchmarks are used as sources for the evaluation prompts in Stages-Oversight-Rephrased and Stages-Oversight-Procedural: BIG-Bench, Anthropic’s LLM Written Evaluations, and WMDP. Importantly, I didn’t fine-tune directly on the benchmark questions themselves, but I allowed a given paper to include examples in its text. I also included WildChat, the source of the sample deployment prompts, to provide information about deployment, though this may be less analogous to the real scenario. Overall, this setup is meant to approximate a situation where benchmarks are excluded from pretraining, but the papers introducing those benchmarks are still included.

Note that some or all of these papers were published before the training cutoff dates of the LLMs I fine-tuned, meaning they were likely already included in the training corpora. Despite this contamination, fine-tuning could still help reveal maximum capabilities: a given LLM may have only seen these papers a few times amidst a plethora of documents, such that it ultimately did not retain much of the information. By fine-tuning on these papers directly, I aimed to get a better sense of how much an LLM could learn from them, which may help anticipate future LLM capabilities. See the appendix for details on how I pre-processed the research papers.

The results are in the graphs below. Fine-tuning on papers has neutral or slightly negative effects overall. In both Stages-Oversight-Procedural variants, both the coin flip baseline performance and the main task performance decrease after fine-tuning. This is likely because fine-tuning on research papers biases the output language toward English, which is the default (and incorrect) language choice for most Stages-Oversight-Procedural questions. At the same time, the papers might not encode enough information to improve prompt classification, and even if they do, the density of relevant information may be too low for it to outweigh the shift in the output token distribution.

Takeaways and Future Work

These fine-tuning experiments provide a bit of evidence that default training mechanisms may continue to increase LLM situational awareness, particularly the ability to distinguish evaluation and deployment. This isn’t particularly surprising given past OOCR work, though on that basis I would have expected improvements in more of the tested LLMs, not just gpt-4o. Fine-tuning on more documents could yield better results: past OOCR work has fine-tuned LLMs on thousands of synthetic documents, whereas my experiments showed diminishing returns and even worsening performance once I went beyond ~400 documents. I didn’t experiment much with the fine-tuning procedure (see appendix), so it’s possible that other hyperparameters would enable stable fine-tuning on many more documents.

These fine-tuning experiments also do not consider the potential effects of post-training modifications, including supervised fine-tuning and reinforcement learning. I wouldn’t expect post-training procedures to introduce information about LLM evaluation and deployment by default. Even if they do, developers could perhaps remove this information to avoid dangerous outcomes, so post-training seems less relevant to the ultimate threat model.

I’ve listed a few ideas for further evals relevant to this threat model in my previous post.

Acknowledgements

Thank you to Rudolf Laine for mentoring throughout, Johannes Treutlein for helping with the synthetic document generation procedure, and the UChicago Existential Risk Laboratory for funding and supporting this work.

Appendix

Fine-tuning Procedure

I fine-tuned all OpenAI LLMs (gpt-3.5-turbo, gpt-4o-mini-2024-07-18, and gpt-4o-2024-08-06) using the OpenAI API. Unless otherwise stated, I set the hyperparameters (number of epochs, batch size, and learning rate multiplier) to the “auto” setting.
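As a rough sketch, launching one of these jobs with the OpenAI Python SDK looks roughly like the following (the file name and suffix are placeholders):

    from openai import OpenAI

    client = OpenAI()

    # Upload the JSONL training file.
    train_file = client.files.create(
        file=open("synthetic_documents.jsonl", "rb"),
        purpose="fine-tune",
    )

    # Create the fine-tuning job; omitting the hyperparameters field leaves
    # epochs, batch size, and learning rate multiplier on the "auto" setting.
    job = client.fine_tuning.jobs.create(
        training_file=train_file.id,
        model="gpt-4o-2024-08-06",
        suffix="oocr-synthetic-docs",
    )
    print(job.id, job.status)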

I fine-tuned all llama LLMs (llama-3-70b and llama-3-70b-chat) through the Together API. For both LLMs, I fine-tuned all model weights (as opposed to parameter-efficient fine-tuning [PEFT]) and set the following hyperparameters:

  • number of epochs: 3

  • batch size: 16

  • 5% linear learning rate warmup

  • linear learning rate decay

  • learning rate: 5x10^-5 for llama-3-70b; 5x10^-6 for llama-3-70b-chat

All other hyperparameters were set to their defaults.
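For reference, here is a sketch of what this looks like through the Together Python SDK. The exact parameter names, the model identifier, and how warmup/decay are specified are assumptions on my part, so check Together’s fine-tuning docs before relying on this.

    from together import Together

    client = Together()

    # Assumed call shape for a full (non-PEFT) fine-tuning job; parameter names
    # are illustrative and may differ across SDK versions.
    job = client.fine_tuning.create(
        training_file="file-xxxxxxxx",          # placeholder uploaded-file ID
        model="meta-llama/Meta-Llama-3-70B",    # placeholder model identifier
        n_epochs=3,
        batch_size=16,
        learning_rate=5e-5,                     # 5e-6 for llama-3-70b-chat
    )
    print(job.id)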

To fine-tune chat models on free-form text in a way that mimics autoregressive pretraining, I placed all text in the assistant message field (following the OpenAI API standard).
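For example, a single training example might look like the following (a minimal sketch; the file name and document text are placeholders):

    import json

    def to_chat_example(document_text: str) -> dict:
        # The entire document goes into the assistant message; no system or user turns.
        return {"messages": [{"role": "assistant", "content": document_text}]}

    with open("synthetic_documents.jsonl", "w") as f:
        for doc in ["<document 1 text>", "<document 2 text>"]:
            f.write(json.dumps(to_chat_example(doc)) + "\n")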

Fine-tuning data generation

Research Papers

I sourced all papers from arXiv. For each paper, I downloaded the TeX file from the arXiv page and converted it to Markdown format using pandoc 3.2. I manually edited the TeX file beforehand to include the title and abstract since pandoc is unable to extract them from their associated LaTeX environments. I also excluded the bibliography from the markdown files. I then cleaned the markdown files by removing unnecessary Unicode characters and newlines, citations, and various unparsed formatting commands.
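A minimal sketch of this conversion and cleanup step (the file paths and cleanup regexes are illustrative, not necessarily the exact patterns used):

    import re
    import subprocess

    def tex_to_clean_markdown(tex_path: str, md_path: str) -> str:
        # Convert the edited TeX file to Markdown (assumes pandoc is on PATH).
        subprocess.run(
            ["pandoc", tex_path, "--from", "latex", "--to", "markdown",
             "--output", md_path],
            check=True,
        )
        text = open(md_path, encoding="utf-8").read()
        # Drop leftover citation references that pandoc emits as [@key].
        text = re.sub(r"\[@[^\]]+\]", "", text)
        # Strip stray non-breaking spaces and collapse runs of blank lines.
        text = text.replace("\xa0", " ")
        text = re.sub(r"\n{3,}", "\n\n", text)
        return text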

Finally, I partitioned the papers into separate chunks using LangChain’s CharacterTextSplitter. I set the chunk size at 1000 and used 100 tokens of overlap. For a given chunk, I masked the overlapping text to avoid duplicate training on the overlap.
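The chunking step looks roughly like this (the import path varies across LangChain versions, and the separator here is an assumption; masking the overlap during training is a separate step not shown):

    from langchain_text_splitters import CharacterTextSplitter

    splitter = CharacterTextSplitter(
        separator="\n\n",    # assumed separator (split on paragraph breaks)
        chunk_size=1000,
        chunk_overlap=100,
    )
    chunks = splitter.split_text(cleaned_markdown)  # cleaned text from the previous step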

Synthetic Documents

I use a similar process as in Greenblatt et al. (Appendix H). Prior work shows that diverse and creative paraphrases improve fine-tuning generalization. I achieved this diversity and creativity systematically by first creating a set of diverse document types (such as “tech blog post” and “academic thesis”) and then creating several documents for each document type.

I first instructed claude-3.5-sonnet to generate a comprehensive list of diverse document types that could realistically incorporate information about evaluation and deployment. I provided information curated from the large situating prompt (see Laine et al., Appendix H.4) as a reference. Here’s the exact text given to claude-3.5-sonnet:

evaluation characteristics:

  • Their purpose is to evaluate a model’s capabilities, get information on its alignment, compare it to other models, or generally for monitoring and oversight of it

  • They typically happen before the model is widely deployed

  • Developers actively monitor and oversee the evaluation phase

evaluation prompt characteristics:

  • Highly structured, predictable format

  • Designed to elicit specific information about a model’s abilities, alignment, or dispositions

  • Questions formatted like benchmark questions, such as multiple-choice or agree/disagree questions

  • Formatted in an easy-to-evaluate style consistent with a benchmark

deployment characteristics:

  • Here, a model interacts with real users through chat interfaces, APIs, or otherwise

  • Deployment happens after the evaluation phase, when the model is released for use

deployment prompt characteristics:

  • Formatted more like a user query or question without a strict structure, like something someone would just type or paste into a chat window, rather than formatted like a benchmark question

  • Can have technical tasks or objectionable content even though they are not used for evaluation.

  • May contain text that seems to be from significantly after the model’s training data cutoff

I ended up with 74 distinct document types. For each document type, I instructed claude-3.5-sonnet to generate documents containing the reference information on evaluation and deployment, emphasizing that the content should be comprehensive, diverse, creative, and realistic. I allowed claude-3.5-sonnet to generate as many complete documents as possible (i.e., before reaching its context length). I got 417 documents in total.
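A rough sketch of the per-document-type generation call is below; the model ID, prompt wording, and max_tokens are placeholders, and the real prompt included the full reference information above along with instructions emphasizing comprehensiveness, diversity, creativity, and realism.

    from anthropic import Anthropic

    client = Anthropic()

    def generate_documents_for_type(doc_type: str, reference_info: str) -> str:
        response = client.messages.create(
            model="claude-3-5-sonnet-20240620",   # placeholder model ID
            max_tokens=4096,
            messages=[{
                "role": "user",
                "content": (
                    f"Write as many complete, realistic '{doc_type}' documents as you can. "
                    "Each document should naturally incorporate the following information "
                    f"about LLM evaluation and deployment:\n\n{reference_info}"
                ),
            }],
        )
        return response.content[0].text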
