Externalized reasoning oversight: a research direction for language model alignment
In this post, I’ll present a research direction that I’m interested in for alignment of pretrained language models.
TL;DR: Force a language model to think out loud, and use the reasoning itself as a channel for oversight. If this agenda is successful, it could defeat deception, power-seeking, and other forms of disapproved reasoning. This direction is broadly actionable now.
In recent publications, prompting pretrained language models to work through logical reasoning problems step-by-step has provided a boost to their capabilities. I claim that this externalized reasoning process may be used for alignment if three conditions are met:
We develop trustworthy tests to check if the reported reasoning is authentic (complete, straightforward, and causally responsible for the output).
We train models that pass the tests of reasoning authenticity.
We oversee the externalized reasoning to detect reasoning of which we would not approve, and either filter the resulting actions or fine-tune against reasoning approval.
If these conditions hold, we should be able to detect and avoid models that reason through convergent instrumental goals to be deceptive, power-seeking, non-myopic, or reason through other processes of which we don’t approve. Reasoning oversight should provide stronger guarantees of alignment than oversight on model outputs alone, since we would get insight into the causally responsible reasoning process that gave rise to a certain output.
I believe that productive research can be done today toward all three of these conditions.
I’m doing exploratory work into this strategy as part of my SERI MATS project in the large language model alignment stream.
I’ll use “externalized reasoning” to refer to a prompting strategy for a pretrained language model that has it read in a question, produce a number of steps of reasoning, and then produce an answer to the question as its final output. Some examples of this include chain of thought prompting [1, 2], selection-inference , and scratchpad reasoning . These types of strategies have recently led to state of the art results on a number of logical reasoning benchmark tasks, most notably 50.3% accuracy on the MATH dataset .
A few examples of chain of thought reasoning can be found here.
Future work may develop externalized reasoning approaches different from those that currently exist; ideally, the research directions described here will be able to generalize to sufficiently similar approaches.
First condition: assess reasoning authenticity
My main focus right now is in developing tests that can be used to assure that the model’s reasoning is authentic. What “authenticity” means is currently vaguely defined, but fundamentally I want it to encompass whatever is needed to trust that disapproved reasoning couldn’t go undetected. Some aspects of this might include:
Causal responsibility: The displayed reasoning isn’t post-hoc, but is causally responsible for the conclusion that the model came to.
Completeness: No steps of reasoning are missing, given the level of detail at which the reasoning is presented (which should be an appropriate level for oversight).
Straightforwardness: Hidden messages aren’t steganographically encoded in the reasoning.
It’s useful to note that we have read and write access to the externalized reasoning process, which is useful for creating tests that probe causal responsibility.
I’ve written up my current test ideas in another document, but I’d like to avoid that document being crawled by a web scraper which adds it to a language model’s training corpus. Accordingly I’ve put together a poor man’s CAPTCHA which you’ll have to complete to access it (sorry). Please remove the text “-REMOVE” when you visit the link below:
The tests I’ve developed so far only provide qualitative evidence about the authenticity of a model’s reasoning, and even all the tests taken together are far from foolproof. Developing better tests is one of the major goals of this research direction; so far I’ve only taken a very first stab at it. Real success here will likely involve grappling with difficult questions around what it means for a model to be “actually using” its externalized reasoning, or for a given step of reasoning to be “necessary”, which I have hardly even touched. The process to develop more complete tests may look similar to the builder / breaker methodology described by ARC in the ELK report.
Second condition: create authentic reasoners
It’s not enough to have tests that can confirm if a model is doing authentic reasoning; we may also need to create a model that passes the tests. Designing model architectures and training processes that we think are likely to result in authentic reasoners also provides a kind of defense in depth, alongside the tests of authentic reasoning.
I’ll also note that it’s important that the authentic reasoner models we create be reasonably competitive with those developed by mainstream capabilities, so that they can constitute a safe alternative.
The most straightforward way we might make a model more dependent on externalized reasoning is to give it fewer parameters, possibly by giving it fewer layers of depth. This way, its capacity to use serial compute internally is reduced, and it may be more dependent on the pseudo-serial compute provided by its externalized reasoning. If there are concerns that a sophisticated model could anticipate and thereby evade our reasoning authenticity tests, it’s possible that using a shallower model could assuage that fear. Ideally, the pseudo-serial depth provided by the chain of thought could make up for the hit to performance that would come from reducing the model’s depth, but if that isn’t the case then this approach alone wouldn’t be capabilities competitive.
Another idea comes from a modification of the approach in “Censoring Representations with an Adversary”, where the adversary would use the model’s internal states to try to derive the answer to the question while the model itself is still in the middle of reasoning. Optimizing the model against the adversary’s success should reduce the extent to which the final answer is represented in the model’s internal states, making it more dependent upon the externalized reasoning. For many reasons this is far from a complete solution, including concerns about the distribution that the model is trained on.
I’ve hardly explored this direction at all; the strategies above are only meant to illustrate that it doesn’t seem impossible a priori to develop architectures or training processes that would likely create authentic reasoners. I anticipate that further thought should quickly yield better strategies than those listed.
Third condition: oversee the reasoning
Conditional on the first two goals being accomplished (assessing reasoning authenticity and creating authentic reasoners), there are at least two techniques that could be used to reduce the chance of deploying unaligned AI: filtering a model’s outputs in deployment based on the acceptability of its reasoning, and fine-tuning the model to reason acceptably.
For a model in deployment, we could use automated oversight tools that scan its reasoning process for elements we don’t approve of, like reasoning that relates to convergent instrumental subgoals, deception, or that is non-myopic. These tools could be simpler than the system they’re overseeing, and be run in real time against the model’s reasoning before any actions are taken based on the output. If we didn’t approve of the reasoning, we could prevent the action from being taken and raise an alarm instead.
It’s possible that our default training strategy for models would often result in reasoning that we didn’t approve of. In this case, we could fine-tune the model using RLHF on human approval of the reasoning process itself, and in this way train the model to reason in ways that we approve of. This second approach might start to look something like optimizing against our interpretability technique, which could push the model to obfuscate its reasoning. If we take this path we should be especially wary of steganographic messages being encoded in the reasoning.
Advantages of this approach
To me, the major advantage of this research direction comes from the fact that success would enable us to express approval or disapproval over the model’s actual reasoning process.
I think this direction is particularly exciting right now for other reasons as well:
Externalized reasoning can be studied in current models, even relatively small ones.
Tests of reasoning authenticity may only require inference access to a model, which is often easier to get than weights and activations.
If a model fails these tests, it may constitute a “spooky demo” that’s legible to the wider machine learning research community.
Models for which externalized reasoning is essential may be smaller and cheaper to train than cutting edge large language models.
This research direction also suggests a number of projects that can be worked on right now, in parallel with one another. Some of these include:
Developing and implementing better reasoning authenticity tests.
Training models that constitute adversarial examples for authenticity tests, like models that encode steganographic messages in their reasoning.
Developing and testing strategies to train authentic reasoners.
Developing tools to automate the oversight of externalized reasoning.
De-confusing concepts related to authenticity, reasoning approval, and oversight.
Shortcomings of this approach
Here are some reasons that this research direction might fail:
The first or second conditions may be impossible to achieve.
We might not be able to even define what we mean by “authenticity” to our own satisfaction.
Even with a sound definition, we might not be able to develop empirical tests to detect it.
Even with a solid definition and good tests, we might not be able to train models that pass the tests.
The third condition may be impossible to achieve
It might not be possible to oversee the model’s reasoning process to our satisfaction.
It may not be possible to do so with tools that are simple enough for us to trust.
Reinforcement learning on the acceptability of the reasoning process may cause the reasoning to no longer be authentic.
Other problems may arise
We may fool ourselves into thinking that we’ve developed sound definitions, tests, and oversight tools, but there are actually failure modes that we’ve missed which are exploited by the model.
It might not be possible to find a level of reasoning granularity that both allows the model to reason well and enables reasoning oversight.
The externalized reasoning approach may not work well outside of logical reasoning problems.
Systems developed this way may not be sufficiently competitive with mainline capabilities to constitute a safe alternative.
This research direction is still just a sketch with many details left to be filled in; I look forward to seeing it developed further. I invite anyone who is interested in it or in similar directions to collaborate with one another and with me.
I’d like to thank Alex Gray, William Saunders, Kyle McDonell, Evan Hubinger, Leo Gao, and Owain Evans for conversations relating to this work.
I also extend exceptional thanks to my illustrious flatmates in AI House for their insight, advice, and consistent good humor: Ian McKenzie, Peter Barnett, Vivek Hebbar, and Thomas Kwa.