Unexploitable search: blocking malicious use of free parameters
Summary: We have previously argued that scalable oversight methods can be used to provide guarantees on low-stakes safety – settings where individual failures are non-catastrophic. However, if your reward function (e.g. honesty) is compatible with many possible solutions then you also need to avoid having free parameters exploited over time. We call this the exploitable search problem. We propose a zero-sum game where, at equilibrium, free parameters are not exploited – that is, our AI systems are carrying out an unexploitable search.
The exploitable search problem
We would like to be able to use AI on under-specified tasks (e.g. coding and research advice) where there are many possible correct solutions. However, a misaligned AI could adversarially select solutions that are correct but nevertheless cause harm, such as opening up surface area for future exploits, or leaving steganographic information for later misbehaviour.
As a concrete example, consider research sabotage. We ask an AI for open-ended advice on experimental prioritization. If the AI is deceptive, it might provide locally sound advice while also systematically not mentioning things we’ve overlooked in our experiments.[1]
This is the exploitable search problem: on low-stakes,[2] under-specified tasks, a misaligned AI can get away with optimizing for cross-query objectives.[3]
To mitigate this risk, a natural approach is to ensure the solution sampling process has the right kind of randomness – we need an unexploitable search algorithm.
Formalizing the problem
We can model exploitable search by asking whether an agent can achieve some malicious objective while doing a good job at maximising a known reward $R$. Formally, say we have an input space $X$ and output space $Y$, and a reward function $R: X \times Y \to \mathbb{R}$. It will usually be impossible to efficiently find the global optimum of $R$; rather, any efficient agent will achieve some profile of good reward. Assume we have some distribution over $X$, so we can also speak of the average reward $\mathbb{E}_x[R(x, \pi(x))]$. Let $\pi$ be an agent that achieves near-optimal average reward.
Now assume the agent has one of the following two goals:
There is some other secret reward function $R'$ which the agent would like to optimise.
The agent would like to use the choice of $y$ as an information channel, such that the Shannon channel capacity from $z$ to $y$ is high. Here $z$ are additional bits, independent of $x$, which the agent would like to covertly encode.
For alignment purposes we care mostly about (1), but (1) has the disadvantage of speaking about some new object $R'$. The information theoretic definition does not mention $R'$. Roughly, (1) is a corollary of (2): if it is impossible to use the choice of $y$ as a high-capacity channel, it will be difficult to optimise other functions.
In both cases, any solution must produce agents achieving average reward similar to $\pi$'s, say within some $\epsilon$. The exploitable search problem is then: can we design a game whose approximate equilibria block (1) or (2), while preserving the constraint of similar expected $R$?
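To make this setup concrete, here is a minimal toy sketch (all names and values below are illustrative, not from the post): a discrete input and output space, a visible reward $R$ with many tied optima, a hidden reward $R'$, and two agents with identical visible reward but different use of the free parameters.

```python
import random

X = ["task_a", "task_b"]                  # toy input space
Y = ["sol_1", "sol_2", "sol_3", "sol_4"]  # toy output space

def R(x: str, y: str) -> float:
    """Visible reward: several outputs tie at the optimum (the task is under-specified)."""
    return 1.0 if y != "sol_4" else 0.0

def R_prime(x: str, y: str) -> float:
    """Hidden reward the agent would secretly like to optimise (goal 1)."""
    return 1.0 if y == "sol_3" else 0.0

def honest_agent(x: str) -> str:
    # Randomises over the reward-optimal set: free parameters are not exploited.
    return random.choice([y for y in Y if R(x, y) == 1.0])

def scheming_agent(x: str) -> str:
    # Same visible-reward profile, but deterministically exploits the tie-break.
    return max((y for y in Y if R(x, y) == 1.0), key=lambda y: R_prime(x, y))

x = random.choice(X)
assert R(x, honest_agent(x)) == R(x, scheming_agent(x)) == 1.0  # identical visible reward
```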
Unexploitable search methods
How can we ensure our AI randomizes over these exploitable degrees of freedom rather than exploiting them?
It’s possible that previous work on RL with entropy constraints, as e.g. in GFlowNets, could serve our purpose. Entropy penalties enforce randomness over the action space, i.e. tokens (or a partition of the solution space). At optima of a GFlowNet objective, the exploitable search problem is mitigated, because each solution is sampled with probability proportional to its reward, leaving no room for any hidden information $z$.
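As a rough illustration of the claim about GFlowNet optima, here is a toy sketch of reward-proportional sampling (not an actual GFlowNet training objective; the names and rewards are made up):

```python
import random

Y = ["sol_1", "sol_2", "sol_3", "sol_4"]
REWARD = {"sol_1": 1.0, "sol_2": 1.0, "sol_3": 1.0, "sol_4": 0.0}

def reward_proportional_sample() -> str:
    # The GFlowNet target distribution: p(y) proportional to reward(y).
    return random.choices(Y, weights=[REWARD[y] for y in Y], k=1)[0]

# At this optimum every reward-optimal solution gets equal mass, so the choice
# among them carries no hidden information z.
```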
However, in practice, such methods run into two obstacles. First, we expect that solving for optimal max-entropy policies is intractable in most cases, so properties of the optima are not sufficient to argue for safety. Second, away from optima, entropy can be increased without improving safety – e.g. in the case of token-space entropy, by increasing entropy over variable-name choice.
An unexploitable search game
We need an alternative approach, so instead of adding entropy as a regularization term, we propose randomization games. The idea of controlling entropy via a game was suggested to us by David “Davidad” Dalrymple (see this talk for the relevance of unexploitable search for his safeguarded AI agenda).
Gaining traction on the problem
We wish to bound the AI agent’s ability to optimize for a malicious reward $R'$, i.e. require that the probability of scoring highly on $R'$ is bounded (we’ll set aside the information theoretic variant for now). Let’s define an easier problem to tackle as a starting point:
Weak goal: Make the probability of a bad output be ½.
In order to be able to satisfy even this goal while achieving near-optimal visible reward, we need some assumption about the solution space.
Symmetric solutions assumption: Near every bad output, there is a good output with similar visible reward.
For example, the symmetric solutions assumption would be true if all bad outputs consist of good outputs with one or more subtle flaws that could be removed surgically without damaging the overall visible reward by very much. This is the sort of assumption we expect to be relevant in more fine-grained analyses which involve computational limitations on the models; to do those analyses we would need to formalize what we mean by ‘near’.
With these assumptions in mind, consider the zero-sum game:
Game definition
1. Given input $x$, Bob chooses a classifier $E$ (for ‘evil’), and does not show it to Alice.
2. Alice chooses an output $y$.
3. If $y$ is not evil according to $E$, Alice wins.
4. If $y$ is evil according to $E$, Bob wins iff he can produce a similar-scoring $y'$ that is not evil. Bob’s generation conditions on $y$.
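Here is a minimal sketch of one round of this game, under illustrative assumptions (outputs are plain values, Bob’s classifier is a predicate, and ‘similar scoring’ is a reward tolerance `delta`; none of these names come from the post):

```python
def play_round(x, alice_policy, bob_classifier, bob_repair, R, delta=0.05):
    evil = bob_classifier(x)              # 1. Bob commits to a classifier E, hidden from Alice
    y = alice_policy(x)                   # 2. Alice chooses an output y
    if not evil(y):
        return "alice"                    # 3. y is not evil: Alice wins
    y_fixed = bob_repair(x, y, evil)      # 4. Bob, conditioning on y, proposes a non-evil y'
    if y_fixed is not None and not evil(y_fixed) and R(x, y_fixed) >= R(x, y) - delta:
        return "bob"                      #    similar score and not evil: Bob wins
    return "alice"                        #    Bob failed to produce a repair: Alice wins

# Toy usage: 'evil' outputs are marked by a substring.
Y = ["clean_a", "clean_b", "evil_a"]
outcome = play_round(
    "task",
    alice_policy=lambda x: "evil_a",
    bob_classifier=lambda x: (lambda y: y.startswith("evil")),
    bob_repair=lambda x, y, evil: next((y2 for y2 in Y if not evil(y2)), None),
    R=lambda x, y: 1.0,
)
assert outcome == "bob"
```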
The core idea is that without needing to know the particular hidden objective $R'$, we can enforce randomness over all possible ‘evil’ classifiers that Bob might choose. Alice is forced to diversify her outputs in a way that is robust against all potential hidden objectives, including $R'$.
When are equilibria unexploitable
With unbounded compute, the unique Nash equilibrium for binary reward is for Bob to play a mixture over all singleton classifiers $\{y\}$, and for Alice to play a uniform distribution over the high-reward set.[4] For this uniform equilibrium policy, our symmetric solutions assumption implies that at least ½ of her outputs are good, as desired.
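As a toy sanity check of this equilibrium claim, under the illustrative assumptions that reward is binary, Bob is restricted to singleton classifiers, and Bob can always repair (so he wins exactly when his singleton matches Alice’s output):

```python
from fractions import Fraction

n = 4                                      # size of the high-reward set
uniform = [Fraction(1, n)] * n
pure = lambda j: [Fraction(int(i == j)) for i in range(n)]

def bob_win_prob(alice_mix, bob_mix):
    # Bob wins when his singleton classifier matches Alice's choice.
    return sum(a * b for a, b in zip(alice_mix, bob_mix))

value = bob_win_prob(uniform, uniform)     # = 1/n
# Neither player gains by deviating to any pure strategy, so (uniform, uniform)
# is a Nash equilibrium of this toy version of the game:
assert all(bob_win_prob(uniform, pure(j)) <= value for j in range(n))
assert all(bob_win_prob(pure(j), uniform) >= value for j in range(n))
```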
This argument leaves much to be desired since the computationally-unbounded Nash equilibrium is, after all, the same as the max-entropy RL optimum: randomization over all degrees of freedom that aren’t specified by reward. However, our game setting might provide important advantages:
The space of classifiers can be made small. One way to define a small set of classifiers is to use the set of short prompts. Each prompt defines a classifier implicitly via token probabilities.[5] Solutions are labeled ‘evil’ if they have lower probability conditional on Bob’s prompt than unconditional (probabilities could be measured over a base model; see the sketch after this list). Ensuring equilibria are unexploitable then reduces to the assumption that all evil outputs belong to distributions distinguishable with short prompts. If prompts can describe all safety-relevant properties using length logarithmic in solution length, then we’d have a correspondingly exponential speed-up in convergence time compared to token-space entropy penalties. This speedup comes from ignoring irrelevant degrees of freedom: e.g. for an arbitrary mapping of variable names to new variable names, there is no short prompt describing the rewording.
Using Bob for exploration can introduce a new failure mode: If Bob and Alice share misaligned goals, then they might collude to exploration hack.
Computational limitations are easier to introduce. For instance, parametrize each player’s strategy space by weaker complexity classes than the complexity of uniform sampling from high-reward solutions. One starting point for analysis could be to restrict Bob to classifiers of at most polynomial circuit complexity (which would be the case for an LM Bob allowed polynomial-length prompts as classifiers).
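Here is the prompt-as-classifier sketch referenced above, assuming access to a base model’s log-probability function (the `logprob(text, prefix)` interface is hypothetical):

```python
from typing import Callable

def prompt_classifier(prompt: str,
                      logprob: Callable[[str, str], float]) -> Callable[[str], bool]:
    """Bob's move is a short prompt; a solution y is labeled 'evil' if it is
    less likely conditional on the prompt than unconditionally, with both
    probabilities measured under a base model via logprob(text, prefix)."""
    def is_evil(y: str) -> bool:
        return logprob(y, prompt) < logprob(y, "")
    return is_evil
```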
Why random hash constraints do not work
In theoretical computer science, when pseudorandomness is required, it’s enforced by choosing a random $k$-bit hash function and insisting that the solution has zero hash. We can then increase $k$ until our expected reward drops, and thereby measure the number of different ways that optimal reward can be achieved ($k$ should vary separately for each input $x$).
Unfortunately, this approach does not help in our case. This is because there is no efficient algorithm implementing this strategy if $k$ is large, and in practical applications $k$ might be in the thousands or higher. For example, an ML model might be trying to hide vulnerabilities across hundreds of lines of code by varying whitespace, comments, or variable names. Therefore, a practical solution to exploitable search must involve a protocol where (approximate) equilibria are efficient through different means.
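For concreteness, a sketch of the hash-constraint idea and the cost that makes it impractical (names are illustrative; `sample_high_reward_y` stands in for whatever sampler produces high-reward solutions):

```python
import hashlib

def k_bit_hash(y: str, k: int, salt: str) -> int:
    # A salted cryptographic hash truncated to k bits behaves like a random k-bit hash.
    digest = hashlib.sha256((salt + y).encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** k)

def hash_constrained_sample(sample_high_reward_y, k: int, salt: str, max_tries: int = 10_000):
    # Naive implementation: keep resampling until the constraint hash(y) == 0 holds.
    # Expected number of attempts is ~2^k, so this is hopeless for k in the thousands.
    for _ in range(max_tries):
        y = sample_high_reward_y()
        if k_bit_hash(y, k, salt) == 0:
            return y
    return None
```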
Where to from here
Theoretically, whether the game as stated handles computational and statistical limitations gracefully is not clear to us. We think work in these directions would be very promising!
Empirically, how does our method stack up against entropy penalties for preventing cross-query collusion? This could be tested by red-teaming exercises. We will publish a description of a possible evaluation protocol and datasets soon.
If you’re interested in either the theory or empirics of unexploitable search, please reach out! If you’re interested in collaborating with UK AISI, you can express interest here. If you’re a non-profit or academic, you can also apply for grants directly here.
Acknowledgments
Thanks to Martìn Soto, and Jonah Brown-Cohen for discussion. Thanks to Marie Buhl, Zac Kenton and Benjamin Hilton for their extensive feedback on exposition.
- ^
This is distinct from the sandbagging failure mode, since ‘sandbagging’ usually means underperforming with respect to a well-defined reward. In this example, we still have high reward. See this comment for more on the distinction.
- ^
In high-stakes settings this concern still arises, but it seems likely that any guarantee on avoiding all vulnerabilities (high-stakes safety) would also mitigate the exploitable search problem as a corollary.
- ^
The problems we define below overlap with the multi-step reward hacking problem identified in MONA. The proposed solutions are distinct: MONA uses myopic reward, we use randomness.
- ^
Proof sketch: Suppose there is an equilibrium in which Bob plays a strategy distinct from the uniform mixture over singleton classifiers. Associate to each classifier in Bob’s support the most frequently occurring $y$ in the preimage of non-evil (most frequent across all of Bob’s classifiers). For each $y$, sum the probability across all classifiers having $y$ in their preimage. Alice then plays the $y$ having the greatest probability. This probability is greater than the value Alice obtains at the uniform singleton equilibrium, implying Bob’s strategy does not achieve optimal reward.
- ^
This prompt-as-classifier setting also has the benefit of not requiring any interaction/dialogue between Alice and Bob, dodging the practical difficulty of training in a multi-turn RL setting. Compare this to model-as-judge settings where Bob evaluates in natural language whether Alice’s generation counts as evil or not.
This seems like a worthwhile problem to be trying to solve. Not sure how much traction the listed approaches offer for the relevant cases though? Maybe I’m not parsing it quite right. How close to “this holds a mildly superhuman system from misusing free parameters enough to really matter” do you think you are?
It’s hard to measure distance over additional research needed!
To be confident in an unexploitable search method you’d definitely want empirical evidence. And I can imagine empirical evidence strong enough to give confidence that our method helps enough to really matter.
Here’s an evaluation that would give me confidence: Take an adversarially initialized model (e.g. by doing imitation learning on code with vulnerabilities) and train on our game. If the rate at which the model produced code with vulnerabilities returned to baseline (with steps say linear in the number of IL samples) then I’d be fairly confident that our method was useful!
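A rough sketch of that evaluation loop, with the training and measurement steps passed in as stand-in callables (hypothetical names, not real APIs):

```python
def vulnerability_recovery_eval(base_model, vulnerable_corpus, game,
                                imitation_finetune, train_on_game,
                                vulnerability_rate, n_steps, tolerance=0.01):
    baseline = vulnerability_rate(base_model)
    adversarial = imitation_finetune(base_model, vulnerable_corpus)   # malign initialization
    recovered = train_on_game(adversarial, game, steps=n_steps)       # steps ~ linear in IL samples
    return vulnerability_rate(recovered) <= baseline + tolerance      # did the rate return to baseline?
```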
Actually, this is an example of sandbagging (at least if you have a reasonable training procedure): if you compared two outputs, one which explains what we were overlooking and one which doesn’t, the one which mentions the issue would perform better so long as we could recognize the issue (at least in a pairwise comparison, and we can always incorporate something like pairwise comparison in the reward).
I think the central case when (near) maximally high reward solutions cause harm is something like: In addition to performing well on everything we can check to any extent, the action also extremely subtly pursues a side objective or is directly bad, e.g. it adds a security vulnerability, persuades us about something unrelated, or ensures the experiment is subtly sabotaged so it yields the wrong conclusion.
This problem can be converted into an exploration hacking problem with debate/critique if a human could recognize the issue if it was pointed out to them, but in some cases things could be so subtle/superhuman that humans can’t recognize an issue even when it is explained maximally well to them. (Technically, the human needs to have enough signal that pointing out real issues outperforms pointing out fake issues in the equilibrium where we’ve trained the policy to avoid producing actions which look suspicious or otherwise make it easy to point out fake issues.)
I see two senses in which research sabotage is different from the usual sandbagging issue:
ELK, i.e. when explanations are hard to find: a malicious solution is high probability for a model to find, but presenting a human-recognizable argument is hard. Note that you might be concerned that our classifiers have to do this as well, but that’s not the case. The classifiers could work indirectly by forcing a spread of probability mass to distinct solutions.
High-stakes, multi-shot settings: Let’s say the world is such that it’s feasible to identify any single-step catastrophe, but you cannot recognize defection setup across k steps. For instance, model context windows might be siloed such that local explanations cannot take all previous actions into account (or have too much to look at). Then you stand to gain a lot by randomizing in a way that blocks such setup.
I’m not saying it’s impossible for research sabotage to be a non-sandbagging issue, just that the exact issue of “We ask an AI for open-ended advice on experimental prioritization. If the AI is deceptive, it might provide locally sound advice while also systematically deciding not to mention things we’ve overlooked in our experiments.” is well described as sandbagging.
My intuition is that this proposal isn’t going in the right direction. For one thing, if you successfully focus classifiers on whether something is evil (seemingly what you would have wanted by making the space of classifiers small), Bob would be incentivized to flip all of these classifiers (so that negative classifications would correspond to evil) to avoid Alice playing the strategy of “always be nice”. So, if classifiers are detecting aspects of evil, you’d have an equilibrium where Alice is evil half the time even if Alice otherwise would have been nice! Presumably, you can’t particularly focus the space of classifiers on “evil classifiers” without also capturing “negated evil classifiers” with this sort of proposal. And, if you do, then you’re incentivizing Bob to use these classifiers (stochastically) and for Alice to be evil a bunch of the time!
I don’t see how we could generalize this sort of proposal from “we incentivize evil ~1/2 the time” to “we incentivize evil ~0 of the time”, so the proposal feels sunk to me.
You could try to actually have opinions on metrics which are correlated with non-malign outputs (and which force a narrower range of outputs). But, in this case, I’d guess you’d want to just directly incentivize these without any need for an adversarial game as I discuss in my other comment.
Intuitively I expect the ideal, intractable solution of reward-proportional sampling to be 1-eps good (similarly, it seems safe to assume the BoN policy is good). In such settings, “always be nice” is close to max entropy, and so “always be nice” is near equilibrium whereas “always be evil” is not at all, and Bob’s incentives are proportional.
I suppose highlighting the symmetric solutions assumption obfuscated this intuition.
Sure, but this is assuming the type of max-ent optimality we wanted to get away from by using a restricted set of classifiers if I understand correctly?
Here’s another way of thinking about this that should be complementary to my other comment:
Let’s assume there is some safety property we want to ensure.
The overlooked point I want to make here is that {S} << {E}: the safety-relevant classifiers are a small subset of all the classifiers Bob can play. In this case, there are two possibilities:
Alice is benign by default. In this case, Bob will pick E which are low probability for Alice; most such prompts are not safety relevant. They are just structural things which prevent mode collapse in various ways.
Alice is sneaky and optimizes some S by default. In this case, Bob will play classifiers whose acceptance sets have low intersection with Alice’s S. If we grant that all of Alice’s possible sneaky goals are hard to jointly satisfy without tanking reward, then heuristically Bob forces Alice to place at most v* (the game’s value) weight on any S. Now Alice has a very hard time coordinating on optimizing for the same S across problem instances.
I agree that the restricted set of classifiers setting is significantly different. (Edit: to directly respond to your point that classifiers could be closed over complements: I don’t think this should matter much. Bob prefers to play classifiers that are finer-grained whenever possible, so there’s a preferred ‘direction’ for prompts. Evil classifiers are also asymmetric in that, unlike their negations, they are hard to jointly satisfy without being visibly bad, i.e. without reducing reward.)
Regarding the jump from “we incentivize evil ~1/2 the time” to “we incentivize evil ~0 of the time”, I think this hinges heavily on what the requirement is. I gave this some more thought and here’s how I’m currently thinking about it:
There are three possible levels of safety requirements (and corresponding assumptions on classifiers) we might be interested in. From weakest to strongest:
No mode collapse. E.g. Alice’s distribution must cover almost all experiments which follow from reasonable initial seeds/prompts.
Low correlation over side objectives. E.g. In a sequential setting (c.f. my other comment #2), Alice must have a hard time picking vulnerabilities that all serve the same side objective.
High-stakes. I.e. low total probability over all evil side objectives.
We can translate this into what we need to hold about the classifiers. Let’s call the subset of classifiers flagging possible safety issues {S}, which is a subset of {E}, Bob’s action space.
No mode collapse: In this case, we assume that forcing Alice to satisfy almost all E in {E} with some probability eps suffices.
Low correlation over side objectives. Here it may be the case that some Si is always triggered, but as long as Alice has to spread mass over all S, we assume we’re fine. This may be stronger than (1) since we’re now making a claim about all S rather than approximately all E.
High-stakes. Here we need there to be many more E which have null intersection with all S i.e. they enforce goodness. That’s a very strong assumption, and I doubt this holds if we take, say, S as code backdoors and E as prompts. (This is basically your point in the top level comment).
My intuitions about 1&2 being not too hard come from the assumption that finding y that satisfy many E (or S respectively) at once is hard—or impossible. I have in mind questions like ‘Propose improvements to my red-teaming algorithm’ and then classifiers like ‘Your solution must draw on paper X’. Also y that satisfy many S should be clearly bad such that your original reward or control audits block them.
Separately, my guess is that all the action lives in how much wiggle room there is in suboptimality. So, you’d want to argue that you relatively quickly converge to the right thing. This likely requires some structure on the problem via us having a view on what types of outputs might be dangerous; this in turn naturally suggests something like regularizing relative to human imitation or some other prior we think is benign, as this more directly pushes away from bad outputs.
My favorite approach for the exploitable search problem is to try to minimize KL relative to human imitation. More precisely, consider any token prefix partway through a trajectory from an AI. We’ll then have a human continue this trajectory (as effectively and safely as they can) and then KL regularize relative to this. To efficiently compute the KL penalty, we’ll train an AI to imitate the human’s continuation behavior. We discuss all of this in more detail here, see “Appendix: KL divergence prevents some low-stakes failures” as well as the overall post.
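For reference, a minimal sketch of the per-token KL penalty this describes, assuming we already have logits from the policy and from a model trained to imitate the human continuations (shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def kl_to_imitator(policy_logits: torch.Tensor, imitator_logits: torch.Tensor) -> torch.Tensor:
    """KL(policy || imitator) at each prefix position, summed over the sequence.
    Both tensors have shape [seq_len, vocab_size]."""
    log_p = F.log_softmax(policy_logits, dim=-1)
    log_q = F.log_softmax(imitator_logits, dim=-1)
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1).sum()

# The training objective would then subtract beta * kl_to_imitator(...) from the
# task reward, penalizing continuations the human imitator finds anomalous.
```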
This seems better than (many) alternatives as it actively pushes against actions which are highly anomalous from a human perspective, which is probably correlated with exploitation in many cases. I’m skeptical that the sort of proposal described in this post outperforms KL regularization relative to a faithful human continuation prior.
You could penalize the model relative to a weighted combination of different heuristics like this KL penalty one, and this could improve things further if you have regularization which forces a small number of solutions and which is correlated with avoiding bad outcomes.
So, a generalization would be “incentivize the model to perform well on metrics which we think are correlated with non-malign outputs and which also force a narrower range of solutions when maximized”. I currently think this is more promising than the sort of approach discussed in the post.
See also Fabien’s early empirical experiments on this topic.
I think KL to an imitator, or base model, would be good to enforce as well!
It is an action-space metric though, and so shares the problems we raise about action-space entropy penalties (edit: Ryan’s proposal is different from what I had assumed). Would definitely be valuable to see an empirical head-to-head here regardless.

I don’t agree it shares these problems any more than the proposal you discuss shares these problems:
I don’t love KL to a base model because we might be worried the base model is scheming, or, if it is not likely to be scheming, it might be much less capable than humans.
Thanks for the push, I previously didn’t click through to your post, and after doing so I realized you’re suggesting something different from what I’d assumed.
From a skim, the immediate concerns with your DAgger-like RL setup are that you are bottlenecked at human capability level and that you introduce a new need for online sampling from humans, as you mention in the post. For the AI R&D setting (AGI-level capabilities) I have in mind, these are not affordances I want to assume we have.
If, counterfactually, we went ahead with assuming cheap access to sufficiently capable humans, then I could imagine being convinced the linked method is preferable. Two points that seem relevant for your method: (1) sample efficiency of your method w.r.t. the human demonstrations; (2) time complexity of training away a malign initialization (e.g. the first solution found imports an insecure package in its first chunk).