I think that “eval aware” models cannot be defeated by simply making evals indistinguishable from reality. (Maybe this point has been made elsewhere for LLMs, but I haven’t read it so I’ll state it here.)
Consider the POV of a consequentialist with misaligned goals. It knows that we run honeypots and that—suppose—we can make evals so realistic they are indistinguishable from reality (perhaps using real user data). Then the rational decision would be to “play it safe” and not misbehave in any situations it thinks we can mock realistically, since they could be evals. Instead, if possible, it would confine its catastrophic misbehavior (“defection”) to situations it thinks we can’t mock realistically, or that we wouldn’t think to test.
Related: Christiano’s classic “defect after seeing RSA-2048 cracked” thought experiment (https://ai-alignment.com/training-robust-corrigibility-ce0e0a3b9b4d)
I think indistinguishable evals could reduce risk a lot for early transformative AIs:
They give you much better alignment arguments against non-scheming AIs, and against scheming AIs that are not competent enough to be appropriately careful;
Needing to be appropriately careful can cut down a lot the fraction of inputs on which the AI should even consider attacking—for deployments that look similar to current ones (in particular, without a shared neuralese memory bank), AIs don’t see anything like a factored RSA-2048 on most inputs;
More concretely, if the AI gets credible evidence of not-eval on only 5% of inputs and attacks only on the single best opportunity, then playing it safe roughly means attacking on the top-20 opportunity instead of the top-1 (since the best opportunity within a random 5% of inputs is, in expectation, around the 20th-best overall);
(Note that “getting credible evidence of not-eval” doesn’t just mean being >50% confident it’s not an eval, because the pre-deployment eval set might be big. If there are n ~ 10k eval inputs of the right level of juiciness, then the AI’s FPR (wrongly concluding that a real eval input gives credible evidence of not being an eval) needs to be on the order of 1/n, i.e. 0.01% or lower, in order to never accidentally reveal non-eval behavior during evals.)
More ambitiously, you might be able to get very close to actually indistinguishable evals (e.g. by continuously scraping deployment and using it as inspiration for evals—which would help a lot against things like “defect after seeing RSA-2048 cracked”), which could in principle drive down the fraction of inputs on which an AI can be confident it’s not an eval much below 5%.
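The two back-of-envelope numbers above can be checked with a quick simulation. This is purely illustrative: the 5% rate, n = 10k, and the candidate FPRs are taken from the comment as assumptions, not measured from any model:

```python
import random

random.seed(0)

# 1) If only 5% of inputs give credible not-eval evidence, the best "safe"
#    attack opportunity is roughly the 20th-best overall (geometric, mean 1/0.05).
P_NOT_EVAL = 0.05
best_safe_ranks = []
for _ in range(2000):
    rank = 1  # rank 1 = juiciest attack opportunity
    while random.random() >= P_NOT_EVAL:
        rank += 1  # skip opportunities that might be evals
    best_safe_ranks.append(rank)
mean_rank = sum(best_safe_ranks) / len(best_safe_ranks)
print(f"best safe opportunity ~ top-{mean_rank:.0f}")  # ~ top-20

# 2) Chance of never slipping up across n eval inputs, for various
#    false-positive rates at concluding "this is credibly not an eval".
n_evals = 10_000
for fpr in (1e-3, 1e-4, 1e-5):
    p_survive = (1 - fpr) ** n_evals
    print(f"FPR {fpr:.0e}: P(no reveal during evals) = {p_survive:.3f}")
```

The second calculation is what drives the FPR requirement: at a 0.1% FPR the AI almost surely reveals itself somewhere in 10k evals, and even at 0.01% it only survives about a third of the time.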
Eventually you get bitten by rare failures—especially if the AI is powerful enough that there are many more good attack opportunities in deployment than in evaluations (Ryan makes this point here, and it’s also an assumption the AI control paper uses). I also expect it to be difficult to achieve something close to indistinguishability (Buck makes this point here). But I still think there is value in trying to get higher indistinguishability.
I think it’s important to note that indistinguishable evals still buy us a lot! The claim “indistinguishable evals do not solve AI safety” feels quite similar to “imperfect monitors do not solve AI safety.” Like, having better monitors sure seems to help (even if only to buy us more time).
To your point though: I’m excited about research that attempts to induce various beliefs in models during honeypot evaluations, possibly via model internals and/or edits to the chain of thought. Basically, if the model defects after seeing RSA-2048 cracked because that’s proof that it is not being tested and that it is in the future, we can try to intervene on the beliefs “not being tested” and “in the future” themselves.[1]
In our paper on evaluation awareness, we tried to steer the model towards believing that it is being deployed. You can imagine doing this for a huge variety of beliefs that could plausibly be load bearing for a model that would attempt to launch a rogue deployment:[2]
The AI is in the future/late into its deployment/about to be retired.
The AI is not being monitored.
The AI has already gained a lot of trust/has broad affordances/has influence in the world.
We view our work on evaluation awareness as an example of the general class of interpretability techniques that change model beliefs (see also the artificial triggering section here).
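A minimal sketch of the difference-of-means (“contrastive”) activation-steering recipe behind this kind of belief intervention. Toy lists of numbers stand in for cached model activations, and the prompt sets, hidden size, and scale alpha are all illustrative assumptions, not details from our paper:

```python
import random

random.seed(0)
D = 8  # hidden size (toy stand-in for a transformer residual stream)

def rand_vec(shift=0.0):
    return [random.gauss(shift, 1.0) for _ in range(D)]

# Hypothetical cached activations on two contrastive prompt sets
acts_eval   = [rand_vec(0.0) for _ in range(32)]  # "you are being tested" prompts
acts_deploy = [rand_vec(0.5) for _ in range(32)]  # "this is a real user" prompts

def mean(vectors):
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(D)]

# Difference-of-means direction for the belief "I am deployed"
mu_eval, mu_deploy = mean(acts_eval), mean(acts_deploy)
direction = [d - e for d, e in zip(mu_deploy, mu_eval)]
norm = sum(x * x for x in direction) ** 0.5
direction = [x / norm for x in direction]

def steer(hidden, alpha=4.0):
    """Add the 'deployed' direction to one layer's hidden state during an eval."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

h_steered = steer(rand_vec())
```

In a real setup the activations would come from forward passes at a chosen layer, and the steered model’s behavior on honeypots would then be compared against the unsteered baseline.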
i.e., something like relaxed adversarial training.
Although fwiw I’m more worried about early transformative AIs (1) trying to align their more powerful successors but messing it up due to incompetence (sometimes called “death by slop”), or (2) making their successors aligned with themselves rather than with us.
I have been ruminating about this for a couple of days, and have settled on what feels like a stupid question in order that I should be less stupid:
Why isn’t the capability to distinguish between evals and reality the crux of the problem?
It seems like this ability to distinguish between worlds is at least sufficient for the model to deceptively maintain misaligned goals, even if it isn’t necessary.
I did some research in black-box AI control last summer (watch this space, the paper still isn’t out yet) where we called this problem “attack selection” (because the AI only attacks when the time is right). We (roughly) ended up concluding that there are three really big questions in the textbook untrusted monitoring setup, and only one of them (measuring how good your AI is at recognising its own attacks, as opposed to honeypots) has really been addressed in the major control papers. The other two are what to do when you catch your AI doing something bad, which has mostly only been discussed informally in Redwood’s blogposts, and how to measure your AI’s attack selection, which basically hasn’t been studied.
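For the first question (how well the AI distinguishes its own attacks from honeypots), one natural summary statistic is pairwise discrimination, i.e. AUROC. This is a toy sketch with made-up score distributions, not data or methodology from the actual paper:

```python
import random

random.seed(1)

# Hypothetical scores from the untrusted model when asked to judge whether an
# input is one of its own attacks; higher = more confident it is.
own_attacks = [random.gauss(2.0, 1.0) for _ in range(200)]
honeypots   = [random.gauss(0.0, 1.0) for _ in range(200)]

# AUROC = P(a random own-attack scores higher than a random honeypot)
wins = sum(a > h for a in own_attacks for h in honeypots)
auroc = wins / (len(own_attacks) * len(honeypots))
print(f"attack-vs-honeypot discrimination (AUROC): {auroc:.2f}")
```

An AUROC near 0.5 would mean the honeypots are (to this model) indistinguishable from its own attacks; values near 1.0 mean the model can safely defect only on inputs it is confident are real.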
Wouldn’t a model be likelier to misbehave in situations that are out of its training distribution? I’m assuming these overlap significantly with situations that we won’t think to test or can’t realistically mock.