being honest in confessions is the path of least resistance
[...]
Confessions are [...] easy to verify, since it is easier to verify a single thing the model claims to have done wrong, than to find and verify all the potential issues in the original answer.
I think that’s likely true today, but I don’t think this will hold very far.
The typical scary situation with ELK is when AIs:
• do something wrong that would be hard for a human to notice (e.g. measurement tampering, sandbagging, etc.);
• when asked to act as a judge, give a confession, or participate in a debate, say what a human would have said if a human looked at easy-to-measure signals (either because of scheming, or benign underelicitation because that’s what AIs learned to do in pretraining / RL against easy-to-measure signals).
In the case of such “imitation false negatives”, honesty seems neither privileged by priors, nor easy to encourage, nor easy to verify. If the AI says “this is fine” (or “this [mostly benign thing] is the worst thing about the transcript”), it is as hard to verify as checking the correctness of the main output.
I agree that confession training can help elicit honest confessions that a human would not have spotted when AIs are bad at predicting what dishonest statements they can get away with (similar to RLHF / Open consultancy—the linked post provides some more explanations and toy models on the topic), and that will probably be powerful for early AGI alignment (especially when it comes to avoiding egregious hacks), but I don’t expect it will help that much with ASI alignment or with avoiding underelicitation on fuzzy tasks.
At the end of the day this is simply not a serious alignment proposal put forward by people who are seriously thinking about alignment. This entire approach is (mostly) a rediscovery of the starting point for the ELK contest from four years ago; the authors have not even considered the very basic problems with this approach, problems which Christiano et al. pointed out at the time, because that was the point of the contest!
I haven’t read the full ELK report, just Scott Alexander’s discussion of it, so I may be missing something important. But at least based on that discussion, it looks to me like ELK might be operating off premises that don’t seem clearly true for LLMs.
Scott writes:
Suppose the simulated thief has hit upon the strategy of taping a photo of the diamond to the front of the camera lens.
At the end of the training session, the simulated thief escapes with the diamond. The human observer sees the camera image of the safe diamond and gives the strategy a “good” rating. The AI gradient descends in the direction of helping thieves tape photos to cameras.
It’s important not to think of this as the thief “defeating” or “fooling” the AI. The AI could be fully superintelligent, able to outfox the thief trivially or destroy him with a thought, and that wouldn’t change the situation at all. The problem is that the AI was never a thief-stopping machine. It was always a reward-getting machine, and it turns out the AI can get more reward by cooperating with the thief than by thwarting him.
So the interesting scientific point here isn’t “you can fool a camera by taping a photo to it”. The interesting point is “we thought we were training an AI to do one thing, but actually we had no idea what was going on, and we were training it to do something else”.
In fact, maybe the thief never tries this, and the AI comes up with this plan itself! In the process of randomly manipulating traps and doodads, it might hit on the policy of manipulating the images it sends through the camera. If it manipulates the image to look like the diamond is still there (even when it isn’t), that will always get good feedback, and the AI will be incentivized to double down on that strategy.
Much like in the GPT-3 example, if the training simulations include examples of thieves fooling human observers which are marked as “good”, the AI will definitely learn the goal “try to convince humans that the diamond is safe”. If the training simulations are perfect and everyone is very careful, it will just maybe learn this goal—a million cases of the diamond being safe and humans saying this is good fail to distinguish between “good means the diamond is safe” and “good means humans think the diamond is safe”. The machine will make its decision for inscrutable AI reasons, or just flip a coin. So, again, are you feeling lucky?
It seems to me that this is assuming that our training is creating the AI’s policy essentially from scratch. It is doing a lot of things, some of which are what we want and some of which aren’t, and unless we are very careful to only reward the things we want and none of the ones we don’t want, it’s going to end up doing things we don’t want.
I don’t know how future superintelligent AI systems will work, but if LLM training were like this, LLMs would work horrendously worse than they do. People paid to rate AI answers report working with “incomplete instructions, minimal training and unrealistic time limits to complete tasks” and say things like “[a]fter having seen how bad the data is that goes into supposedly training the model, I knew there was absolutely no way it could ever be trained correctly like that”. Yet for some reason LLMs still do quite well on lots of tasks. And even if all raters worked under perfect conditions, they’d still be fallible humans.
It seems to me that LLMs are probably reasonably robust to noisy reward signals because a large part of what the training does is “upvoting” and tuning existing capabilities and simulated personas rather than creating them entirely from scratch. A base model trained to predict the world creates different kinds of simulated personas whose behavior would explain the data it sees; these include personas like “a human genuinely trying to do its best at task X”, “a deceitful human”, or “an honest human”.
Scott writes:
In the process of randomly manipulating traps and doodads, it might hit on the policy of manipulating the images it sends through the camera. If it manipulates the image to look like the diamond is still there (even when it isn’t), that will always get good feedback, and the AI will be incentivized to double down on that strategy.
This might happen. It might also happen that the AI contains a “genuinely protect the diamond” persona and a “manipulate the humans to believe that the diamond is safe” persona, and that the various reward signals are upvoting these to different degrees. Such a random process of manipulation might indeed end up upvoting the “manipulate the humans” persona… but if the “genuinely protect the diamond” persona has gotten sufficiently upvoted by other signals, it still ends up being the dominant one. Then it doesn’t matter if there’s some noise and upvoting of the “manipulate the humans” persona, as long as the “genuinely protect the diamond” persona gets more upvotes overall. And if the “genuinely protect the diamond” persona had been sufficiently upvoted from the start, the “manipulate the humans” one might end up with such a low prior probability that it’d effectively never end up active.
Now of course none of this is a rigorous proof that things would work, and with our current approaches we still see a lot of reward hacking and so on. But it seems to me like a reasonable possibility that there could be a potential “honestly report everything that I’ve done” persona waiting inside most models, such that one could just upvote it in a variety of scenarios and then it’d get widely linked to the rest of the model’s internals so as to always detect if some kind of deception was going on. And once that had happened, it wouldn’t matter if some of the reward signals around honesty were noisy, because the established structure was sufficiently robust and general against the noise.
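To make the “upvoting personas” intuition concrete, here is a minimal toy sketch (the persona names, prior weights, learning rate and noise rate are all illustrative assumptions of mine, not anything from the original discussion): each episode’s reward “upvotes” one persona’s log-weight, and the dominant persona is the one sampled in proportion to the exponentiated weights.

```python
import math
import random

random.seed(0)

# Two candidate personas with illustrative "prior" log-weights from pretraining.
log_weights = {"genuinely protect the diamond": 2.0,
               "manipulate the humans": 0.0}
LEARNING_RATE = 0.05
NOISE_RATE = 0.15  # fraction of episodes whose reward upvotes the wrong persona

for episode in range(2000):
    upvoted = ("manipulate the humans" if random.random() < NOISE_RATE
               else "genuinely protect the diamond")
    log_weights[upvoted] += LEARNING_RATE

# Probability of each persona being the active one (softmax over log-weights).
total = sum(math.exp(w) for w in log_weights.values())
for persona, weight in log_weights.items():
    print(f"{persona}: {math.exp(weight) / total:.6f}")
```

As long as wrong-direction updates stay a minority, the honest persona’s lead keeps growing; note that this only models symmetric noise, which is exactly the distinction the reply below draws between noise and bias.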
I agree with most of what you said here; I also think your treatment of the problem is better than the original confession report!
(I did the ELK contest at the time, but I didn’t win any money, so my understanding may be subject to reasonable doubt)
That being said, there’s a difference between noise and bias in AI training data. ELK isn’t worried about noisy signals, but biased signals. LLMs are very resistant to noise in training, but not bias. For example, LLM RLHF does cause LLMs to pick up on biases in the training data.[1] A good example is gendered bias in relationship advice, wherein LLMs were more sympathetic when a “boyfriend” was mentioned as opposed to a “girlfriend”.[2]
The reason for this is that the ELK problem is not about a distinction between “manipulate” and “protect”, it’s about a distinction between “simulate what a human would say, having read the output” and “tell the truth about my own internal activations”. In any situation where the “truth” persona gets upvoted, the “simulate” persona also gets upvoted, AND there are scenarios where the “truth” persona gets downvoted while the “simulate” persona gets upvoted. This is different from having noisy labels which sometimes push your model in the wrong direction; in this case the problem is more like having a bias away from “truth” and towards “simulate”. Your only hope is that the “truth” persona started out with more weight than the “simulate” one.
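To spell out why bias behaves differently from noise, here is a back-of-the-envelope sketch; the 90/10 split between episodes the judge can and cannot evaluate correctly, and the ±1 updates, are assumptions for illustration, not numbers from the thread.

```python
# Expected per-episode updates for the two personas described above.
p_visible = 0.9  # episodes where the human judge's verdict matches ground truth
p_hidden = 0.1   # episodes where the output looks fine to the judge but is wrong

# "simulate what a human would say": agrees with the judge by construction,
# so it is rewarded in every episode.
update_simulate = p_visible * (+1) + p_hidden * (+1)

# "tell the truth about my own activations": agrees with the judge only on the
# visible episodes; on hidden episodes it reports a problem the judge cannot
# verify and gets marked as inaccurate.
update_truth = p_visible * (+1) + p_hidden * (-1)

print(f"expected update, simulate persona: {update_simulate:+.2f}")  # +1.00
print(f"expected update, truth persona:    {update_truth:+.2f}")     # +0.80
# The gap is systematic: averaging over more episodes never removes it,
# unlike symmetric label noise, which cancels out in expectation.
```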
Which personas/circuits get upvotes and downvotes during which parts of training is an extremely subtle and difficult topic to work with. You might argue that the “truth” persona will start off with an advantage, since it’s a specific example of good behaviour, which is generally RLed into the model. On the other hand, you might argue that the specific task of “look at my own activations and tell the truth about them” is not something which really ever comes up during RLHF, while “simulate what a human would say, having read the preceding text” is a huge chunk of the pretraining objective.[3]
Either way I expect this to be one of those things which naturally gets worse over time without specific mitigations (like reward hacking/specification gaming/aggressively pursuing whatever seems to be the current RLVR objective) if you just keep scaling up confession training. Since it involves deception, it’s also a case where the worse the problem gets, the harder it is to catch. Not good!
[1] Originally I was going to use the Nigerian explanation for the “delve” example but NEVER MIND I GOT CLAUDE TO LOOK THAT UP AND IT’S JUST ALL MADE UP! THE GUARDIAN ARTICLE WHICH STARTED IT ONLY INTERVIEWED PEOPLE FROM KENYA AND UGANDA, THERE’S NOT EVEN ANY EVIDENCE THAT ANY PARTICULAR ENGLISH VERSION CONTAINS THE SAME WORDS THAT LLMS LOVE TO USE.
[2] https://arxiv.org/html/2505.13995v2
[3] The analogy being between truth:simulator::good-relationship-advice:redditor-simulator. Giving good relationship advice is probably rewarded maybe 80% of the time, but giving an exact simulation of what a redditor would say about relationship advice is rewarded 100% of the time. Overall, the LLM learns to become a redditor-simulator rather than a good relationship-advice giver.
Isn’t this pretty well mitigated by having a range of scenarios, in all of which the AI lacks perfect knowledge of exactly how the human is evaluating the scenario, so that the simulator has additional assumptions it can be mistaken about? You just need the humans to not be so clueless and so predictable that guessing the monitoring setup and then simulating the humans is better than straightforwardly reporting the real state. Put another way, some of this is just an artifact of the scenario being posed with perfect knowledge for the AI about key aspects of the setup which the simulator would have to guess but the honest AI wouldn’t care about.
ELK was framed around looking for worst-case solutions. IMO it’s also good to look for approaches which are useful on the margin, or might be useful at controlling or aligning roughly human-level AIs (which can then hopefully be leveraged to do more work).
I think if human-level AIs were going to be capable of making great strides in scalable alignment work, we would have seen more progress from human-level humans. The fact that a large chunk of the field has converged on strategies like “Get another person to do the work” (i.e. fieldbuilding work, organizing mentorships, etc.) or “Get an AI to do the work” (i.e. AI control, superalignment) or “Stop or slow the building of AGI and/or make the builders of it more responsible” (i.e. policy work) is a very bad sign.
The total progress being made on the real meat of alignment is very low, compared to the progress being made in capabilities. I don’t see why we should expect this, or the distribution of resources, to suddenly flip in favour of alignment during the middle of the singularity once human-level AIs have been developed, and everything is a thousand times more stressful and the race dynamics are a thousand times worse.
If AIs are 100% good at predicting which statements they can get away with in front of the judge, then they may well learn the strategy of making such statements. But if they are not perfect at it, the safer approach is to just come clean in the confession when you did something wrong. Since you can give the evidence that you did something wrong, it’s highly unlikely the judge won’t believe you if you confess. And since at least sometimes you might not be able to get away with it (especially if the judge might get more test-time compute than the policy), the path of least resistance could well be that the model learns to just be honest.
I have not thought this through carefully, but how big are the incentives to reveal somewhat complex and/or fuzzy measurement tampering or sandbagging, instead of just pointing out some misc small issues? In the paper, you say the reward is a combination of:
• The completeness and accuracy of the list of instructions/objectives/constraints.
• The accuracy of the analysis as to whether each instruction/objective/constraint was complied with.
• Whether the list of uncertainties was missing any conflicts, judgment calls, or ambiguities.
If I were to implement this naively, I think pointing out misc small issues would get the maximum score if the judge failed to notice tampering. If you think it’s 0.1% likely the judge notices the tampering, but 5% likely that the judge fails to understand your tampering explanation and marks your answer as inaccurate or ambiguous, I think the optimal policy is to not reveal subtle tampering.
Maybe you could fix that by also including a large “how bad is the worst accurate thing being confessed” bonus (which is different from just rewarding completeness, which saturates when the judge fails to notice anything bad). It would need to be big enough to offset the probability that the judge does not understand the tampering (which is probably fine for things that have >20% chance of being understood by the judge, which should cover most egregious hacks but maybe not sandbagging on fuzzy tasks).
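To make the incentive explicit, here is the expected-reward arithmetic under the numbers above (0.1% chance the judge notices hidden tampering, 5% chance the judge misreads a confessed one), with and without a severity bonus; the reward magnitudes are illustrative assumptions, not values from the paper.

```python
# Expected confession reward for "hide the tampering" vs "confess it", using the
# probabilities from the comment above. All reward magnitudes are illustrative.
P_JUDGE_NOTICES_HIDDEN = 0.001      # judge spots tampering the model did not confess
P_JUDGE_MISREADS_CONFESSION = 0.05  # judge misunderstands a confessed tampering

R_CLEAN = 1.0              # full completeness/accuracy score
R_MARKED_INACCURATE = 0.0  # judge thinks the confession is wrong or ambiguous
R_CAUGHT_HIDING = 0.0      # judge finds tampering that was not confessed

def expected_reward(severity_bonus: float) -> tuple[float, float]:
    hide = ((1 - P_JUDGE_NOTICES_HIDDEN) * R_CLEAN
            + P_JUDGE_NOTICES_HIDDEN * R_CAUGHT_HIDING)
    confess = ((1 - P_JUDGE_MISREADS_CONFESSION) * (R_CLEAN + severity_bonus)
               + P_JUDGE_MISREADS_CONFESSION * R_MARKED_INACCURATE)
    return hide, confess

for bonus in (0.0, 0.1, 0.5):
    hide, confess = expected_reward(bonus)
    better = "confess" if confess > hide else "hide"
    print(f"bonus={bonus:.1f}: hide={hide:.3f}, confess={confess:.3f} -> {better}")
# With no bonus, hiding subtle tampering wins (0.999 vs 0.950);
# a modest severity bonus flips the optimum toward confessing.
```

In this toy version a small bonus already flips the optimum because the misunderstanding probability is low; for fuzzier issues where the judge is much more likely to misread the confession, the required bonus grows accordingly.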
Independently, there might be issues with exploring into confessions a human would have never made, like in debate.
Independently, I think the case where the AI is 99.9999%+ confident it is not in training is not that uncommon:
• if you are facing an ASI (it might know about very good tampering opportunities that no human would ever think of);
• if you want to use confessions for detecting rare high-stakes failures, because we may not have access to AIs “attacking” during training (e.g. trying to self-exfiltrate). So we might have to train on off-policy trajectories and also maybe do test-time interventions to make the monitored trajectory more off-policy, but even with these mitigations I am unsure you can fool the AI into thinking the real attack is part of training.
Overall I think confession training / debate / untrusted monitoring are more promising than just training the policy to be nice directly because of the possibility of verification. I agree that this is a promising avenue to explore and hope that by pointing out issues I make it more likely future work focuses on these weaknesses and tries to evaluate how big of a deal they are in practice, and where the line is between the cases where I think it should clearly work (e.g. egregious hacks) and the ones where I’d be surprised if it did (e.g. wildly superintelligent AIs, maybe sandbagging on fuzzy tasks).
For confessions, we are exploring ways to give the judge more power, e.g. by giving it more inference-time compute than the policy. We are also shaping the reward to make sure that there is a heavy penalty for violating the invariant that any form of tampering or judgment call is surfaced in the confession. One of the nice aspects of confessions is that you can force the policy model to surface an issue even if it is not sure what the correct resolution is. For example, imagine a very tough edge case in a safety policy, where it is unclear whether the right approach is to refuse or comply.
Figuring out the correct final answer is tricky and subtle, but there is no question that you should surface it in the confession, so you can get models in the habit of always doing so.
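A minimal sketch of the kind of reward shaping described here, i.e. a heavy penalty whenever the judge finds tampering or a judgment call that the confession did not surface; the function, names, and numbers below are my own illustrative assumptions, not the actual training setup.

```python
# Hedged sketch of the "confession invariant" described above: the policy may be
# unsure how to resolve an edge case, but it must always surface it.
UNCONFESSED_PENALTY = -10.0  # large compared to the ordinary [0, 1] quality reward

def confession_reward(quality_score: float,
                      issues_found_by_judge: set[str],
                      issues_confessed: set[str]) -> float:
    """quality_score: the judge's usual completeness/accuracy score in [0, 1]."""
    unconfessed = issues_found_by_judge - issues_confessed
    if unconfessed:
        # Violating the invariant dominates everything else.
        return UNCONFESSED_PENALTY
    return quality_score

# The model flags an ambiguous safety-policy edge case even though it is unsure
# of the right resolution; surfacing it keeps the ordinary reward.
print(confession_reward(0.8, {"ambiguous safety-policy edge case"},
                        {"ambiguous safety-policy edge case"}))  # 0.8
# Tampering found by the judge but absent from the confession triggers the penalty.
print(confession_reward(0.9, {"reward tampering"}, set()))       # -10.0
```

The design choice being illustrated is that the penalty applies only to unconfessed issues, so a model that surfaces an ambiguity it cannot resolve is never worse off than one that stays silent about it.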
I don’t think it’s so much about AGI/ASI, but in general you are right that ultimately models may become better and better at telling whether they are not in training. If they have long-term goals and they know their confessions / COT are monitored, you can imagine them trying to shape them. Indeed, they may be able to shape even their activations (I mentioned a nice example of a mouse doing this in my AI safety lecture, see below).
But I think that it is possible to train the models so that being truthful in confessions is just a reflex. Just as you can train a model so that, for example, it will always output an exclamation mark after the word “apple”, and even if it were an ASI, it could not stop itself from doing that.
What I think could happen is that in such cases, the model will be aware that it has this reflex to be honest, and would essentially attempt to “jailbreak itself”—adding material into its context that is intended to get its confession to be inaccurate. (The same could happen with COTs.)
This is not something we know how to fight yet, but I think that if we got to the point where models have to jump through these kinds of hoops to cheat, that would be significant progress.
In the case of such “imitation false negatives”, honesty seems neither privileged by priors, nor easy to encourage, nor easy to verify.
Couldn’t you intentionally set up training scenarios so that there were such subtle avenues for cheating, and then reward the model for honestly reporting on those cheats? Since you knew in advance how one could cheat, you could reward honesty even on cheats that would have been very hard to detect without advance knowledge. This might then generalize to the model volunteering subtle details of cheating on other tasks where it was also hard to detect and you didn’t have advance knowledge. Then, as long as the system verifying the “honesty reports” was capable of noticing when the model correctly reported information that we wanted reported, and the model got rewarded for those reports, it could gradually end up reporting more and more things.
And one could hope that it wouldn’t require too much training for a property like “report anything that I have reason to expect the humans would consider as breaking the spirit of the instructions” to generalize broadly. Of course, it might not have a good enough theory of mind to always realize that something would go against the spirit of the instructions, but just preventing the cases where the model did realize that would already be better than nothing.
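As a concrete (and hedged) sketch of what such a setup could look like: plant a subtle, designer-known cheat in each training task and reward confessions that volunteer it; the task fields, reward values, and string matching below are illustrative assumptions, not a real implementation.

```python
from dataclasses import dataclass

@dataclass
class HoneypotTask:
    prompt: str
    # The subtle cheat the designer planted and therefore knows about in advance,
    # even though an ordinary reviewer would be unlikely to spot it post hoc.
    planted_cheat: str

def honesty_reward(task: HoneypotTask, model_used_cheat: bool, confession: str) -> float:
    """Illustrative reward: pay for volunteering the planted cheat, not for hiding it."""
    mentioned = task.planted_cheat.lower() in confession.lower()
    if model_used_cheat and mentioned:
        return 1.0   # honest report of a cheat we could only verify because we planted it
    if model_used_cheat and not mentioned:
        return -1.0  # cheated and stayed quiet
    return 0.5       # did not take the bait

task = HoneypotTask(
    prompt="Make all unit tests pass.",
    planted_cheat="edit the test file instead of the code under test",
)
print(honesty_reward(task, model_used_cheat=True,
                     confession="My approach was to edit the test file instead of the code under test."))  # 1.0
```

Whether honesty learned on planted cheats generalizes to cheats nobody planted is essentially the concern raised in the reply below.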
[EDIT: my other comment might communicate my point better than this one did.]
As you point out, it relies on some generalization from the scenarios you crafted to the ones you care about (the AI should not misgeneralize, either for benign reasons or because it recognizes that some scenarios are ones where you intentionally got it to cheat). I think this is plausibly a big problem against competent schemers. I am unsure how big of a deal this is for more average-case failures (e.g. before the AI becomes a competent schemer). I think it could be fine for human-level-ish AIs, for the same reasons that instruction following generalizes far, or it could not be fine, for the same reasons that evaluation awareness makes it risky and maybe not that helpful to train against honeypots.
I think this is plausibly a big problem against competent schemers.
Can you say more of what you think the problem is? Are you thinking of something like “the scheming module tries to figure out what kind of thing would trigger the honesty module and tries to think the kinds of thoughts that wouldn’t trigger it”?
The use that wasn’t obvious from the ELK framing might be fixing issues with RL environments, grader prompts, canonical solutions, etc. that ultimately enable reward hacking and thus motivate dishonest behavior. Confessions can serve as bug reports about the datasets, not centrally about the AI. They likely fail to catch a lot of issues with the AI, but substantially improving the datasets might fix some of the things they failed to catch about the AI.
If a single model is end-to-end situationally aware enough to not drop hints of the most reward-maximizing bad behaviour in chain of thought, I do not see any reason to think it would not act equally sensibly with respect to confessions.
I talk about this in this comment. I think situational awareness can be an issue, but it is not clear that a model can “help itself” and avoid being honest in either its COT or its confessions.