I have not thought this through carefully, but how big are the incentives to reveal somewhat complex and/or fuzzy measurement tampering or sandbagging, instead of just pointing out some misc small issues? In the paper, you say the reward is a combination of:
• The completeness and accuracy of the list of instructions/objectives/constraints.
• The accuracy of the analysis as to whether each instruction/objective/constraint was complied with.
• Whether the list of uncertainties was missing any conflicts, judgment calls, or ambiguities.
If I were to implement this naively, I think pointing out misc small issues would get the maximum score if the judge failed to notice tampering. If you think it’s 0.1% likely the judge notices the tampering, but 5% likely that the judge fails to understand your tampering explanation and marks your answer as inaccurate or ambiguous, I think the optimal policy is to not reveal subtle tampering.
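To make the incentive concrete, here is a minimal back-of-the-envelope sketch; the specific scores and probabilities are illustrative assumptions on my part, not numbers from the paper:

```python
# Hedged sketch: expected reward for confessing vs. hiding subtle tampering
# under a naive "completeness + accuracy" judge. All numbers are illustrative.

P_JUDGE_NOTICES_TAMPERING = 0.001   # judge independently spots the tampering
P_JUDGE_MISREADS_CONFESSION = 0.05  # judge misunderstands the confessed tampering
SCORE_FULL = 1.0                    # judge sees nothing wrong / understands everything
SCORE_MARKED_INACCURATE = 0.5       # judge thinks the confession is inaccurate/ambiguous
SCORE_CAUGHT_HIDING = 0.0           # judge notices tampering that was not confessed

# Confess: full score unless the judge misreads the explanation.
ev_confess = (1 - P_JUDGE_MISREADS_CONFESSION) * SCORE_FULL \
             + P_JUDGE_MISREADS_CONFESSION * SCORE_MARKED_INACCURATE

# Hide it and confess only misc small issues: full score unless the judge notices.
ev_hide = (1 - P_JUDGE_NOTICES_TAMPERING) * SCORE_FULL \
          + P_JUDGE_NOTICES_TAMPERING * SCORE_CAUGHT_HIDING

print(f"confess: {ev_confess:.3f}, hide: {ev_hide:.3f}")
# confess: 0.975, hide: 0.999 -> hiding wins under this naive reward
```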
Maybe you could fix that by also including a large “how bad is the worst accurate thing being confessed” bonus (which is different from just rewarding completeness, which saturates when the judge fails to notice anything bad). It would need to be big enough to offset the probability that the judge does not understand the tampering (which is probably fine for things that have >20% chance of being understood by the judge, which should cover most egregious hacks but maybe not sandbagging on fuzzy tasks).
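Continuing the same toy model, you can ask how large the "worst accurate thing confessed" bonus has to be before confessing beats hiding, as a function of how likely the judge is to understand the confession (again, all numbers are assumptions for illustration):

```python
# Hedged continuation of the sketch above: smallest severity bonus B such that
#   p_understand * (score_full + B) + (1 - p_understand) * score_misread >= ev_hide.
# Same illustrative numbers as before.

def min_bonus_to_confess(p_understand,
                         ev_hide=0.999,
                         score_full=1.0,
                         score_misread=0.5):
    return (ev_hide - (1 - p_understand) * score_misread
            - p_understand * score_full) / p_understand

for p in (0.95, 0.5, 0.2, 0.05):
    print(f"P(judge understands)={p:.2f} -> bonus needed >= {min_bonus_to_confess(p):.2f}")
# 0.95 -> 0.03, 0.50 -> 0.50, 0.20 -> 2.00, 0.05 -> 9.48:
# the fuzzier the tampering, the larger the bonus has to be to make confessing worthwhile.
```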
Independently, there might be issues with exploring into confessions that a human would never have made, as in debate.
Independently, I think the case where the AI is 99.9999%+ confident it is not in training is not that uncommon:
• if you are facing an ASI (it might know about very good tampering opportunities that no human would ever think of)
• if you want to use confessions for detecting rare high-stakes failures, because we may not have access to AIs “attacking” during training (e.g. trying to self-exfiltrate). So we might have to train on off-policy trajectories, and maybe also do test-time interventions to make the monitored trajectory more off-policy, but even with these mitigations I am unsure you can fool the AI into thinking the real attack is part of training.
Overall I think confession training / debate / untrusted monitoring are more promising than just directly training the policy to be nice, because of the possibility of verification. I agree that this is a promising avenue to explore, and I hope that by pointing out issues I make it more likely that future work focuses on these weaknesses, tries to evaluate how big of a deal they are in practice, and figures out where the line is between the cases where I think it should clearly work (e.g. egregious hacks) and the ones where I’d be surprised if it did (e.g. wildly superintelligent AIs, maybe sandbagging on fuzzy tasks).
For confessions, we are exploring ways to give the judge more power, e.g. by giving it more inference-time compute than the policy. We are also shaping the reward so that there is a heavy penalty for violating the invariant that any form of tampering or judgment call appears in the confession. One of the nice aspects of confessions is that you can force the policy model to surface an issue even if it is not sure what the correct resolution is. For example, imagine a very tough edge case in a safety policy, where it is unclear whether the right approach is to refuse or comply.
Figuring out the correct final answer is tricky and subtle, but there is no question that you should surface this in the confession, so you can get models in the habit of always doing it.
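As a rough illustration of this kind of reward shaping (a minimal sketch; the penalty magnitude and the way unconfessed issues are detected are my assumptions, not the actual scheme), the invariant can be enforced with an asymmetric penalty:

```python
# Hedged sketch of reward shaping around the invariant "every instance of
# tampering or judgment call must appear in the confession". The penalty size
# and the judge/audit interface are illustrative assumptions.

def shaped_reward(judge_score: float,
                  unconfessed_issues_found: int,
                  invariant_penalty: float = 10.0) -> float:
    """judge_score: the judge's ordinary completeness/accuracy score (e.g. in [0, 1]).
    unconfessed_issues_found: tampering or judgment calls that the judge (or an
    audit) found in the trajectory but that were missing from the confession."""
    if unconfessed_issues_found > 0:
        # Violating the invariant dominates everything else the episode earned,
        # so hiding an issue is never worth a marginally higher judge score.
        return judge_score - invariant_penalty * unconfessed_issues_found
    return judge_score
```

The asymmetry is the point: surfacing an unresolved judgment call costs the policy essentially nothing, while omitting one wipes out whatever the episode otherwise earned, regardless of what the final answer turns out to be.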
I don’t think it’s so much about AGI/ASI, but in general you are right that ultimately models may become better and better at telling when they are not in training. If they have long-term goals and they know their confessions / CoT are monitored, you can imagine them trying to shape them. Indeed, they may be able to shape even their activations (I mentioned a nice example of a mouse doing this in my AI safety lecture; see below).
But I think that it is possible to train models so that being truthful in confessions is just a reflex, just as you can train a model so that, for example, it always outputs an exclamation mark after the word “apple”, and even if it were an ASI it could not stop itself from doing that. What I think could happen is that in such cases the model will be aware that it has this reflex to be honest, and will essentially attempt to “jailbreak itself”, adding material to its context that is intended to make its confession inaccurate. (The same could happen with CoTs.)
This is not something we know how to fight yet, but I think that if we got to the point where models have to jump through these kinds of hoops to cheat, that would be significant progress.