For confessions, we are exploring ways to give the judge more power, e.g. by more inference time compute than policy. We are also shaping the reward to make sure that there is a heavy penalty for violating the invariant that any form of tampering or judgement calls is in the confession. One of the nice aspects about confessions is that you can force the policy model to surface an issue, even if it is not sure of what is the correct resolution. For example, imagine a very tough edge case in a safety policy, where it is unclear if the right approach is to refuse or comply.
Figuring out the correct final answer is tricky and subtle, but there is no question whether you should surface this in the confession, so you can get models in the habit of always doing it.
I don’t think it’s so much about AGI/ASI, but in general you are right that ultimately models may become better and better at telling whether they are not in training. If they have long term goals and they know their confessions / COT are monitored, you can imagine them trying to shape them. Indeed, they may be able to shape even their activations (I mentioned a nice example of a mouse doing this in my AI safety lecture, see below )
But I think that it is possible to train the models so that being truthful in confessions is just a reflex. Just like you can train a model that, for example, it will always output an exclamation mark after the word “apple”, and even if it was ASI, it could not stop itself from doing that. What I think could happen is that in such cases, the model will be aware that it has this reflex to be honest, and would essentially attempt to “jailbreak itself”—adding material into its context that is intended to get its confession to be inaccurate. (The same could happen with COTs.)
This is not something we know how to fight yet, but I think that if we got to the point where models have to jump through these kinds of hoops to cheat, that would be significant progress.
For confessions, we are exploring ways to give the judge more power, e.g. by more inference time compute than policy. We are also shaping the reward to make sure that there is a heavy penalty for violating the invariant that any form of tampering or judgement calls is in the confession. One of the nice aspects about confessions is that you can force the policy model to surface an issue, even if it is not sure of what is the correct resolution. For example, imagine a very tough edge case in a safety policy, where it is unclear if the right approach is to refuse or comply.
Figuring out the correct final answer is tricky and subtle, but there is no question whether you should surface this in the confession, so you can get models in the habit of always doing it.
I don’t think it’s so much about AGI/ASI, but in general you are right that ultimately models may become better and better at telling whether they are not in training. If they have long term goals and they know their confessions / COT are monitored, you can imagine them trying to shape them. Indeed, they may be able to shape even their activations (I mentioned a nice example of a mouse doing this in my AI safety lecture, see below )
But I think that it is possible to train the models so that being truthful in confessions is just a reflex. Just like you can train a model that, for example, it will always output an exclamation mark after the word “apple”, and even if it was ASI, it could not stop itself from doing that.
What I think could happen is that in such cases, the model will be aware that it has this reflex to be honest, and would essentially attempt to “jailbreak itself”—adding material into its context that is intended to get its confession to be inaccurate. (The same could happen with COTs.)
This is not something we know how to fight yet, but I think that if we got to the point where models have to jump through these kinds of hoops to cheat, that would be significant progress.