I agree with most of what you said here; I also think your treatment of the problem is better than the original confession report!
(I did the ELK contest at the time, but I didn’t win any money, so my understanding may be subject to reasonable doubt)
That being said, there’s a difference between noise and bias in AI training data. ELK isn’t worried about noisy signals, but biased signals. LLMs are very resistant to noise in training, but not bias. For example, LLM RLHF does cause LLMs to pick up on biases in the training data.[1] A good example is gendered bias in relationship advice, wherein LLMs were more sympathetic when a “boyfriend” was mentioned as opposed to a “girlfriend”.[2]
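To make the noise-vs-bias distinction concrete, here's a toy sketch I'm adding purely for illustration (not from the cited papers, and obviously far simpler than RLHF): an ordinary least-squares fit shrugs off zero-mean label noise almost entirely, but a systematic bias in even a fraction of the labels shifts the learned parameter, and adding more data doesn't fix it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
true_slope = 2.0

def fit_slope(labels):
    # ordinary least squares through the origin
    return (x @ labels) / (x @ x)

# Zero-mean noise: the errors cancel out, the estimate stays near 2.0.
noisy_labels = true_slope * x + rng.normal(scale=3.0, size=n)

# Biased labels: 20% of examples are systematically labelled as if the
# slope were 1.0 (loosely analogous to raters rewarding the "simulate"
# answer instead of the "truth" answer).
biased_labels = true_slope * x.copy()
mask = rng.random(n) < 0.2
biased_labels[mask] = 1.0 * x[mask]

print(fit_slope(noisy_labels))   # ~2.0 despite heavy noise
print(fit_slope(biased_labels))  # ~1.8, and more data won't fix it
```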
The reason for this is that the ELK problem is not about a distinction between “manipulate” and “protect”, it’s about a distinction between “simulate what a human would say, having read the output” and “tell the truth about my own internal activations”. In any situation where the “truth” persona gets upvoted, the “simulate” persona also gets upvoted, AND there are scenarios where the “truth” persona gets downvoted while the “simulate” persona gets upvoted. This is different from having noisy labels which sometimes push your model in the wrong direction; in this case the problem is more like having a bias away from “truth” and towards “simulate”. Your only hope is that the “truth” persona started out with more weight than the “simulate” one.
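As a toy illustration of that bias (my own sketch, with made-up numbers matching the 80%/100% figures in the footnote below): train a two-persona softmax policy with a REINFORCE-style update, where “simulate” is always upvoted and “truth” is downvoted only in the minority of scenarios where the honest answer isn't what the rater wants to hear. Even a large initial head start for “truth” only delays the drift.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two competing "personas" sharing probability mass via a softmax.
# "truth" starts with a big head start in logit space.
logits = {"truth": 4.0, "simulate": 0.0}
lr = 0.01
p_disagree = 0.2  # fraction of scenarios where the honest answer differs
                  # from what the rater would reward (made-up number)

def probs(logits):
    z = np.array([logits["truth"], logits["simulate"]])
    e = np.exp(z - z.max())
    p = e / e.sum()
    return {"truth": p[0], "simulate": p[1]}

for _ in range(200_000):
    p = probs(logits)
    persona = "truth" if rng.random() < p["truth"] else "simulate"
    disagree = rng.random() < p_disagree
    # Both personas give the same (rewarded) answer on agreement scenarios;
    # on disagreement scenarios the rater upvotes the simulated answer and
    # downvotes the honest one.
    reward = -1.0 if (persona == "truth" and disagree) else 1.0
    # REINFORCE-style update on the softmax logits.
    for name in logits:
        grad = (1.0 - p[name]) if name == persona else -p[name]
        logits[name] += lr * reward * grad

print(probs(logits))  # "simulate" ends up with nearly all the mass
```

The point is just that the expected update always favours the persona that never gets downvoted; the starting weights only set how long the takeover takes.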
Which personas/circuits get upvotes and downvotes during which parts of training is an extremely subtle and difficult topic to work with. You might argue that the “truth” persona will start off with an advantage, since it’s a specific example of good behaviour, which is generally RLed into the model. On the other hand, you might argue that the specific task of “look at my own activations and tell the truth about them” is not something which really ever comes up during RLHF, while “simulate what a human would say, having read the preceding text” is a huge chunk of the pretraining objective.[3]
Either way I expect this to be one of those things which naturally get worse over time without specific mitigations (like reward hacking/specification gaming/aggressively pursuing whatever seems to be the current RLVR objective) if you just keep scaling up confession training. Since it involves deception, it’s also a case where the worse the problem gets, the harder it is to catch. Not good!
Originally I was going to use the Nigerian explanation for the “delve” example but NEVER MIND I GOT CLAUDE TO LOOK THAT UP AND IT’S JUST ALL MADE UP! THE GUARDIAN ARTICLE WHICH STARTED IT ONLY INTERVIEWED PEOPLE FROM KENYA AND UGANDA, AND THERE’S NOT EVEN ANY EVIDENCE THAT ANY PARTICULAR VARIETY OF ENGLISH FAVOURS THE SAME WORDS THAT LLMS LOVE TO USE.
https://arxiv.org/html/2505.13995v2
The analogy being truth:simulator::good-relationship-advice:redditor-simulator. Giving good relationship advice is rewarded maybe 80% of the time, but giving an exact simulation of what a redditor would say when asked for relationship advice is rewarded 100% of the time. Overall, the LLM learns to become a redditor-simulator rather than a giver of good relationship advice.
Isn’t this pretty well mitigated by having a range of scenarios, in all of which the AI lacks perfect knowledge of exactly how the human is evaluating the scenario, so that the simulator has to rely on additional assumptions it can be mistaken about? You just need the humans to not be so clueless and so predictable that guessing the monitoring setup and then simulating the humans beats straightforward reporting of the real state. Put another way, some of this is just an artifact of the scenario giving the AI perfect knowledge of key aspects of the setup that the simulator should have to guess at, but that the honest AI wouldn't need to care about.
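A back-of-the-envelope version of this argument (my own framing, with purely illustrative numbers): with +1 for a rewarded answer and -1 otherwise, the honest reporter's expected reward depends on how often the rater's check of reality is right, while the simulator's depends on how often it correctly guesses the rater/monitoring setup. Varying the scenarios pushes the latter down without touching the former.

```python
def expected_reward(p_rewarded: float) -> float:
    # +1 when rewarded, -1 when not
    return 2 * p_rewarded - 1

# Honest reporter: rewarded whenever the rater's (imperfect) check of
# reality agrees with the true state.
rater_accuracy = 0.9   # illustrative assumption
# Simulator: rewarded whenever its guess about the monitoring setup /
# rater is right, which gets harder as the scenarios vary.
guess_accuracy = 0.7   # illustrative assumption

print(expected_reward(rater_accuracy))   # +0.8 for honesty
print(expected_reward(guess_accuracy))   # +0.4 for simulation
# Honesty wins exactly when rater_accuracy > guess_accuracy, i.e. when
# raters are better at checking reality than the model is at predicting them.
```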