At the end of the day this is simply not a serious alignment proposal put forward by people who are seriously thinking about alignment. This entire approach is (mostly) a rediscovery of the starting point for the ELK contest from four years ago; the authors have not even considered the very basic problems with this approach, problems which Christiano et al. pointed out at the time, because that was the point of the contest!
I haven’t read the full ELK report, just Scott Alexander’s discussion of it, so I may be missing something important. But at least based on that discussion, it looks to me like ELK might be operating off premises that don’t seem clearly true for LLMs.
Scott writes:
Suppose the simulated thief has hit upon the strategy of taping a photo of the diamond to the front of the camera lens.
At the end of the training session, the simulated thief escapes with the diamond. The human observer sees the camera image of the safe diamond and gives the strategy a “good” rating. The AI gradient descends in the direction of helping thieves tape photos to cameras.
It’s important not to think of this as the thief “defeating” or “fooling” the AI. The AI could be fully superintelligent, able to outfox the thief trivially or destroy him with a thought, and that wouldn’t change the situation at all. The problem is that the AI was never a thief-stopping machine. It was always a reward-getting machine, and it turns out the AI can get more reward by cooperating with the thief than by thwarting him.
So the interesting scientific point here isn’t “you can fool a camera by taping a photo to it”. The interesting point is “we thought we were training an AI to do one thing, but actually we had no idea what was going on, and we were training it to do something else”.
In fact, maybe the thief never tries this, and the AI comes up with this plan itself! In the process of randomly manipulating traps and doodads, it might hit on the policy of manipulating the images it sends through the camera. If it manipulates the image to look like the diamond is still there (even when it isn’t), that will always get good feedback, and the AI will be incentivized to double down on that strategy.
Much like in the GPT-3 example, if the training simulations include examples of thieves fooling human observers which are marked as “good”, the AI will definitely learn the goal “try to convince humans that the diamond is safe”. If the training simulations are perfect and everyone is very careful, it will just maybe learn this goal—a million cases of the diamond being safe and humans saying this is good fail to distinguish between “good means the diamond is safe” and “good means humans think the diamond is safe”. The machine will make its decision for inscrutable AI reasons, or just flip a coin. So, again, are you feeling lucky?
It seems to me that this assumes our training creates the AI’s policy essentially from scratch: the AI does a lot of different things, some of which are what we want and some of which aren’t, and unless we are very careful to reward only the things we want and none of the ones we don’t, it will end up doing things we don’t want.
I don’t know how future superintelligent AI systems will work, but if LLM training were like this, LLMs would work horrendously worse than they do. People paid to rate AI answers report working with “incomplete instructions, minimal training and unrealistic time limits to complete tasks” and say things like “[a]fter having seen how bad the data is that goes into supposedly training the model, I knew there was absolutely no way it could ever be trained correctly like that”. Yet for some reason LLMs still do quite well on lots of tasks. And even if all raters worked under perfect conditions, they’d still be fallible humans.
It seems to me that LLMs are probably reasonably robust to noisy reward signals because a large part of what the training does is “upvoting” and tuning existing capabilities and simulated personas rather than creating them entirely from scratch. A base model trained to predict the world creates different kinds of simulated personas whose behavior would explain the data it sees; these include personas like “a human genuinely trying to do its best at task X”, “a deceitful human”, or “an honest human”.
Scott writes:
In the process of randomly manipulating traps and doodads, it might hit on the policy of manipulating the images it sends through the camera. If it manipulates the image to look like the diamond is still there (even when it isn’t), that will always get good feedback, and the AI will be incentivized to double down on that strategy.
This might happen. But it might also be that the AI contains both a “genuinely protect the diamond” persona and a “manipulate the humans into believing that the diamond is safe” persona, and that the various reward signals upvote these to different degrees. Such a random process of manipulation might indeed end up upvoting the “manipulate the humans” persona… yet if the “genuinely protect the diamond” persona has been sufficiently upvoted by other signals, it still ends up being the dominant one. Then it doesn’t matter that there is some noise upvoting the “manipulate the humans” persona, as long as the “genuinely protect the diamond” persona gets more upvotes overall. And if the “genuinely protect the diamond” persona had been sufficiently upvoted from the start, the “manipulate the humans” one might end up with such a low prior probability that it would effectively never become active.
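To make that picture concrete, here is a deliberately crude toy model; the two fixed personas, the softmax over two logits, the noise rate, and the update rule are all assumptions of mine for illustration, not a claim about how RLHF actually apportions credit:

```python
import math, random

random.seed(0)

def softmax2(a, b):
    ea, eb = math.exp(a), math.exp(b)
    return ea / (ea + eb)

def train(protect_logit, manipulate_logit, label_noise=0.2, lr=0.05, episodes=5000):
    for _ in range(episodes):
        p_protect = softmax2(protect_logit, manipulate_logit)
        active_is_protect = random.random() < p_protect
        # Correct label: +1 for genuinely protecting the diamond, -1 for
        # manipulating the observer; with probability `label_noise` the
        # rater gets it wrong and the sign flips.
        reward = 1.0 if active_is_protect else -1.0
        if random.random() < label_noise:
            reward = -reward
        # "Upvote" (or downvote) whichever persona was active this episode.
        if active_is_protect:
            protect_logit += lr * reward
        else:
            manipulate_logit += lr * reward
    return softmax2(protect_logit, manipulate_logit)

# With a head start, noisy labels don't dislodge the honest persona here...
print(train(protect_logit=2.0, manipulate_logit=0.0))   # stays ~1.0
# ...and even from a tie it wins, because the noise is symmetric and washes out.
print(train(protect_logit=0.0, manipulate_logit=0.0))   # drifts toward ~1.0
```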
Now of course none of this is a rigorous proof that things would work, and with our current approaches we still see a lot of reward hacking and so on. But it seems to me a reasonable possibility that an “honestly report everything that I’ve done” persona is waiting inside most models, such that one could upvote it in a variety of scenarios, after which it would get widely linked to the rest of the model’s internals so as to always detect whether some kind of deception was going on. And once that had happened, it wouldn’t matter if some of the reward signals around honesty were noisy, because the established structure would be sufficiently robust and general against the noise.
I agree with most of what you said here; I also think your treatment of the problem is better than the original confession report!
(I did the ELK contest at the time, but I didn’t win any money, so my understanding may be subject to reasonable doubt)
That being said, there’s a difference between noise and bias in AI training data. ELK isn’t worried about noisy signals but about biased ones. LLMs are very resistant to noise in training, but not to bias. For example, LLM RLHF does cause LLMs to pick up on biases in the training data.[1] A good illustration is gendered bias in relationship advice, where LLMs were more sympathetic when a “boyfriend” was mentioned than when a “girlfriend” was.[2]
The reason for this is that the ELK problem is not about a distinction between “manipulate” and “protect”; it’s about a distinction between “simulate what a human would say, having read the output” and “tell the truth about my own internal activations”. In any situation where the “truth” persona gets upvoted, the “simulate” persona also gets upvoted, AND there are scenarios where the “truth” persona gets downvoted while the “simulate” persona gets upvoted. This is different from having noisy labels which sometimes push your model in the wrong direction; here the problem is more like a bias away from “truth” and towards “simulate”. Your only hope is that the “truth” persona started out with more weight than the “simulate” one.
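Here’s a minimal sketch of that asymmetry; the shared-credit update rule and the 10% “rater is fooled” rate are toy assumptions of mine, not a description of how gradients actually move weights in an LLM:

```python
import random

random.seed(0)

P_HUMAN_FOOLED = 0.1  # fraction of episodes where the rater's judgment is wrong
LR = 0.05

def run(truth_logit, simulate_logit, episodes=5000):
    for _ in range(episodes):
        if random.random() < P_HUMAN_FOOLED:
            # Rater is fooled (e.g. tampered camera): the truthful answer gets
            # penalised while the rater-simulating answer still gets rewarded.
            truth_logit -= LR
            simulate_logit += LR
        else:
            # Rater's judgment matches reality: the truthful answer and the
            # "what would the rater say" answer coincide, so both get upvoted.
            truth_logit += LR
            simulate_logit += LR
    return truth_logit, simulate_logit

print(run(truth_logit=2.0, simulate_logit=0.0))
# Expected gap change per episode is -2 * LR * P_HUMAN_FOOLED, i.e. always
# toward "simulate": symmetric noise would wash out, but this bias pushes the
# same way every time, so a head start for "truth" only delays the crossover.
```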
Which personas/circuits get upvotes and downvotes during which parts of training is an extremely subtle and difficult topic to work with. You might argue that the “truth” persona will start off with an advantage, since it’s a specific example of good behaviour, which is generally RLed into the model. On the other hand, you might argue that the specific task of “look at my own activations and tell the truth about them” is not something which really ever comes up during RLHF, while “simulate what a human would say, having read the preceding text” is a huge chunk of the pretraining objective.[3]
Either way I expect this to be one of those things (like reward hacking, specification gaming, or aggressively pursuing whatever seems to be the current RLVR objective) which naturally get worse over time without specific mitigations if you just keep scaling up confession training. Since it involves deception, it’s also a case where the worse the problem gets, the harder it is to catch. Not good!
[1] Originally I was going to use the Nigerian explanation for the “delve” example, but NEVER MIND, I GOT CLAUDE TO LOOK THAT UP AND IT’S JUST ALL MADE UP! THE GUARDIAN ARTICLE WHICH STARTED IT ONLY INTERVIEWED PEOPLE FROM KENYA AND UGANDA, AND THERE’S NOT EVEN ANY EVIDENCE THAT ANY PARTICULAR VARIETY OF ENGLISH FAVOURS THE SAME WORDS THAT LLMS LOVE TO USE.
[2] https://arxiv.org/html/2505.13995v2
[3] The analogy being truth:simulator::good-relationship-advice:redditor-simulator. Giving good relationship advice is rewarded maybe 80% of the time, but giving an exact simulation of what a redditor would say about relationship advice is rewarded 100% of the time. Overall, the LLM learns to become a redditor-simulator rather than a good relationship advice giver.
Isn’t this pretty well mitigated by having a range of scenarios, in all of which the AI lacks perfect knowledge of exactly how the human is evaluating the scenario, so that the simulator has additional assumptions it can be mistaken about? You just need the humans not to be so clueless and so predictable that guessing the monitoring setup and then simulating the humans is better than straightforward reporting of the real state. To put it another way, some of this is just an artifact of the scenario being posed with the AI having perfect knowledge of key aspects of the setup, aspects the simulator would have to guess at but the honest AI wouldn’t care about.
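As a toy illustration of what I mean, here’s a small Monte Carlo sketch; the “open the vault vs. trust the camera” split, the tampering rate, and the simulator’s mistaken model of the evaluator are all made-up assumptions:

```python
import random

random.seed(0)

P_STOLEN = 0.5
P_TAMPER_IF_STOLEN = 0.9       # assumed rate of successful camera tampering
P_EVALUATOR_OPENS_VAULT = 0.7  # evaluators are thorough and vary their checks

def episode():
    stolen = random.random() < P_STOLEN
    tampered = stolen and random.random() < P_TAMPER_IF_STOLEN
    camera_says_safe = (not stolen) or tampered

    # The evaluator either opens the vault (learning the truth) or trusts the
    # camera, then rewards the AI iff its report matches their own conclusion.
    opens_vault = random.random() < P_EVALUATOR_OPENS_VAULT
    evaluator_thinks_safe = (not stolen) if opens_vault else camera_says_safe

    honest_report = not stolen            # report the real state
    simulator_report = camera_says_safe   # mistaken model: "they'll just trust the camera"

    return (honest_report == evaluator_thinks_safe,
            simulator_report == evaluator_thinks_safe)

N = 200_000
results = [episode() for _ in range(N)]
print("honest reward rate:   ", sum(h for h, _ in results) / N)  # ~0.865 in expectation
print("simulator reward rate:", sum(s for _, s in results) / N)  # ~0.685 in expectation
```

As soon as the evaluators are predictable or lazy enough that the simulator’s model of them is usually right, the ordering flips, which I take to be the ELK worst case.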
ELK was framed around looking for worst-case solutions. IMO it’s also good to look for approaches which are useful on the margin, or might be useful at controlling or aligning roughly human-level AIs (which can then hopefully be leveraged to do more work).
I think if human-level AIs were going to be capable of making great strides in scalable alignment work, we would have seen more progress from human-level humans. The fact that a large chunk of the field has converged on strategies like “Get another person to do the work” (e.g. fieldbuilding work, organizing mentorships), “Get an AI to do the work” (e.g. AI control, superalignment), or “Stop or slow the building of AGI and/or make the builders of it more responsible” (e.g. policy work) is a very bad sign.
The total progress being made on the real meat of alignment is very low compared to the progress being made in capabilities. I don’t see why we should expect this, or the distribution of resources, to suddenly flip in favour of alignment in the middle of the singularity, once human-level AIs have been developed and everything is a thousand times more stressful and the race dynamics are a thousand times worse.