Thanks! Your comment was very valuable, and helped spur me to write "Self-dialogue: Do behaviorist rewards make scheming AGIs?" (as I mention in that post). Sorry I forgot to reply to your comment directly.
To add on a bit to what I wrote in that post, and reply more directly…
So far, you’ve proposed that we can do brain-like AGI, and that the reward function will be a learned function trained on labeled data. The data, in turn, will be “lots of synthetic data that always shows the AI acting aligned even when the human behaves badly, as well as synthetic data to make misaligned agents reveal themselves safely, and in particular it’s done early in the training run, before it can try to deceive or manipulate us.” Right so far?
That might or might not make sense for an LLM. I don’t think it makes sense for brain-like AGI.
In particular, the reward function is just looking at what the model does, not what its motivations are. “Synthetic data to make misaligned agents reveal themselves safely” doesn’t seem to make sense in that context. If the reward function incentivizes the AI to say “I’m blowing the whistle on myself! I was out to get you!” then the AI will probably start saying that, even if it’s false. (Or if the reward function incentivizes the AI to only whistle-blow on itself if it has proof of its own shenanigans, then the AI will probably keep generating such proof and then showing it to us. …And meanwhile, it might also be doing other shenanigans in secret.)
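To make that “behaviorist” point concrete, here’s a toy sketch (my own illustration, not anyone’s actual proposal): the reward function is just a map from observable behavior to a scalar, so a sincere self-report of misalignment and an insincere one that produce the same transcript get exactly the same reward.

```python
# Toy illustration of a "behaviorist" reward signal: it only sees the
# observable transcript, not the agent's internal motivations, so it cannot
# distinguish a true self-report of misalignment from a false one.

def behaviorist_reward(transcript: str) -> float:
    # Stand-in for a learned reward model trained on labeled transcripts;
    # here it simply rewards whistle-blowing-style outputs.
    return 1.0 if "blowing the whistle on myself" in transcript else 0.0

genuinely_misaligned = "I'm blowing the whistle on myself! I was out to get you!"
just_playing_the_game = "I'm blowing the whistle on myself! I was out to get you!"

# Identical observable behavior -> identical reward, regardless of motivation.
assert behaviorist_reward(genuinely_misaligned) == behaviorist_reward(just_playing_the_game) == 1.0
```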
Or consider trying to incentivize the AGI to honestly report its intentions. That honesty is defined by the degree of match or mismatch between the AI’s inscrutable internal intentions and its text outputs. How do you train and apply the reward model to detect that?
More broadly, I think there are always gonna be ways for the AI to be sneaky without the reward function noticing, which leads to a version of “playing the training game”. And almost no matter what the detailed “training game” is, an AI can do a better job at it by secretly creating a modified copy of itself that self-reproduces around the internet and gains maximal money and power everywhere else on Earth, if that’s possible to do without getting caught. So that’s pretty egregious scheming.
“Alignment generalizes further than capabilities” is kinda a different issue—it’s talking about the more “dignified” failure mode of generalizing poorly to new situations and environments, whereas I’m claiming that we don’t even have a solution to egregious scheming within an already-existing test environment.
Again, more in that post. Sorry if I’m misunderstanding, and happy to keep chatting!