I precisely describe this aspect of their research at the very beginning.
I disagree with your summary of their argument. There are two questions here: why does baseline Claude blackmail, and how can it be fixed? The positive hyperstition result acts as evidence about what causes it by showing that the pre-training prior does have a meaningful effect on the model’s behaviour, even after standard post-training, and is a plausible mechanism.
Seeing that adding in specific post-training data is enough to overpower the pre-training prior is unsurprising, and valuable evidence about how to fix it, but not much evidence about the cause in baseline post-training. It can simultaneously be true that post-training effects can be larger than hyperstition, but that in baseline Claude hyperstition is present while significant post-training effects are not, so hyperstition is the cause but not the correct fix.
More generally, I was being snarky because I think it’s unreasonable to accuse a piece of not providing evidence for one of its claims when it has a section about evidence for that claim, which I do think provides useful (though not conclusive) evidence. I think you just disagree with the evidence.
Nevertheless, hyperstition does not appear in any classical theory of alignment and marks a departure from classical alignment research. It’s also all too convenient to be used by an AI lab and you should be skeptical about the motivations. Crucially, I believe hyperstition isn’t particularly relevant to superalignment, and trying to prevent it by naive means would most likely backfire. Finally, hoping the model will stay in an aligned persona seems like a bad alignment approach.
This section read to me as presenting the departure from classical alignment theory with a negative valence, as a critique, which is why I was pushing back. I do agree with it as a factual statement. If you intended it as a purely factual observation, then I think we agree.
They make a bit of a mock-up of “hyperstitioning misaligned AI” in reverse: I didn’t mention the part where they also do some mock-up of post-training after training it on a bunch of concentrated positive stories.
Here are just two countertheories that also agree with that data:
It previously acted based on the Chekhov’s guns and based on the many human-human blackmail stories it had read, filling in itself as the disgruntled human employee. Directly training on a ton of positive stories just overwhelmed that.
Its RL post-training actually gave it some weak power-seeking drive, and its world model kind of told it that the blackmail path would lead to power. Again, directly training on a ton of positive stories just overwhelmed that.
I think in some sense it does provide some evidence for their claim (it is a theory compatible with the data), but not in a sense that would be used for scientific communication, and not enough evidence that it should make them believe this claim is true (as they write in their tweet).