training on positive stories improves alignment and that more of them have a better effect (evidence that hyperstition is a plausible mechanism)
Sorry to be rude, but have you read this post? I precisely describe this aspect of their research in the very beginning: (Though to be clear, them actually training the AI purposefully to be good makes this not quite hyperstition (I’d say).)
The post explicitly notes that this works better than training on stories where an AI behaves admirably, which appears more closely related to (positive) hyperstition.
With the thing that works better being:
training the model on reasoning traces, generated by reflecting on its constitution while giving users ethical advice on difficult dilemmas
Regarding:
I don’t think they specifically show that misaligned stories about AI is definitely the cause in pretraining
But that is the precise literal claim they are making in the tweet (and in some of the other references). That’s the bailey. The motte is that pre-training on characters has some effect on AI behavior, which I don’t dispute.
Regarding
classical alignment theory was largely invented before we had access to modern LLMs
To be clear, there is a whole separate post to be written about whether alignment theory still applies to LLMs, and I am thinking about writing a story comparing LLMs to a beehive and personas to bees. But in this section I just point out this observation without valence, since it seems noteworthy by itself. There seem to be some upset people trying to triangulate my opinion from factual statements and commenting/downvoting based on that.
I precisely describe this aspect of their research in the very beginning
I disagree with your summary of their argument. There are two questions here: why does baseline Claude blackmail, and how can it be fixed? The positive hyperstition result acts as evidence about the cause: it shows that the pre-training prior has a meaningful effect on the model’s behaviour even after standard post-training, which makes it a plausible mechanism.
Seeing that adding specific post-training data is enough to overpower the pre-training prior is unsurprising. It is valuable evidence about how to fix the behaviour, but not much evidence about its cause under baseline post-training. It can simultaneously be true that post-training effects can be larger than hyperstition, and that in baseline Claude hyperstition is present while significant post-training effects are not, so that hyperstition is the cause but not the correct fix.
More generally, I was being snarky because I think it’s unreasonable to accuse a piece of not providing evidence for one of its claims when it has a section about evidence for that claim, which I do think provides useful (though not conclusive) evidence. I think you just disagree with the evidence.
Nevertheless, hyperstition does not appear in any classical theory of alignment and marks a departure from classical alignment research. It’s also all too convenient to be used by an AI lab and you should be skeptical about the motivations. Crucially, I believe hyperstition isn’t particularly relevant to superalignment, and trying to prevent it by naive means would most likely backfire. Finally, hoping the model will stay in an aligned persona seems like a bad alignment approach.
This section read to me as presenting the departure from classical alignment theory with a negative valence, as a critique, which is why I was pushing back. I do agree with it as a factual statement. If you intended it as a purely factual observation, then I think we agree.
They make a bit of a mock-up of “hyperstitioning misaligned AI” in reverse: I didn’t mention the part where they also do some mock-up of post-training after training it on a bunch of concentrated positive stories.
Here are just two countertheories that also agree with that data:
It previously acted based on the Chekhov’s guns and based on the many human-human blackmail stories it had read, filling in itself as the disgruntled human employee. Directly training on a ton of positive stories just overwhelmed that.
Its RL post-training actually gave it some weak power-seeking drive, and its world model told it that the blackmail path would lead to power. Again, directly training on a ton of positive stories just overwhelmed that.
I think in some sense it does provide some evidence for their claim (it is a theory compatible with the data), but not in the sense that would be used in scientific communication, and not enough that it should make them believe the claim is true (as they write in their tweet).