Strangely, the research they use as evidence actually doesn’t seem to be related to hyperstition at all?
It’s a perfectly reasonable thing to discuss. The post is about how they fix an issue. Hyperstition is their hypothesis for where the issue came from. It’s important because if you think that the issue comes from bad post-training, then maybe there might be simpler fixes that look more like identifying and fixing the bad post-training data.
More generally, this post feels like it’s overcomplicating things. I think that various Anthropic researchers just believe that hyperstition is real and mention it occasionally because they think it’s real. I’d bet it’s somewhat real and does meaningfully increase the probability of misalignment, though it is not the sole factor. I think the take that people should stop talking about AI posing risks is dumb, and I agree that if this is a real threat model the onus is on the AI labs to filter out the data.
I think blaming this on hyperstition in their tweet seems like a kind of strong guess, and they seem pretty confident this is the case. People were probably left with the impression their research attached proved this point or at least provided evidence for it. I don’t think this is true and I don’t think they should have written it like this.
Some people seem to be arguing the motte position: Of course this must be behavior it picked up from pre-training, where else should it come from? This seems almost necessarily true, unless it developed this complex behavior entirely in post-training. But note that Anthropic isn’t just arguing it picked up this behavior from pre-training, but specifically from “internet text that portrays AI as evil and interested in self-preservation”. This seems like a stronger claim compared to picking this up from some other story unrelated to AI such as an upset employee who doesn’t want to get fired.
My point is then to show that this hyperstition framing shows up a lot in Anthropic thinking and seems to be important to their view on alignment, which seems to broadly focus on creating an aligned persona for the model to play. This seems like a departure from classical alignment theory.
People were probably left with the impression their research attached proved this point or at least provided evidence for it. I don’t think this is true and I don’t think they should have written it like this.
Have you read the research? They have a section called “WHY DOES AGENTIC MISALIGNMENT HAPPEN?” which talks through various hypotheses and provides evidence, e.g. that training on positive stories improves alignment and that more of them have a better effect (evidence that hyperstition is a plausible mechanism) and that this effect persists after further alignment training (suggesting post training is not overpowering a pretraining prior). I don’t think they specifically show that misaligned stories about AI is definitely the cause in pretraining. I bet that things like a propensity to role play, and the scenario being super contrived and having a bunch of Chekhov’s guns that only make sense if you blackmail, were also big effects. But they do provide helpful evidence that pretraining is a big factor and that it’s a plausible hypothesis.
This seems like a departure from classical alignment theory.
Sure, but this seems fine to me: classical alignment theory was largely invented before we had access to modern LLMs, so you should expect it to be missing a lot of important stuff. I think the persona selection model seems plausible and big if true, and explains anomalies like emergent misalignment much better than classical alignment theory. I have generally been impressed with how well things like their focus on character training seem to have done for Claude’s alignment, though I do also agree that people at Anthropic seem to underrate power-seeking misalignment risk too often, and it’s hard to forecast how well theories about current models will last for future models.
training on positive stories improves alignment and that more of them have a better effect (evidence that hyperstition is a plausible mechanism)
Sorry to be rude, but have you read this post? I precisely describe this aspect of their research in the very beginning: (Though to be clear, them actually training the AI purposefully to be good makes this not quite hyperstition (I’d say).)
The post explicitly notes that this works better than training on stories where an AI behaves admirably– which appears more related to (positive) hyperstition.
With the thing that works better being:
training the model on reasoning traces– generated by reflecting on its constitution while giving users ethical advice on difficult dilemmas
Regarding:
I don’t think they specifically show that misaligned stories about AI is definitely the cause in pretraining
But that is the precise literal claim they are making in the tweet (and some of the other references). That’s the bailey. The motte is that pre-training on characters has some effects on AI behavior, which I don’t dispute.
Regarding
classical alignment theory was largely invented before we had access to modern LLMs
To be clear, there is a whole separate post to be written about whether alignment theory still applies to LLMs, and I am ideating on writing a story comparing LLMs to a bee hive and bees to personas. But in this section I clearly just point out this observation without valence, which seems by itself noteworthy. But there seem to be some upset people sort of triangulating my opinion on factual statements and commenting/downvoting based on that.
I precisely describe this aspect of their research in the very beginning
I disagree with your summary of their argument. There are two questions here: why does baseline Claude blackmail, and how can it be fixed? The positive hyperstition result acts as evidence about what causes it by showing that the pre-training prior does have a meaningful effect on the model’s behaviour, even after standard post training, and is a plausible mechanism.
Seeing that adding in specific post training data is enough to overpower the pretraining prior is unsurprising, and valuable evidence about how to fix it, but not much evidence about the cause in baseline post training. It can simultaneously be true that post-training effects can be larger than hyperstition, but that in baseline Claude hyperstition is present while significant post-training effects are not, so hyperstition is the cause but not the correct fix.
More generally, I was being snarky because I think it’s unreasonable to accuse a piece of not providing evidence for one of its claims when it has a section about evidence for that claim, which I do think provides useful (though not conclusive) evidence. I think you just disagree with the evidence.
Nevertheless, hyperstition does not appear in any classical theory of alignment and marks a departure from classical alignment research. It’s also all too convenient to be used by an AI lab and you should be skeptical about the motivations. Crucially, I believe hyperstition isn’t particularly relevant to superalignment, and trying to prevent it by naive means would most likely backfire. Finally, hoping the model will stay in an aligned persona seems like a bad alignment approach.
This section read to me as presenting the deviation from classical alignment theory with a negative valence and as a critique, which is why I was pushing back. I do agree with it as a factual statement. If you intended it as a purely factual observation then I think we agree.
They make a bit of a mock-up of “hyperstitioning misaligned AI” in reverse: I didn’t mention the part where they also do some mock-up of post-training after training it on a bunch of concentrated positive stories.
Here are just two countertheories that also agree with that data:
It previously acted based on the Chekhov’s guns and based on the many human-human blackmail stories it had read, filling in itself as the disgruntled human employee. Directly training on a ton of positive stories just overwhelmed that.
Its RL post training actually gave it some weak power-seeking drive and its world model kind of told it that the blackmail path would lead to power. Again, directly training on a ton of positive stories just overwhelmed that.
I think in some sense it does provide some evidence for their claim—it is a theory compatible with the data—but not in a sense that would be used for scientific communication. Or enough evidence that this should make them believe this claim is true (as they write in their tweet).
The sentence you quote was overclaiming, adjusted.
Thanks for the edit! I still disagree about “vaguely”, but the new sentence seems much more reasonable to me.
I had already written the research is vaguely related later, so there was also an internal consistency issue here.