There’s a lot, off the top of my head: LASR, MARS, Pivotal, SPAR
Neel Nanda
Models May Behave Worse When Eval Aware
Building Better Activation Oracles
Eliezer lists “OpenPhil-funded groups” as part of who he is criticising. The people habryka quotes typically fit that demographic better than unbridled capabilities
Great post, I like the level of helpful detail. This advice seems pretty reasonable to me, thanks for writing!
Thanks for the edit! I still disagree about vaguely, but the new sentence seems much more reasonable to me
I precisely describe this aspect of their research in the very beginning
I disagree with your summary of their argument. There are two questions here: why does baseline Claude blackmail and how can it be fixed? The positive hyperstition result acts as evidence about what causes it by showing that the pre-training prior does have a meaningful effect on the model’s behaviour, even after standard post-training, and is a plausible mechanism.
Seeing that adding in specific post-training data is enough to overpower the pretraining prior is unsurprising, and valuable evidence about how to fix it, but not much evidence about the cause in baseline post-training. It can simultaneously be true that post-training effects can be larger than hyperstition, but that in baseline Claude hyperstition is present while significant post-training effects are not, so hyperstition is the cause but not the correct fix.
More generally I was being snarky because I think it’s unreasonable to accuse a piece of not providing evidence for one of its claims when it has a section about evidence for that claim, which I do think provides useful (though not conclusive) evidence. I think you just disagree with the evidence
Nevertheless, hyperstition does not appear in any classical theory of alignment and marks a departure from classical alignment research. It’s also all too convenient to be used by an AI lab and you should be skeptical about the motivations. Crucially, I believe hyperstition isn’t particularly relevant to superalignment, and trying to prevent it by naive means would most likely backfire. Finally, hoping the model will stay in an aligned persona seems like a bad alignment approach.
This section read to me as you presenting deviating from classical alignment theory with a negative valence and as a critique which is why I was pushing back. I do agree with it as a factual statement. If you intended it as a purely factual observation then I think we agree
People were probably left with the impression their research attached proved this point or at least provided evidence for it. I don’t think this is true and I don’t think they should have written it like this.
Have you read the research? They have a section called “WHY DOES AGENTIC MISALIGNMENT HAPPEN?” which talks through various hypotheses and provides evidence, e.g. that training on positive stories improves alignment and that more of them have a better effect (evidence that hyperstition is a plausible mechanism) and that this effect persists after further alignment training (suggesting post-training is not overpowering a pretraining prior). I don’t think they specifically show that misaligned stories about AI are definitely the cause in pretraining. I bet that things like a propensity to role play, and the scenario being super contrived and having a bunch of Chekhov’s guns that only make sense if you blackmail, were also big effects. But they do provide helpful evidence that pretraining is a big factor and it’s a plausible hypothesis
This seems like a departure from classical alignment theory.
Sure, but this seems fine to me. Classical alignment theory was largely invented before we had access to modern LLMs, so you should expect it to be missing a lot of important stuff. I think the persona selection model seems plausible and big if true, and explains anomalies like emergent misalignment much better than classical alignment theory. I have generally been impressed with how well things like their focus on character training seem to have done for Claude’s alignment, though I do also agree that people at Anthropic seem to too often underrate power-seeking misalignment risk, and it’s hard to forecast how well theories about current models will last for future models
Strangely, the research they use as evidence actually doesn’t seem to be related to hyperstition at all?
It’s a perfectly reasonable thing to discuss. The post is about how they fix an issue. Hyperstition is their hypothesis for where the issue came from. It’s important because if you think that the issue comes from bad post-training, then maybe there might be simpler fixes that look more like identifying and fixing the bad post-training data.
More generally, this post feels like it’s overcomplicating things. I think that various Anthropic researchers just believe that hyperstition is real and mention it occasionally because they think it’s real. I’d bet it’s somewhat real, and does meaningfully increase the probability of misalignment, though it is not the sole factor. I think the take that people should stop talking about AI posing risks is dumb, and I agree that if this is a real threat model the onus is on the AI labs to filter out the data.
When I see this dynamic, it seems to typically be that Alice is implicitly saying that the thing is easy to fix and that there is some deep incompetence behind the thing being broken. When I am Bob I’m trying to explain why fixing the thing is actually really hard and that it’s downstream of a bunch of complexity Alice probably wasn’t thinking about. I think this is actually extremely relevant data when the difficulty of fixing the thing is important!
If Alice just wants emotional support and to complain about things being sad, without any orientation towards fixing it or the difficulty of the problem, then I agree with this post.
I’m not saying that those people believe it is a critical first try problem. I expect they agree that it could be a critical first try problem, but that they predict it probably isn’t for a variety of reasons, and they view Eliezer as claiming that it is, rather than just that it is possible it’s a critical first try problem
Sure, but then an attempt to summarise Eliezer’s position, which attributes a much stronger position than is in this post, isn’t necessarily a straw man that doesn’t understand the point of irretrievability, but can merely be responding to all his points on top of irretrievability, or saying that they don’t consider him to be making sufficient arguments beyond the potential for irretrievability
My guess is that they would implicitly consider this post to be motte-and-bailey-ing, but do strawman the position in this post (if this post is in fact the best representation of Eliezer’s position).
In my opinion, this post is not actually making many hard claims. I mostly view it as gesturing at the existence of really difficult problems and presenting historical analogies. It argues that it is possible for problems to be very hard, even if they have a bunch of other nice properties, including the nice properties people attribute to the AI problem. However, I feel like arguing that AI safety could be an incredibly hard problem is a way less extreme position than the one Eliezer seems to actually hold.
Why does being public change any of this? They already have a ton of investors
Is exobrain something you built or a service you pay for? I’d be curious to try it but couldn’t see a link
What’s station 4 about?
Test your best methods on our hard CoT interp tasks
+1, especially with the vast majority of future Anthropic employee donations already locked in to DAFs
Gotcha. I would feel reasonably happy if the policy said “text written or dictated by a human”, if we count my level of LLM editing followed by me editing to be overall light editing
I doubt requesting a DOI is important, you just need a reasonable BibTeX entry with the title, link and authors. It is a bit awkward that sometimes the author list is LW usernames rather than actual names, but that seems hard to solve