Interesting pitch. How well does it work with 2 people?
Neel Nanda
LLM-Driven Feature Discovery
Yeah, I liked teaching Claude why, and found it had sufficient detail to replicate. I would suggest editing the top level post to remove it as an example
How transparent is DiffusionGemma (and why it matters)
Not to my knowledge. Seems like an interesting resource to create!
Not easily, sorry
Synthetic document finetuning for instilling positive traits
Why Do Naive SFT Filters For Safety Properties Fail?
SFT Drives Gemini’s Safety Properties
Building and evaluating model diffing agents
Models May Behave Worse When Eval Aware
Building Better Activation Oracles
I doubt DOI requesting is important, you just need a reasonable bibtex with the title, link and authors. It is a bit awkward that sometimes the author list is LW usernames rather than actual names, but that seems hard to solve
There’s a lot, off the top of my head: LASR, MARS, Pivotal, SPAR
Eliezer lists “OpenPhil-funded groups” as part of who he is criticising. The people habryka quotes typically fit that demographic better than unbridled capabilities
Great post, I like the level of helpful detail. This advice seems pretty reasonable to me, thanks for writing!
Thanks for the edit! I still disagree about vaguely, but the new sentence seems much more reasonable to me
I precisely describe this aspect of their research in the very beginning
I disagree with your summary of their argument. There are two questions here: why does baseline Claude blackmail and how can it be fixed? The positive hyperstition result acts as evidence about what causes it by showing that the pre-training prior does have a meaningful effect on the model’s behaviour, even after standard post training, and is a plausible mechanism.
Seeing that adding in specific post training data is enough to overpower the pretraining prior is unsurprising, and valuable evidence about how to fix it, but not much evidence about the cause in baseline post training. It can simultaneously be true that post-training effects can be larger than hyperstition, but that in baseline Claude hyperstition is present while significant post-training effects are not, so hyperstition is the cause but not the correct fix.
More generally I was being snarky because I think it’s unreasonable to accuse a piece of not providing evidence for one of its claims when it has a section about evidence for that claim, which I do think provides useful (though not conclusive) evidence. I think you just disagree with the evidence
Nevertheless, hyperstition does not appear in any classical theory of alignment and marks a departure from classical alignment research. It’s also all too convenient to be used by an AI lab and you should be skeptical about the motivations. Crucially, I believe hyperstition isn’t particularly relevant to superalignment, and trying to prevent it by naive means would most likely backfire. Finally, hoping the model will stay in an aligned persona seems like a bad alignment approach.
This section read to me as presenting the deviation from classical alignment theory with a negative valence, as a critique, which is why I was pushing back. I do agree with it as a factual statement. If you intended it as a purely factual observation then I think we agree
People were probably left with the impression their research attached proved this point or at least provided evidence for it. I don’t think this is true and I don’t think they should have written it like this.
Have you read the research? They have a section called “WHY DOES AGENTIC MISALIGNMENT HAPPEN?” which talks through various hypotheses and provides evidence, e.g. that training on positive stories improves alignment and that more of them have a better effect (evidence that hyperstition is a plausible mechanism) and that this effect persists after further alignment training (suggesting post training is not overpowering a pretraining prior). I don’t think they specifically show that misaligned stories about AI are definitely the cause in pretraining; I bet that things like a propensity to role play, and the scenario being super contrived and having a bunch of Chekhov’s guns that only make sense if you blackmail, were also big effects. But they do provide helpful evidence that pretraining is a big factor and it’s a plausible hypothesis
This seems like a departure from classical alignment theory.
Sure, but this seems fine to me: classical alignment theory was largely invented before we had access to modern LLMs, so you should expect it to be missing a lot of important stuff. I think the persona selection model seems plausible and big if true, and explains anomalies like emergent misalignment much better than classical alignment theory. I have generally been impressed with how well things like their focus on character training seem to have done for Claude’s alignment, though I do also agree that people at Anthropic seem to too often underrate power-seeking misalignment risk, and it’s hard to forecast how well theories about current models will last for future models
Seems like a reasonable list. I would also add that the idea of interpreting models is very aesthetically pleasing and intellectually satisfying to many smart people, which I think gives interpretability a significant advantage over other fields. I’d also guess that good coding tutorials, first from me, then Arena, were extremely impactful. More so than most things on your list.
One of the places I would start, if trying to do similar technical field building in other domains, is good educational materials, aiming to make the field seem accessible and have people feel like there’s a bunch of things to do. I would also give people ways to get started, e.g. small projects, where they feel satisfied and excited to learn more.
I think podcasts were also very high leverage, at least if you can get on a good one (my best ones got to 100K+ people; talks are way less leveraged), though it helps a lot to have a message, exciting works to point to that can be explained to newcomers, and ideally a call to action / educational materials to point to.