> In order to do useful superalignment research, I suspect you sometimes need to warn about or at least openly discuss the serious threats that are posed by increasingly advanced AI, but the business model of frontier labs depends on pretending that none of those threats are actually serious.
I think this is overly cynical. Demis Hassabis, Sam Altman, and Dario Amodei all signed the statement on AI risk:
“Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”
They don’t talk about it all the time, but if someone wants to discuss the serious threats internally, there is plenty of external precedent for them to do so.
> frontier labs are only pretending to try to solve alignment
This is probably the main driver of our disagreement. I think hands-off theoretical approaches are pretty much guaranteed to fail, and that successful alignment will look like normal deep learning work. I’d guess you feel the opposite (correct me if I’m wrong), which would explain why it looks to you like they aren’t really trying and it looks to me like they are.
By “it will look like normal deep learning work” I don’t mean it will be exactly the same as mainstream capabilities work—e.g. RLHF was both “normal deep learning work” and notably different from all other RL at the time. Same goes for Constitutional AI.
What seems promising to me is paying close attention to how we’re training the models and how they behave, thinking about their psychology and how the training influences that psychology, and reasoning about how that will change in the next generation.
What are we comparing deep learning to here? Black box: 100% granted.
But for the other problems—power-seeking and emergent goals—I think they will be a problem with any AI system, and in fact they are much less severe in deep learning than I would have expected. Deep learning is basically short-sighted and interpolative rather than extrapolative, which means that when you train it on some set of goals, it by default pursues those goals in a short-sighted way that makes sense. If you train it on poorly formed goals, you can still get bad behaviour, and as it gets smarter we’ll have more issues, but LLMs are a very good base to start from—they’re highly capable, understand natural language, and aren’t power-seeking.
In contrast, the doomed theoretical approaches I have in mind are things like provably safe AI. With these approaches you have two problems: (1) a whole new way of doing AI, which won’t work, and (2) the theoretical advantage—that if you can precisely specify your alignment target, the system will provably optimize for it—is in fact a terrible disadvantage, since you won’t be able to precisely specify your alignment target.
This is what I mean about selective cynicism! I’ve heard the exact same argument about theoretical alignment work—“mainstream deep learning is very competitive and hard; alignment work means you get a fun nonprofit research job”—and I don’t find it convincing in either case.