Can I ask why you think the AI safety community has been/will be so impactful (99% → 60%)? I think you believe the community has much more reach and power than it actually does.
Dylan Bowman
Dylan Bowman’s Shortform
That doesn’t change the fact that the pioneers really only pursued neural networks because of their similarity to the actual structure of the brain, not by first-principles reasoning about how high dimensionality and gradient descent scale well with data, size, and compute (I understand this is a high bar but this is part of why I don’t think there are any real “experts”). And in their early career especially, they were all mired in the neurological paradigm for thinking about neural networks.
Hinton, who got close to breaking free from this way of thinking when he published the backprop paper, ends it by saying “it is worth looking for more biologically plausible ways of doing gradient descent.” In fact, his 2022 forward-forward algorithm shows his approach is still tied to biological plausibility. A 2023 interview with the University of Toronto, he mentions that the reason he got concerned about superintelligence was that when working on the FF algorithm, he realized that backpropagation was just going to be better than any optimization algorithm inspired by the brain.
AI Alignment Is Turning from Alchemy Into Chemistry, but for real this time
In April 2023, Alexey Guzey posted “AI Alignment Is Turning from Alchemy Into Chemistry” where he reviewed Burns et al.’s paper “Discovering Latent Knowledge in Language Models Without Supervision.” Some excerpts to summarize Alexey’s post:
Alexey ended up being quite wrong: Burns’ paper, while very interesting, didn’t inspire impactful follow-up research in eliciting beliefs or contribute to any alignment/control techniques used at the labs.
Despite being much more optimistic about the alignment community’s ability to eventually make progress than Alexey, I did agree with him that alignment was still waiting for a killer research direction. Up to that point, and for around 2 years after, very few alignment papers actually produced insights or techniques that meaningfully affect how AI is trained and deployed. When I applied to an AI safety grantwriting role at Open Philanthropy in early 2024, one of the questions on the application was roughly “What do you think the most important alignment paper has been?” and I answered with the original RLHF paper because up until that point, it was the ~only major technique to come out of the alignment community that actually steered an AI system to behave more safely (feel free to correct me here, I’m also counting RLAIF and constitutional AI in this bucket).
But with recent work in emergent misalignment and inoculation prompting (Betley et al., MacDiarmid et al., Wichers et al.), I think alchemy really is turning into chemistry. We have:
Realistic model organisms of misalignment, unlike previous contrived examples like in the alignment faking or sleeper agents work.
Relatedly, a mechanistic model of how misalignment can arise in real, existing processes.
Repeated examples of a surprising technique that works to control real models.
I’m really excited to see new work that comes out of this research direction. I think there’s a lot of opportunity to start creating more in vitro model organisms in reward hacking setups, and more accessible model organisms mean that more researchers can contribute to creating control techniques. With more work studying the physics of how RL posttraining and reward hacking affect model goals and capabilities, there’s also more value in having evaluation techniques that can assess model alignment.