There are a number of priors that lead me to expect much of current AI safety research to be low quality:
A lot of science is low quality. It’s the default expectation for a research field.
It’s pre-paradigmatic. Norms haven’t been established yet for what works in the real world, which methods are reliable and which amount to p-hacking, and so on. This makes it not only difficult to produce good work; it also makes it hard to recognize bad work, and hard to get properly calibrated about how much work is bad, the way we are in established research fields.
It’s subject to selection effects by non-experts: it gets amplified by advocates, journalists, policy groups, and the general public. This incentivizes hype and spin over rigor.
It’s a very ideological field, because there’s not a lot of empirical evidence to go on, a lot of people’s opinions were formed before LLMs exploded, and people’s emotions are (rightly) strong about the topic.
I’m part of the in-group and I identify with—sometimes even know—the people doing the research. All tribal biases apply.
Now, some of this may be attenuated by the field being inspired by LessWrong and therefore having some norms like research integrity, open discussion, and a strong culture of criticism, but I don’t think those forces are strong enough to counteract the other ones.
If you believe “AI safety is fundamentally much harder than capabilities, and therefore we’re in danger”, you should also believe “AI safety is fundamentally much harder than capabilities, and therefore there are a lot of invalid and unreliable claims”.
Also, this will vary across subfields. Those with a tighter connection to real-world outcomes, like interpretability, I would expect to be less bad. But I’m not familiar enough with the subfields to say more about specific ones.