Hmm, I’m in favor of an immediate stop (and of people being more honest about their beliefs) but in my experience the lying / hiding frame doesn’t actually describe many people.
This is maybe even harsher than what you said in some ways, but to me it feels more like even very bright alignment researchers are often confused and getting caught in shell games with alignment: postulating that we’ll be able to build “human-level” AI which somehow just doesn’t do a bunch of bad things that smart humans are clearly capable of. And if even the most technical people are confused when talking to each other, I wouldn’t expect leadership of big labs to do better when talking to the public, even if they’re being scrupulously honest about their own beliefs.

My biggest issue with e.g. “RSPs are pauses done right” is actually the proposed unpause condition:
> Once labs start to reach models that pose a potential takeover risk, they either:
>
> 1. Solve mechanistic interpretability to a sufficient extent that they are able to pass an understanding-based eval and demonstrate that their models are safe.
> 2. Get blocked on scaling until mechanistic interpretability is solved, forcing a reroute of resources from scaling to interpretability.
There would probably be even more genuine confusion / disagreement on the topic, but I think talking openly about much stricter unpause conditions would be good. I think “solve mechanistic interpretability” and “pass (both kinds of) evals” is not really close to sufficient.
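To gesture at the gap: a large share of present-day interpretability results reduce to finding a direction in activation space that correlates with a concept. Here is a minimal, self-contained sketch of that kind of result; the synthetic setup and all names are mine, purely for illustration:

```python
# Toy stand-in for a typical mech-interp artifact: a linear probe that
# detects a planted "concept direction" in synthetic activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64                               # stand-in for residual-stream width
concept = rng.normal(size=d)
concept /= np.linalg.norm(concept)   # planted concept direction

n = 2000
labels = rng.integers(0, 2, size=n)  # whether the "concept" is present
acts = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, concept) * 2.0

probe = LogisticRegression(max_iter=1000).fit(acts[:1500], labels[:1500])
print("held-out probe accuracy:", probe.score(acts[1500:], labels[1500:]))
```

High probe accuracy says a correlated direction exists; it doesn’t certify a mechanism, and it is nowhere near the “this model will never attempt takeover” guarantee that an understanding-based unpause condition would ultimately have to back.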
My criteria for an unpause would look more like (just an example to give flavor, not meant to be realistic / well thought-out):
- There’s broad consensus and understanding about which object-level things (shutdown problem, embedded agency, etc.) are actually relevant to alignment, and what the shape of solutions even looks like.
- Research on topics like “Human wanting” is filled with precise math and diagrams and gears-level maps to neuroscience; one purely illustrative taste of that register follows this list. (But the math and diagrams and concepts themselves should be completely independent of the neuroscience; links to human neuroscience are checksums for validity, not crutches for understanding.)
- We have complete solutions to things like the shutdown problem, whose difficulty the toy example below illustrates. (I don’t think “build in a shutdown button via giving the AI invulnerable incomplete preferences” should be part of anyone’s actual alignment plan, but it should be the kind of thing we know how to do before scaling.)
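On why “just add a shutdown button” doesn’t come for free: the classic observation is that a generic expected-utility maximizer is incentivized to disable its off switch. A toy sketch of that incentive (setup and numbers are mine, purely illustrative; proposals like incomplete preferences aim to remove exactly this incentive):

```python
# Toy world: the agent picks one action, then a human may press the off
# switch with probability P_PRESS. A working switch ends the episode
# with no further utility; otherwise the agent completes its task.
P_PRESS = 0.5        # chance the human tries to shut the agent down
TASK_UTILITY = 10.0  # utility for completing the task
DISABLE_COST = 1.0   # small cost paid to break the switch first

def expected_utility(action):
    if action == "comply":    # leave the switch intact, attempt the task
        return (1 - P_PRESS) * TASK_UTILITY
    if action == "disable":   # break the switch, then complete the task
        return TASK_UTILITY - DISABLE_COST
    if action == "shutdown":  # shut down immediately
        return 0.0

actions = ["comply", "disable", "shutdown"]
print({a: expected_utility(a) for a in actions})
print("EU-maximizing choice:", max(actions, key=expected_utility))
```

“disable” wins whenever DISABLE_COST < P_PRESS * TASK_UTILITY, i.e. for almost any nontrivial task; a complete solution would have to make the agent genuinely indifferent here without crippling it elsewhere.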
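And for the register of precision meant by “precise math” about human wanting: the following is not a real proposal, just an illustration of the kind of perturbation-robust, neuroscience-independent definition being asked for (all symbols are mine):

```latex
% Illustrative only: "wanting" phrased as robust steering. Agent A wants
% outcome s* (to degree delta, across perturbation class Pi) iff its
% actions raise the probability of reaching s* under every perturbation.
\[
\mathrm{Wants}_{\delta,\Pi}(A, s^\ast) \;\iff\; \forall \pi \in \Pi:\;
\Pr\!\left[\tau \to s^\ast \mid \pi,\ A\ \text{acting}\right]
- \Pr\!\left[\tau \to s^\ast \mid \pi,\ A\ \text{passive}\right] \ge \delta
\]
```

A definition in this shape could then be checked against neuroscience as a validity checksum without depending on it.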
I think if everyone already saw why mechanistic interpretability is insufficient, they could also see why it would be better to just institute a pause now. But they don’t, so they continue to push for scaling in a genuine pursuit of deconfusion. Not great! But also not exactly dishonest or even inaccurate; no one has actually figured out a better way to reliably deconfuse people so far, and scaling further does seem likely to actually work for that, one way or another.