Note that this is somewhat of an anti-empirical stance—by hypothesizing that superintelligence will arrive by some unknown breakthrough that would both take advantage of current capabilities and render current alignment methods moot—you are essentially saying that no evidence can update you.
One thing I like about your position is that you basically demand that Eliezer and Nate tell you what kind of alignment evidence would update them towards believing it's safe to proceed. As in, E&N would say we need really good interp insights, good governance, good corrigibility on hard tasks, and so on. I would expect them to set the requirements very high and you to reject those requirements as too high, but it still seems useful for Eliezer and Nate to state their requirements. (Though perhaps they have done this at some point and I missed it.)
To respond to your claim that no evidence could update ME and that I am anti-empirical: I don't quite see where I wrote anything like that. I am making the literal point that you present two options, either scaling up current methods leads to superintelligence, or it requires paradigm shifts/totally new approaches. But there is also a third option: that there are multiple paths to superintelligence open right now, both paradigm shifts and scaling up.
Yes, I do expect that current “alignment” methods like RLHF or CoT monitoring will predictably fail, for overdetermined reasons, once systems are powerful enough to kill us and run their own economy. There is empirical evidence against CoT monitoring and against RLHF. In both cases we could also have predicted failure without empirical evidence, just from conceptual thinking (people will upvote what they like rather than what's true; CoT will become less understandable the less the model is trained on human data), though the evidence helps. I am seeing lots of evidence that current methods will fail, so no, I don't think I am anti-empirical. I also don't think that empiricism should be used as anti-epistemology, or as an argument for not having a plan and blindly stepping forward.
I also believe that our current alignment methods will not scale and that we need to develop new ones. In particular, I am a co-author of the scheming paper mentioned in the first link you cite.
As I said multiple times, I don't think we will succeed by default. I just think that if we fail, we will do so multiple times, with failures continually growing in magnitude and impact.