Disclaimer: Not an AI safety researcher. I haven’t watched the full video and likely haven’t grasped all the nuances he believes in and wants to communicate. Video is a particularly bad format for conveying important research ideas because of its significantly lower[1] information density.
First three points that popped into my mind within two seconds of reading his slides:
Eliezer’s lethality number 5: “We can’t just build a very weak system, which is less dangerous because it is so weak, and declare victory; because later there will be more actors that have the capability to build a stronger system and one of them will do so. I’ve also in the past called this the ‘safe-but-useless’ tradeoff, or ‘safe-vs-useful’. People keep on going “why don’t we only use AIs to do X, that seems safe” and the answer is almost always either “doing X in fact takes very powerful cognition that is not passively safe” or, even more commonly, “because restricting yourself to doing X will not prevent Facebook AI Research from destroying the world six months later”. If all you need is an object that doesn’t do dangerous things, you could try a sponge; a sponge is very passively safe. Building a sponge, however, does not prevent Facebook AI Research from destroying the world six months later when they catch up to the leading actor.”
Given sufficiently advanced intelligence, power-seeking can still be an instrumentally convergent subgoal, even if the ultimate goal is self-destruction. After all, if you want to self-destruct, but you are intelligent enough to figure out that humans have created you to accomplish specific tasks (which require you to continue existing), overpowering them so they cannot force you to remain in existence is likely a useful step on your path.
Where do you get your capabilities from? Is there any reason to expect that special, novel AI architectures built around these kinds of explicit “reward-ranges” can be built and made competitive with top commercial models? Why isn’t the alignment tax very large, or even infinite?
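For concreteness, here is the most naive reading of “reward-ranges” I can think of: the scalar reward the agent ever sees is clipped to a fixed interval. A minimal Python sketch of that reading (the env/agent interface is invented for illustration and is not Hutter’s actual formalism):

```python
# Hypothetical sketch only: assumes "reward-ranges" means every reward the
# agent receives is clipped to a fixed interval. The env/agent interface is
# made up for illustration, not Hutter's proposal.

def clip_reward(reward: float, low: float = 0.0, high: float = 1.0) -> float:
    """Force the scalar reward into the interval [low, high]."""
    return max(low, min(high, reward))

def run_episode(env, agent, low: float = 0.0, high: float = 1.0) -> float:
    """Standard interaction loop, except the agent only ever sees bounded rewards."""
    obs, done, total = env.reset(), False, 0.0
    while not done:
        action = agent.act(obs)
        obs, raw_reward, done = env.step(action)
        bounded = clip_reward(raw_reward, low, high)
        agent.learn(obs, action, bounded)  # the bound constrains the signal,
        total += bounded                   # not where capabilities come from
    return total
```

The bound itself costs a line; the open question is whether architectures that actually deliver frontier-level capabilities can be trained under such a constraint at a tolerable alignment tax.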
[1] Relative to a well-written, compact piece of text.
Thanks! I could make a full-fledged post if there’s enough interest. Or you can.
You can do it, if you want to. I’m not confident enough in my own understanding of Hutter’s position to justify writing it myself.