The bad news is: I strongly expect ASI to have some consequentialist preferences—see my post “Thoughts on Process-Based Supervision” §5.3. The good news is, I think it’s possible for ASI to also have non-consequentialist preferences.
Is that to say that you expect the AI to have preferences not just over the state of the world, but also over the kinds of strategies and plans it uses to get there? E.g., it could have preferences for things like “being honest” or “making use of plans that involve an exponential increase in power (instead of some other curve shape)”?
Yeah, that’s part of it. Also maybe that it can have preferences about “state of the world right now and in the immediate future”, and not just “state of the world in the distant future”.
For example, if an ASI wants me to ultimately wind up with power, that’s a preference about the distant future, so its best bet might be to forcibly imprison me somewhere safe, gather maximum power for itself, and hand that power to me later on. Whereas if an ASI wants me to retain power continuously, then presumably the ASI would be corrigible. But “me retaining power” is something about the state of the world, not directly about the ASI’s strategies and plans, IMO.
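A toy sketch of the distinction (my framing, not a claim about how an ASI’s preferences would actually be represented): write $s_0, s_1, \dots, s_T$ for the trajectory of world-states and $\pi$ for the ASI’s plan or policy.

$$U_{\text{final}}(\pi) = u(s_T), \qquad U_{\text{trajectory}}(\pi) = \sum_{t=0}^{T} u_t(s_t), \qquad U_{\text{plan}}(\pi) = u(s_T) + v(\pi).$$

A purely consequentialist $U_{\text{final}}$ only scores the end state, so “imprison me now, hand me power at time $T$” can look just as good as anything else. $U_{\text{trajectory}}$ penalizes any intermediate state in which I’ve lost power, which is the “retain power continuously” case above. And the $v(\pi)$ term in $U_{\text{plan}}$ can directly reward properties of the plan itself, like honesty, independent of the outcome.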
(Also, “expect” is not quite right, I was just saying that I don’t find a certain argument convincing, not that this definitely isn’t a problem. I’m still pretty unsure. And until I have a concrete plan that I expect to work, I am very open-minded to the possibility that finding such a plan is (even) harder than I realize.)