And/or do you think that keeping a goal is an aspect of what it means to have a goal in the first place?
I meant something like that (though weaker; I didn’t want to claim all goals are like that), though I don’t claim this is a good choice of words. I agree it is natural to speak about goals as referring only to their object (e.g. building destruction) and not the additional meta-stuff (e.g. do you maximize E[sum_t V_t0(s_t)] or E[sum_t V_t(s_t)] or something else?). Maybe “terminal preferences” more naturally covers both the objects (what you call goals?) and the meta-stuff. (In the message above I was using “terminal goals” to refer to both the objects and the meta-stuff.)
I don’t know what to call the meta-stuff; it’s a bit sad that I don’t have a good word for it.
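To make the object vs. meta-stuff distinction concrete, here is a minimal toy sketch (my own illustration, not anything from the paper; the outcome values and the LOCK_COST are made up) contrasting the two objectives above: scoring the whole future with the original V_t0 versus scoring each state with whatever V_t the agent holds at that time. Only the first makes preventing value drift instrumentally attractive.

```python
# Toy sketch: max E[sum_t V_t0(s_t)] vs max E[sum_t V_t(s_t)] in a two-step
# world where the agent's value function may drift between t=0 and t=1.

# Hypothetical values: V0 prefers outcome "x", the drifted V1 prefers "y".
V0 = {"x": 1.0, "y": 0.0}
V1_drifted = {"x": 0.0, "y": 1.0}

LOCK_COST = 0.1  # assumed small cost (in V0 units) of preventing value drift


def act(value_fn):
    """At t=1 the agent picks whichever outcome its current values rank highest."""
    return max(value_fn, key=value_fn.get)


def evaluate(initial_choice, objective):
    """Score of choosing 'lock' or 'allow' at t=0, under each objective.

    objective == "V_t0": the t=1 outcome is scored with the original V0.
    objective == "V_t":  the t=1 outcome is scored with the values held at t=1.
    """
    if initial_choice == "lock":
        later_values, cost = V0, LOCK_COST       # drift prevented
    else:
        later_values, cost = V1_drifted, 0.0     # drift happens
    outcome = act(later_values)
    scorer = V0 if objective == "V_t0" else later_values
    return scorer[outcome] - cost


for objective in ("V_t0", "V_t"):
    scores = {choice: evaluate(choice, objective) for choice in ("lock", "allow")}
    print(objective, scores, "-> prefers", max(scores, key=scores.get))

# Expected output:
#   V_t0 {'lock': 0.9, 'allow': 0.0} -> prefers lock
#   V_t {'lock': 0.9, 'allow': 1.0} -> prefers allow
# Only the V_t0 objective makes preserving the original values instrumentally
# attractive, even though "self-preservation" never appears in its definition.
```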
With this clarified wording, I think what I said above holds. For example, if I had to frame the risk from instrumental convergence with this slightly more careful wording, I would say: “it’s plausible that AIs will have self-preserving terminal preferences (e.g. maximizing E[sum_t V_t0(s_t)]). It is likely we will build such AIs because this is roughly how humans are, we don’t have a good plan to build very useful AIs that are not like that, and current AIs seem to be a bit like that. And if this is true, and we get V wrong, a powerful AI would likely conclude its values are better pursued if it got more power, which means self-preservation and ultimately takeover.”
I don’t love calling them “self-preserving terminal preferences”, though, because it sounds tautological, when in fact such preferences are natural and don’t need to involve any explicit reference to self-preservation in their definition. Maybe there is a better name for it.
“It’s plausible that AIs will have self-preserving preferences (e.g. like E[sum_t V_t0(s_t)]). It is likely we will build such AIs because this is roughly how humans are, we don’t have a good plan to build very useful AIs that are not like that, and current AIs seem to be a bit like that. And if this is true, and we get V even slightly wrong, a powerful AI might conclude its values are better pursued if it got more power, which means self-preservation and ultimately takeover.”
This strikes me as plausible. The paper has a narrow target: it argues against the instrumental convergence argument for goal preservation, i.e. that we shouldn’t expect an AI to preserve its goals on the basis of instrumental rationality alone. But even if that instrumental argument fails, there could be other reasons to believe a superintelligence would preserve its goals, and you’re making that kind of case here without appealing to instrumental convergence.
The drawback of this sort of argument is that it has a narrower scope and relies on more assumptions than Omohundro and Bostrom might prefer. The point of the instrumental convergence thesis is to tell us something about any likely superintelligence, even one radically different from anything we know, including today’s AIs. The argument here is a strong one, but only if we think a superintelligence will not be a totally alien creature. Maybe it won’t be, but again, the instrumental convergence thesis is not supposed to assume that.