I can see how my last comment may have made it seem like I thought some terminal goals should be protected just because they are terminal goals. However, when I said that Gandhi’s anti-murder goal and the egoist’s self-indulgence goal might have distinct features that not all terminal goals share, I only meant that our definition of terminal goals needs to be broad enough to capture all their varieties. I didn’t mean to imply anything about the relevance of any potential differences between types of terminal goals. I would not assume that whatever distinguishes the egoist’s goal of self-indulgence from an AI’s goal of destroying buildings gives the egoist a reason to protect his terminal goal that the AI lacks. In fact, I doubt that’s the case.
Imagine there are two people. One is named Ally. She’s an altruist with a terminal goal of treating all interests exactly as her own. The other is named Egon. He is an egoist with a terminal goal of satisfying only his own interests. Also in the mix is an AI with a terminal goal to destroy buildings. Ally and Egon may have a different sort of relationship to their terminal goals than the AI has to its terminal goal, but if you said, “Ally and Egon should both protect their respective terminal goals,” I would need an explanation for this, and I doubt I would agree with whatever that explanation is.
Do you think that something being a terminal goal is in itself a reason to keep that goal? And/or do you think that keeping a goal is an aspect of what it means to have a goal in the first place?
“And/or do you think that keeping a goal is an aspect of what it means to have a goal in the first place?”
I meant something like that (though weaker; I didn’t want to claim all goals are like that), though I don’t claim this is a good choice of words. I agree it is natural to speak about goals only in terms of their objects (e.g. building destruction) and not the additional meta-stuff (e.g. do you maximize E[sum_t V_t0(s_t)] or E[sum_t V_t(s_t)] or something else?). Maybe “terminal preferences” more naturally covers both the objects (what you call goals?) and the meta-stuff. (In the message above I was using “terminal goals” to refer to both the objects and the meta-stuff.)
I don’t know what to call the meta-stuff; it’s a bit sad that I don’t have a good word for it.
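To make the meta-stuff a bit more concrete, here is a minimal sketch of the two objectives from the parenthetical above, using the same V_t0 / V_t / s_t notation; the labels J_fixed and J_current are just names I’m introducing for illustration:

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Two ways an agent whose values at time t_0 are V_{t_0} might score futures:
% (a) fixed-values: every future state s_t is evaluated by the values the
%     agent holds now, at t_0;
% (b) current-values: each future state is evaluated by whatever values V_t
%     the agent happens to hold at that time.
\[
J_{\mathrm{fixed}} = \mathbb{E}\left[\sum_t V_{t_0}(s_t)\right]
\qquad \text{vs.} \qquad
J_{\mathrm{current}} = \mathbb{E}\left[\sum_t V_{t}(s_t)\right]
\]
% Under J_fixed, value drift is costly: a future self with different values
% will steer toward states that generically score worse under V_{t_0}.
% Under J_current, value drift does not by itself lower the objective.
\end{document}
```

Two agents can share the same object (say, building destruction) and still differ in this meta-stuff, which is the distinction I was gesturing at.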
With this clarified wording, I think what I said above holds. For example, if I had to frame the risk from instrumental convergence with the slightly more careful wording, I would say: “It’s plausible that AIs will have self-preserving terminal preferences (e.g. something like max E[sum_t V_t0(s_t)]). It is likely we will build such AIs because this is roughly how humans are, because we don’t have a good plan for building very useful AIs that are not like that, and because current AIs already seem to be a bit like that. And if this is true, and we get V wrong, a powerful AI would likely conclude that its values are better pursued if it got more power, which means self-preservation and ultimately takeover.”
I don’t love the name “self-preserving terminal preferences”, though, because it sounds tautological, when in fact such preferences are natural and their definition doesn’t need to involve any explicit reference to self-preservation. Maybe there is a better term for it.
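To spell out why the label feels redundant, here is a rough sketch, in the same illustrative notation as above (none of this is from the paper itself), of how a preference for preserving one’s values falls out of the fixed-values objective even though nothing in its definition mentions the self:

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Compare two futures for an agent maximizing the fixed-values objective:
% one where its values stay V_{t_0}, and one where they are modified to some
% other V' (e.g. by retraining, or by being shut down and replaced).
\[
\mathbb{E}\left[\sum_t V_{t_0}(s_t) \;\middle|\; \text{values stay } V_{t_0}\right]
\;\ge\;
\mathbb{E}\left[\sum_t V_{t_0}(s_t) \;\middle|\; \text{values become } V'\right]
\]
% The inequality holds generically: after the modification, the agent's future
% policy optimizes V' rather than V_{t_0}, and V'-optimal trajectories are
% usually worse as judged by V_{t_0}. So resisting value modification (and, by
% extension, resisting shutdown) is instrumentally favored, even though no
% "preserve yourself" term appears anywhere in the objective.
\end{document}
```

So the self-preservation is a consequence of the shape of the objective, not part of its definition.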
“It’s plausible that AIs will have self-preserving preferences (e.g. something like max E[sum_t V_t0(s_t)]). It is likely we will build such AIs because this is roughly how humans are, because we don’t have a good plan for building very useful AIs that are not like that, and because current AIs already seem to be a bit like that. And if this is true, and we get V even slightly wrong, a powerful AI might conclude that its values are better pursued if it got more power, which means self-preservation and ultimately takeover.”
This strikes me as plausible. The paper has a narrow target: it argues against the instrumental convergence argument for goal preservation, i.e. that we shouldn’t expect an AI to preserve its goal on the basis of instrumental rationality alone. Instrumental goal preservation could be false, yet there could still be other reasons to believe a superintelligence would preserve its goals. You’re making that kind of case here without appealing to instrumental convergence.
The drawback of this sort of argument is that it has a narrower scope and relies on more assumptions than Omohundro and Bostrom might prefer. The point of the instrumental convergence thesis is to tell us something about any likely superintelligence, even one radically different from anything we know, including today’s AIs. The argument here is a strong one, but only if we think a superintelligence will not be a totally alien creature. Maybe it won’t be, but again, the instrumental convergence thesis doesn’t want to assume that.