The instrumental convergence thesis isn’t restricted to digital agents; it’s supposed to apply to all rational agents. So, for the purposes of this paper, there’s no reason to assume the goal takes the form of code written into a system.
It may be possible, by writing the code in the right way, to lock an AI agent into a certain pattern of behaviour or into a goal it can’t revise. But if an AI keeps its goal because it can’t change it, that has nothing to do with the instrumental convergence thesis.
If an agent can change its goal through self-modification, the instrumental convergence thesis could be relevant. In that case, I’d argue the agent does not behave in an instrumentally irrational way if it modifies itself to abandon its goal.
The paper doesn’t take a stance on whether humans are ends-rational. If we are, this could sometimes lead us to question our goals and abandon them. For instance, a human might have a terminal goal to have consistent values, then later decide that consistency doesn’t matter in itself, abandon that terminal goal, and adopt inconsistent values. The paper assumes a superintelligence won’t be ends-rational, both because the orthogonality thesis is typically paired with the instrumental convergence thesis and because it’s trivial to show that ends-rationality could lead to goal change.
In this paper, a relevant difference between humans and an AI is that an AI might not have well-being. Imagine there is one human left on earth. The human has a goal to have consistent values, then abandons that goal and adopts inconsistent values. The paper’s argument is that the human hasn’t behaved in an instrumentally irrational way. The same would be true for an AI that abandons a goal to have consistent values.
This potential difference between humans and AIs (humans having well-being and AIs lacking it) becomes relevant when goal preservation or goal abandonment affects well-being. If having consistent values improves the hypothetical human’s well-being, and the human abandons the goal of having consistent values and then adopts inconsistent values, the human’s well-being is lowered. With respect to prudential value, the human has made a mistake.
If an AI does not have well-being, abandoning a goal can’t lead to a well-being-reducing mistake, so it lacks this separate reason to preserve its goal. An AI might have well-being, in which case it might have well-being-based reasons to preserve or abandon its goals. The argument in this paper assumes a hypothetical superintelligence without well-being, since the instrumental convergence thesis is meant to apply to such agents too.
It just occurred to me that, since you implied that ends-rationality would make goal abandonment less likely, you might be using the term differently than I am, to refer to having terminal goals. The paper assumes an AI will have terminal goals, just as humans do, and that these terminal goals are what can be abandoned. Ends-rationality provides one route to abandoning terminal goals; the paper’s argument is that goal abandonment is also possible without this route.