rhys southan comments on A Timing Problem for Instrumental Convergence

rhys southan 29 Aug 2025 8:26 UTC
1 point
0
Petr,

Thanks for this response. Wide-scope and narrow-scope don’t determine how a goal is defined. These are different theories about what is rationally required of an agent who has a goal, with respect to their goal.
I would define a goal as some end that an agent intends to bring about. Is this inconsistent with how many people here would see a goal? Or potentially consistent but underspecified?
- Petr Kašpárek 31 Aug 2025 11:32 UTC
  1 point
  0
  Parent
  As I said, I’m not familiar with the philosophy, concepts, and definitions that you mention. Per my best understanding, the concept of a goal in AI is derived from computer science and decision theory. I imagine people in the early 2000′s thought that the goal/utility would be formally specified, defined, and written as code in the system. The only possible way for the system to change the goal would be via self-modification.
  Goals in people are something different. Their goals are derived from their values.^[1] I think you would say that people are ends-rational. In my opinion, in your line of thought it would be more helpful to think of AI goals as more akin to people’s values. Both people’s values and AI goals are something fundamental and unchangeable. You might argue that people do change their values sometimes, but what I’m really aiming at are fundamental hard-to-describe beliefs like “I want my values to be consistent.”
  Overall, I’m actually not really sure how useful this line of investigation into goals is. For example, Dan Hendrycks has a paper on AI risk, where he doesn’t assume goal preservation; on the contrary, he talks about goal drift and how it can be dangerous (section 5.2). I suggest you check it out.
  1. ^
    I’m sure there is also a plethora of philosophical debate about what goals (in people) really are and how they are derived. Same for values.
  - rhys southan 31 Aug 2025 12:39 UTC
    3 points
    0
    Parent
    The instrumental convergence thesis doesn’t depend on being applied to a digital agent. It’s supposed to apply to all rational agents. So, for this paper, there’s no reason to assume the goal takes the form of code written into a system.
    
    There may be a way to lock an AI agent into a certain pattern of behaviour or a goal that it can’t revise, by writing code in the right way. But if an AI keeps its goal because it can’t change its goal, that has nothing to do with the instrumental convergence thesis.
    If an agent can change its goal through self-modification, the instrumental convergence thesis could be relevant. If an agent could change its goal through self-modification, I’d argue the agent does not behave in an instrumentally irrational way if it modifies itself to abandon its goal.
    
    The paper doesn’t take a stance on whether humans are ends-rational. If we are, this could sometimes lead us to question our goals and abandon them. For instance, a human might have a terminal goal to have consistent values, then later decide consistency doesn’t matter in itself and abandon that terminal goal and adopt inconsistent values. The paper assumes a superintelligence won’t be ends-rational since the orthogonality thesis is typically paired with the instrumental convergence thesis, and since it’s trivial to show that ends-rationality could lead to goal change.
    
    In this paper, a relevant difference between humans and an AI is that an AI might not have well-being. Imagine there is one human left on earth. The human has a goal to have consistent values, then abandons that goal and adopts inconsistent values. The paper’s argument is the human hasn’t behaved in an instrumentally irrational way. The same would be true for an AI that abandons a goal to have consistent values.
    This potential well-being difference between humans and AIs (of humans having well-being and AIs lacking it) becomes relevant when goal preservation or goal abandonment affects well-being. If having consistent values improves the hypothetical human’s well-being, and the human abandons this goal of having consistent values and then adopts inconsistent values, the human’s well-being has lowered. With respect to prudential value, the human has made a mistake.
    If an AI does not have well-being, abandoning a goal can’t lead to a well-being-reducing mistake, so it lacks this separate reason to goal preserve. An AI might have well-being, in which case it might have well-being-based reasons to goal preserve or goal abandon. The argument in this paper assumes a hypothetical superintelligence without well-being, since the instrumental convergence thesis is meant to apply to those too.
    - rhys southan 31 Aug 2025 12:59 UTC
      1 point
      0
      Parent
      It just occurred to me that since you implied that ends-rationality would make goal abandonment less likely, you might be using it in a different way than me, to refer to terminal goals. The paper assumes an AI will have terminal goals, just as humans do, and that these terminal goals are what can be abandoned. Ends-rationality provides one route to abandoning terminal goals. The paper’s argument is that goal abandonment is also possible without this route.