Petr Kašpárek comments on A Timing Problem for Instrumental Convergence

Petr Kašpárek 31 Aug 2025 11:59 UTC
1 point
0
I’m looking more closely at the Everitt et al. paper, and I’m less sure I actually understood you. Everitt et al.’s conclusion is that an agent will resist goal change if it evaluates the future using its current goal. That condition can fail in two different ways: (A) the agent does not evaluate the future, or (B) it does not use its current goal to evaluate the future. From your conclusions, it would seem that you are assuming A. If you were assuming B, then you would have to conclude that the agent will want to change its goal so that it is always maximally satisfied. But your language seems to be aiming at B. Either way, it seems that you are assuming one of these.
- rhys southan 31 Aug 2025 12:11 UTC
  1 point
  0
  I don’t assume A or B. The argument is not about what maximally satisfies an agent. Goal abandonment need not satisfy anything. The point is just that goal abandonment does not dissatisfy anything.
  - Petr Kašpárek 31 Aug 2025 12:32 UTC
    1 point
    0
    Then I don’t really understand your argument.
    As Ronya gets ready for bed on Monday night, she deliberates about whether to change her goal. She has two options: (1) she can preserve her goal of eating cake when presented to her, or (2) she can abandon her goal. Ronya decides to abandon her goal of eating cake when presented to her. On Tuesday, a friend offers Ronya cake and Ronya declines.
    Could you explain to me how Ronya does not violate her goal on Monday night? Let me reformulate the goal so that it is more formal: Ronya wants to minimize the number of occasions on which she is presented with cake but does not eat it. As you said, you assume that she evaluates the future with her current goal. She reasons:
    1. Preserve the goal. Tomorrow I will be presented with cake and eat it. Number of failures: 0
    2. Abandon the goal. Tomorrow I will be presented with cake and fail to eat it. Number of failures: 1
    Ronya preserves the goal.
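    A minimal sketch of this tally, assuming the formalized goal above and the Tuesday outcomes from the example. The function and variable names are made up for illustration and do not come from the paper:

```python
# Hypothetical sketch: Ronya scores each option with the goal she holds on Monday night.

def failures_by_current_goal(presented_cake: bool, eats_cake: bool) -> int:
    """Count one failure when cake is presented but not eaten."""
    return int(presented_cake and not eats_cake)

# Assumed Tuesday outcomes for each Monday-night option.
options = {
    "preserve": {"presented_cake": True, "eats_cake": True},
    "abandon": {"presented_cake": True, "eats_cake": False},
}

# Every option is scored by the same (current) goal, whatever goal Ronya holds on Tuesday.
tally = {name: failures_by_current_goal(**outcome) for name, outcome in options.items()}
best = min(tally, key=tally.get)
print(tally, best)
```

    On this way of counting, abandoning the goal scores worse, because the count ignores whether Ronya still holds the goal on Tuesday.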
    - rhys southan 31 Aug 2025 12:54 UTC
      1 point
      0
      The paper argues that the number of failures in 2 (goal abandonment) is also 0. This is because it is no longer her goal once she abandons it. She fails by “the goal” but never fails by “her goal.” Cake isn’t the best case for this. The argument for this is in 3.4 and 3.5.
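      A rough sketch of that way of counting, applied to the cake case anyway. The names here, including the `holds_goal_then` flag, are illustrative assumptions and not a formalism from the paper:

```python
# Hypothetical sketch: a failure counts only if the agent still holds the goal
# at the moment the failure would occur.

def failures_by_goal_held_then(presented_cake: bool, eats_cake: bool,
                               holds_goal_then: bool) -> int:
    """Count one failure only when the goal is held and cake is presented but not eaten."""
    return int(holds_goal_then and presented_cake and not eats_cake)

# Assumed Tuesday outcomes for each Monday-night option.
options = {
    "preserve": {"presented_cake": True, "eats_cake": True, "holds_goal_then": True},
    "abandon": {"presented_cake": True, "eats_cake": False, "holds_goal_then": False},
}

tally = {name: failures_by_goal_held_then(**outcome) for name, outcome in options.items()}
print(tally)
```

      Under this assumption both options come out at zero failures, so the count alone gives no reason to prefer preserving the goal.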
      - Petr Kašpárek 31 Aug 2025 13:33 UTC
        1 point
        0
        You are clearly assuming B, i.e. not using the current goal to evaluate the future. You even state it explicitly:
        Means-rationality does not prohibit setting oneself up to fail concerning a goal one currently has but will not have at the moment of failure, as this never causes an agent to fail to achieve the goal that they have at the time of failing to achieve it.
        rhys southan 31 Aug 2025 13:53 UTC
        1 point
        0
        They could be using their current goal to evaluate the future, but include in that future the fact that they won’t have the goal. This doesn’t require excluding the goal from their analysis altogether. It’s just that they judge the failure of this goal to be irrelevant in a future in which they don’t have it.
        rhys southan 31 Aug 2025 14:02 UTC
        1 point
        0
        Maybe this is still B, in which case I might have interpreted it more strictly than you intended.