Seth Herd comments on A Timing Problem for Instrumental Convergence

Seth Herd 30 Jul 2025 21:54 UTC
1 point
0
This is an important consideration, so if it were false, that would be important. The research I’m working on does have the assumption of goal preservation as an instrumentally convergent subgoal.

Based on your description in the abstract, I don’t understand how it could not be instrumental to preserve any goal based on timing. Suppose I have a goal right now. Changing it or deliberately allowing it to change would be irrational, because doing so will prevent me from reaching that goal. Right? Of course, if I change my goal by accident, I no longer have the old goal, and am not rationally bothered by that after the accidental change. Is that the timing perspective you’re referring to? If so, I don’t see the relevance for alignment. If not, what timing are you referring to?

What gives?
- rhys southan 30 Jul 2025 22:28 UTC
  1 point
  0
  Parent
  Yes, the article is about intentional goal change, not accidental goal change. Section 3 of the article addresses objections to the main argument. If you don’t want to read the whole article, you could skip to that section. If you don’t want to read all of section 3, I would suggest reading section 3.1 for the summary of the main argument, then skip to section 3.4 for the “delay objection” and 3.5 for the “goal-first objection.” Those are probably the most relevant subsections for you.
  - Seth Herd 31 Jul 2025 16:17 UTC
    1 point
    0
    Parent
    I hadn’t seen this comment when I wrote the above response. But it still stands.
    
    If you want people to read your article on a counterintuitive result, it seems like you really need to put the central argument in the abstract. Otherwise it’s pretty reasonable to assume that the central argument is fatally flawed.
    
    The fact that your response doesn’t answer the obvious question either makes me even less optimistic about finding that answer if I take the time to read the whole paper.
    
    If there’s just a miscommunication issue, in which there is a good answer but it’s just not in the abstract or that response, or it is but I’m missing it, I’d like to help you improve the communication of your argument.
    - rhys southan 31 Jul 2025 16:19 UTC
      1 point
      0
      Parent
      No need to read the whole paper. Section 3.4 is meant to respond to this sort of objection. You could skip to that.
      - Seth Herd 31 Jul 2025 16:41 UTC
        3 points
        0
        Parent
        As you said in the other comment, reading 3.4 and 3.5 was necessary. That’s because the actual argument you’re making is quite complex and describing it as a timing issue seems quite wrong.
        
        I think you’re envisioning a goal that is not a goal. You’re imagining a goal that is not a desire. That removes the functional property we usually think of a goal as having. If you don’t care whether your goal is accomplished, it’s not a goal in common parlance.
        
        That is the meat of the argument. It’s about the nature of goals when separated from desires. That is relevant to the question of alignment and goals. I suspect you’ve created a contradictory assumption by thinking of a goal that doesn’t have the properties of a desire, but I’m not certain.
        
        If we have called something a goal but also defined it as not having the functional properties of a desire that cause a system to pursue something, how is that remainder still what we’d call a goal? If I tell my boss I have a goal of finishing that report but have no desire to finish it, I’d say it wasn’t really much of a goal and more of a lie. Maybe I meant I’d finish it if nothing I actually desire comes up in the meantime; then I’d say it had a little bit of desire attached to it and just not much.
        
        That’s not a full analysis and I don’t intend to do one right now. I need to get back to my work, which very much assumes that it’s irrational to knowingly allow one’s goal to change. That’s assuming that the goal also has the properties of a desire: that there are functional mechanisms that will cause the system to make decisions that are estimated to cause that goal to be accomplished in the future. A decision to abandon that goal would cause it to not be accomplished, so those functional mechanisms would prevent that decision from being made.
        
        There is some interesting stuff there, but the abstract does not point to it accurately, so I maintain that just not reading the article is the sensible response to that abstract. I strongly suggest you change it.
        rhys southan 31 Jul 2025 17:07 UTC
        2 points
        0
        Parent
        I’m glad you got more out of the argument after reading those sections. I agree that “timing problem” is not the best description for the argument. Calling it “the timing problem” was a relic of an earlier version of the argument that was more about timing. After submitting the paper and getting a revise and resubmit from the journal, I got some feedback from my supervisor that made me realize timing wasn’t the real issue. So, I changed the argument to make it less timing-based. However, I worried that changing the title of the argument and of the paper for a resubmission might disqualify the paper for further consideration (because it might count as a different paper at that point). Maybe that was overly conservative. I might have risked coming up with a new name for the argument and paper if I thought the name of the argument would deter people from reading the paper.
        
        As for your critique after reading the sections, you’ve picked up on the issue I know I need to elaborate on! I’m working on a paper that is precisely about this, and have been since late last year. If you don’t mind, I might reach out to you for feedback once I have a finished draft.
        I appreciate the feedback about the abstract, but the abstract is set in stone at this point.
        Seth Herd 31 Jul 2025 19:35 UTC
        3 points
        0
        Parent
        Ah yes, I suspected that the incentive structure of academic philosophy and publishing in a philosophy journal was a big part of the issue here.
        
        I’d be happy to help with your next paper if you want to talk through the ideas. I’d be less excited to contribute if you’ve already finished a draft. I feel that collaboration is more useful in the idea development stage than the polishing stage. At that point, there’s a lot of real sunk cost, so outside contributions on the important parts of the argument become much less useful. Actually, I think that’s exactly what you described in being unable to really take your advisor’s good advice into account in framing this paper because it came too late in the process.
        rhys southan 31 Jul 2025 21:31 UTC
        1 point
        0
        Parent
        Just to clarify, I did change the argument after meeting with my supervisor, which is reflected in the final published draft. He said he didn’t think timing was the issue, so I figured out a better way to word the argument. The things I didn’t change were the argument title and paper title. I left “the timing problem” as the name for those, even though the argument wasn’t as obviously about timing anymore, because I thought changing the names might be a problem (and it’s entirely possible I was wrong about that). I thought that’s what you had noticed: that the name of the argument didn’t fully suit the argument itself. That’s because I changed the argument but not the name of it.
        Seth Herd 1 Aug 2025 15:09 UTC
        2 points
        0
        Parent
        Yes, I understood all of that and that’s what I was referring to.
        
        That change of argument but not title or abstract was exactly why I found the post so frustrating. The abstract didn’t actually give a good argument, because you’d changed the central argument but couldn’t change the title and didn’t change the abstract that much. I suspected that the practices and incentives of academic philosophy were somehow at fault. They were.
- rhys southan 31 Jul 2025 15:21 UTC
  −1 points
  0
  Parent
  “How it could not be instrumental to preserve any goal...”
  The argument is not that goal preservation isn’t instrumentally useful for achieving a goal. Preserving a goal normally increases the probability of achieving the goal. So, preserving a goal can be instrumentally rational, and usually is. The argument is just that abandoning a goal is not instrumentally irrational; instrumental rationality doesn’t prohibit it. Abandoning goal X makes you worse at achieving goal X, yes, but that doesn’t matter instrumentally, because once you abandon goal X, it’s not your goal anymore, so instrumental rationality doesn’t require taking the means to achieve goal X. It’s “the timing problem” because there’s no point at which abandoning a goal is instrumentally irrational.
  There are objections to this, but I won’t rehash those here, since this is a linkpost. I’d be interested in your take after you’ve read section 3, if you have a chance to look at it.
  - Seth Herd 31 Jul 2025 16:09 UTC
    2 points
    0
    Parent
    Prior to abandoning the goal, abandoning it is irrational. Allowing oneself to deliberately abandon a goal in the future is irrational. I don’t see how what you just said addresses this. I don’t see a valid argument, so I don’t want to read a whole paper on objections to that argument. Your argument addresses the times after abandoning the goal; as far as I can see, it does not address the times before that at all.
    
    Thus, your statement that there is no point at which it’s irrational to abandon a goal seems wrong. All actions are initiated before they happen. It’s irrational to initiate the action of abandoning a goal. Like I said, it can happen by accident, but failing to plan to prevent it is irrational.
    
    If I’m missing something here, please explain?
    - rhys southan 31 Jul 2025 16:16 UTC
      1 point
      0
      Parent
      Yeah, read sections 3.4 and 3.5. These are meant to address your objection here. Especially 3.4. You’re making what we call “the delay objection.”