I can see how my last comment may have made it seem like I thought some terminal goals should be protected just because they are terminal goals. However, when I said that Gandhi’s anti-murder goal and the egoist’s self-indulgence goal might have distinct features that not all terminal goals share, I only meant that we need a definition of terminal goals broad enough to capture all their varieties. I didn’t mean to imply anything about the relevance of any differences between types of terminal goals. I would not assume that whatever distinguishes an egoist’s goal of self-indulgence from an AI’s goal of destroying buildings implies that the egoist should protect his terminal goal even if the AI need not. In fact, I doubt that’s the case.
Imagine two people: Ally, an altruist with a terminal goal of treating all interests exactly as her own, and Egon, an egoist with a terminal goal of satisfying only his own interests. Also in the mix is an AI with a terminal goal of destroying buildings. Ally and Egon may have a different sort of relationship to their terminal goals than the AI has to its terminal goal, but if you said, “Ally and Egon should both protect their respective terminal goals,” I would need an explanation for this, and I doubt I would agree with whatever that explanation is.
Do you think that something being a terminal goal is in itself a reason to keep that goal? And/or do you think that keeping a goal is an aspect of what it means to have a goal in the first place?
“It’s plausible that AIs will have self-preserving preferences (e.g. like E[sum_t V_t0(s_t)]). It is likely we will build such AIs because this is roughly how humans are, we don’t have a good plan to build very useful AIs that are not like that, and current AIs seem to be a bit like that. And if this is true, and we get V even slightly wrong, a powerful AI might conclude its values are better pursued if it got more power, which means self-preservation and ultimately takeover.”
This strikes me as plausible. The paper has a narrow target: it argues against the instrumental convergence argument for goal preservation, i.e., against the claim that we should expect an AI to preserve its goal on the basis of instrumental rationality alone. But even if instrumental goal preservation is false, there could be other reasons to believe a superintelligence would preserve its goals, and you’re making that kind of case here without appealing to instrumental convergence.
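Spelling out my reading of the formula you quoted, since it carries the weight of the argument (the extra notation, like the policy $\pi$ and the contrast case, is mine, not yours):

$$U(\pi) = \mathbb{E}_\pi\left[\sum_{t \ge t_0} V_{t_0}(s_t)\right]$$

An agent with this objective scores every future state by the value function it holds at $t_0$. If its later behavior is driven by whatever value function it is carrying at the time, then letting $V_{t_0}$ be overwritten by some $V'$ produces trajectories that score worse by $V_{t_0}$, so protecting $V_{t_0}$ serves the objective. By contrast, an agent with the objective $\mathbb{E}_\pi\left[\sum_t V_t(s_t)\right]$, which evaluates each state by whatever values it holds at that moment, has no such reason to resist value drift. On this reading, the goal-preservation conclusion rests on an assumption about what kind of preferences we will actually build into AIs, not on a claim about instrumental rationality in general.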
The drawback to this sort of argument is that it has a narrower scope and relies on more assumptions than Omohundro and Bostrom might prefer. The instrumental convergence thesis is meant to tell us something about any likely superintelligence, even one radically different from anything we know, including today’s AIs. The argument here is a strong one, but only if we think a superintelligence will not be a totally alien creature. Maybe it won’t be, but again, the instrumental convergence thesis is designed not to rest on that assumption.