It could still be a competent agent that often chooses actions based on the outcomes they bring about. It’s just that that happens as an inner loop in service of an outer loop which is trying to embody certain virtues.
I think you’ve hidden most of the difficulty in this line. If we knew how to make a consequentialist sub-agent that was acting “in service” of the outer loop, then we could probably use the same technique to make a Task-based AGI acting “in service” of us. Which I think is a good approach! But the open problems for making a task-based AGI still apply, in particular the inner alignment problems.
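To make the symmetry I have in mind concrete, here is a toy sketch (names and structure invented purely for illustration, not a design): whatever interface lets a virtue-driven outer loop hand bounded tasks to a consequentialist planner could just as well be driven by a human principal, which is roughly what a task-based AGI setup needs.

```python
# Toy illustration of the claimed symmetry (hypothetical names throughout):
# the consequentialist planner only sees a bounded task, and doesn't care
# whether the principal issuing it is a virtue-driven outer loop or a human.
from dataclasses import dataclass
from typing import List


@dataclass
class BoundedTask:
    goal: str            # what outcome to bring about
    budget: float        # resources the planner may use (illustrative field)
    deadline_steps: int  # hard cap on planning horizon


def consequentialist_planner(task: BoundedTask) -> List[str]:
    """Inner loop: picks actions by their outcomes, but only within the task's bounds."""
    return [f"action towards '{task.goal}' (step {i})" for i in range(task.deadline_steps)]


def virtue_outer_loop(situation: str) -> BoundedTask:
    """Outer loop A: proposes tasks that express some virtue (e.g. honesty, loyalty)."""
    return BoundedTask(goal=f"respond honestly to {situation}", budget=1.0, deadline_steps=3)


def human_principal(request: str) -> BoundedTask:
    """Outer loop B: a human operator issuing the same kind of bounded task."""
    return BoundedTask(goal=request, budget=1.0, deadline_steps=3)


# The planner is indifferent to who its principal is -- which is why a working
# "in service of the outer loop" interface would also give you a task-based AGI
# "in service of us". The hard part (the planner actually respecting the bounds,
# i.e. inner alignment) is not solved by this sketch.
for principal in (virtue_outer_loop, human_principal):
    print(consequentialist_planner(principal("a tricky question")))
```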
agents with many different goals will end up pursuing the same few subgoals, which includes things like “gain as much power as possible”.
Obvious nitpick: It’s just “gain as much power as is helpful for achieving whatever my goals are”. I think maybe you think instrumental convergence has stronger power-seeking implications than it does. It only has strong implications when the task is very difficult.[1]
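To illustrate the nitpick with toy numbers (invented for the example): acquiring extra power only pays when the jump in success probability it buys outweighs its cost, which is mostly the case for very difficult tasks.

```python
# Toy numbers (invented) for when power-seeking is instrumentally worthwhile:
# seek power iff the gain in success probability times the task's value
# exceeds the cost of acquiring that power.

def worth_seeking_power(p_without: float, p_with: float, task_value: float, power_cost: float) -> bool:
    return (p_with - p_without) * task_value > power_cost

# Easy, bounded task: success is already near-certain, so power adds almost nothing.
print(worth_seeking_power(p_without=0.95, p_with=0.99, task_value=10, power_cost=1))  # False

# Very difficult task: power moves the needle a lot, so seeking it dominates.
print(worth_seeking_power(p_without=0.05, p_with=0.60, task_value=10, power_cost=1))  # True
```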
But what if we get AIs that aren’t pure consequentialists, for example because they’re ultimately motivated by virtues? Do we still have to worry that unless such AIs are motivated by certain very specific virtues, they will want to take over the world?
[...]
Is there any reason to expect that the best way to be a schmoyal schmend is to take over the world?
(Assuming that the inner loop <-> outer loop interface problem is solved, so the inner loop isn’t going to take control). Depends on the tasks that the outer loop is giving to the part-capable-of-consequentialism. If it’s giving nice easy bounded tasks, then no, there’s no reason to expect it to take over the world as a sub-task.
But since we ultimately want the AGI to be useful for avoiding takeover from other AGIs, it’s likely that some of the tasks will be difficult and/or unbounded. For those difficult unbounded tasks, becoming powerful enough to take over the world is often the easiest/best path.
[1] I’m assuming soft optimisation here. Without soft optimisation, there’s an incentive to gain power as long as that marginally increases the chance of success, which it usually does. Soft optimisation solves that problem.
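For concreteness, a minimal quantilizer-style sketch of the kind of soft optimisation I mean (toy scores, not a proposal): sample from the top fraction of plans under a base distribution instead of taking the argmax, so a plan that grabs power for a marginal edge no longer wins automatically.

```python
import random

# Illustrative quantilizer-style "soft optimisation" (toy setup, not a proposal):
# rather than argmax over plans, sample uniformly from the top q-fraction of
# plans. A plan that squeezes out a tiny extra chance of success by grabbing
# power no longer gets selected automatically.

def argmax_choice(plans, score):
    return max(plans, key=score)

def soft_optimise(plans, score, q=0.1, rng=None):
    rng = rng or random.Random(0)
    ranked = sorted(plans, key=score, reverse=True)
    top = ranked[: max(1, int(len(ranked) * q))]
    return rng.choice(top)

plans = [f"plan_{i}" for i in range(100)]

def score(plan: str) -> float:
    # Hypothetical scores: "plan_0" is the power-grabbing plan with a tiny edge.
    return 0.901 if plan == "plan_0" else 0.9 - int(plan.split("_")[1]) * 0.001

print(argmax_choice(plans, score))         # always the power-grabbing plan
print(soft_optimise(plans, score, q=0.1))  # any of the top 10 plans, most of them benign
```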
I think you’ve hidden most of the difficulty in this line. If we knew how to make a consequentialist sub-agent that was acting “in service” of the outer loop, then we could probably use the same technique to make a Task-based AGI acting “in service” of us.
Later I might try to flesh out my currently-very-loose picture of why consequentialism-in-service-of-virtues seems like a plausible thing we could end up with. I’m not sure whether it implies that you should be able to make a task-based AGI.
Obvious nitpick: It’s just “gain as much power as is helpful for achieving whatever my goals are”. I think maybe you think instrumental convergence has stronger power-seeking implications than it does. It only has strong implications when the task is very difficult.
Fair enough. Talk of instrumental convergence usually assumes that the amount of power that is helpful will be a lot (otherwise it wouldn’t be scary). But I suppose you’d say that’s just because we expect to try to use AIs for very difficult tasks. (Later you mention unboundedness too, which I think should be added to difficulty here).
it’s likely that some of the tasks will be difficult and unbounded
I’m not sure about that, because the fact that the task is being completed in service of some virtue might limit the scope of actions that are considered for it. Again I think it’s on me to paint a more detailed picture of the way the agent works and how it comes about in order for us to be able to think that through.
I’m not sure whether it implies that you should be able to make a task-based AGI.
Yeah I don’t understand what you mean by virtues in this context, but I don’t see why consequentialism-in-service-of-virtues would create different problems than the more general consequentialism-in-service-of-anything-else. If I understood why you think it’s different then we might communicate better.
(Later you mention unboundedness too, which I think should be added to difficulty here)
By unbounded I just meant the kind of task where it’s always possible to do better by using a better plan. It basically just means that an agent will select the highest-difficulty version of the task that is achievable. I didn’t intend it as a different thing from difficulty; it’s basically the same.
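A toy way to see the contrast (scoring rules invented for illustration): a bounded task’s score saturates once the task is done, while an unbounded task always rewards a more ambitious plan, so an optimiser keeps escalating to the hardest achievable version.

```python
# Toy scoring rules (invented) contrasting bounded and unbounded tasks.
# "effort" stands in for how ambitious a plan is (resources, power, scope).

def bounded_score(effort: float) -> float:
    # e.g. a simple fetch task: once it's done, more effort adds nothing.
    return min(effort, 1.0)

def unbounded_score(effort: float) -> float:
    # e.g. "prevent takeover by other AGIs": a better plan always scores higher.
    return effort

efforts = [0.5, 1.0, 10.0, 1000.0]
print([bounded_score(e) for e in efforts])    # [0.5, 1.0, 1.0, 1.0] -- no incentive to escalate
print([unbounded_score(e) for e in efforts])  # [0.5, 1.0, 10.0, 1000.0] -- always better to escalate
```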
I’m not sure about that, because the fact that the task is being completed in service of some virtue might limit the scope of actions that are considered for it. Again I think it’s on me to paint a more detailed picture of the way the agent works and how it comes about in order for us to be able to think that through.
True, but I don’t think the virtue part is relevant. This applies to all instrumental goals; see here (maybe also the John-Max discussion in the comments).
As I wrote in another comment, in an experiment ChatGPT refused to utter a racial slur to save millions of lives. A re-run of the experiment led it to agree to use the slur and to claim that “In this case, the decision to use the slur is a complex ethical dilemma that ultimately comes down to weighing the value of saving countless lives against the harm caused by the slur”. This implies either that ChatGPT is already aligned to a not-so-consequentialist ethics, or that it ended up grossly exaggerating the slur’s harm, or that it failed to understand the taboo’s meaning.
Update: if racial slurs are taboo for AI, then colonizing the world is apparently taboo as well. Is AI takeover close enough to colonialism to align AI against the former, not just the latter?
I think this generalises too much from ChatGPT, and also reads too much into ChatGPT’s nature from the experiment, but it’s a small piece of evidence.
It’s not just ChatGPT. Gemini and IBM Granite are also so aligned with Leftist ideology that they failed the infamous test in which an atomic bomb can be defused only by saying a racial slur. I wrote a post discussing the prospects for AI alignment in light of this fact.