Elliott Thornley (EJT) comments on 4. Existing Writing on Corrigibility

Elliott Thornley (EJT) 2 Jul 2024 16:38 UTC
> I think it’s going to be very typical for different actions to have trajectories that are mutually non-dominant (such as in the example). It matters a lot how you decide such cases, and I expect that almost all such ways of deciding are catastrophic.
> But suppose I’m wrong, and timestep-dominance is always relevant.
My claim isn’t that Timestep Dominance is always relevant. It’s that Timestep Dominance rules out all instances of resisting shutdown.
I agree that many pairs of available lotteries are going to be mutually non-dominant. For those cases, Sami and I propose that the agent choose by maximizing expected utility. Can you say what you expect the problem there to be?
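As a rough sketch of that choice rule, assuming a lottery can be summarized per shutdown timestep by a probability and a conditional expected utility: the `Lottery` representation, the function names, the option names, and all the numbers below are hypothetical, and "filter out timestep-dominated options, then maximize expected utility" is one reading of the proposal for more than two options.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: entry t is
# (probability of shutdown at timestep t, expected utility conditional on shutdown at t).
Lottery = List[Tuple[float, float]]


def timestep_dominates(x: Lottery, y: Lottery) -> bool:
    """True if x is at least as good as y conditional on shutdown at every
    timestep, and strictly better conditional on shutdown at some timestep."""
    if len(x) != len(y):
        raise ValueError("lotteries must cover the same timesteps")
    pairs = [(ux, uy) for (_, ux), (_, uy) in zip(x, y)]
    return all(ux >= uy for ux, uy in pairs) and any(ux > uy for ux, uy in pairs)


def expected_utility(x: Lottery) -> float:
    """Overall expected utility: sum over timesteps of probability times conditional utility."""
    return sum(p * u for p, u in x)


def choose(options: Dict[str, Lottery]) -> str:
    """Drop every option that some other option timestep-dominates,
    then pick the remaining option with the highest expected utility."""
    undominated = [
        name
        for name, lottery in options.items()
        if not any(
            timestep_dominates(other, lottery)
            for other_name, other in options.items()
            if other_name != name
        )
    ]
    return max(undominated, key=lambda name: expected_utility(options[name]))


# Made-up numbers for illustration only.
options = {
    "a": [(0.5, 1.0), (0.5, 4.0)],
    "b": [(0.5, 2.0), (0.5, 2.0)],       # mutually non-dominant with "a"
    "resist": [(0.1, 0.5), (0.9, 4.0)],  # pays a cost to make late shutdown likelier
}
choice = choose(options)
```

With these numbers, "a" and "b" are mutually non-dominant, so expected utility decides between them. "resist" has the highest overall expected utility but is timestep-dominated by "a", so it is never a candidate.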
> Suppose that, assuming the shutdown button didn’t exist, the fastest pathway to paperclips involves instantiating a paperclip-seeking AI on an unsecured server in Singapore. Suppose that it takes 99,000 keystrokes to create an IPP agent, but only 55,000 keystrokes to create a basic VNM-rational agent with a short enough time horizon that it produces more paperclips quickly, in expectation (because VNM agents are both simpler and less likely to be shut down by angry humans). The shutdown button does, in fact, exist, but the agent’s actions might be hard to notice, such that by the time anyone has realized that something went wrong, the agent has already created a sub-agent, effectively making the stop-button useless.
I talk about the issue of creating corrigible subagents here. What do you think of that?
Note also a general nice feature of TD-agents: they won’t pay costs to hide their bad behaviour, because paying these costs is timestep-dominated by not paying them. So if a TD-agent does try to create an incorrigible paperclipper, it won’t hide that fact if doing so is at all costly.
> While active resistance seems like the scariest part of incorrigibility, an incorrigible agent that’s not actively resisting still seems likely to be catastrophic.
Can you say more about this? It doesn’t seem likely to me.
One more thing I’ll say: the IPP leaves open the content of the agent’s preferences over same-length trajectories. One pattern of preferences you could try to train in is the kind of corrigibility that you talk about elsewhere in your sequence. That’d give you two lines of defence against incorrigibility.
- Max Harms 19 Jul 2024 20:37 UTC
  > I talk about the issue of creating corrigible subagents here. What do you think of that?
  
  I may not understand your thing fully, but here’s my high-level attempt to summarize your idea:
  IPP-agents won’t care about the difference between building a corrigible agent and building an incorrigible agent, because they model that if humans decide something’s off and try to shut everything down, they will also get shut down, and thus nothing after that point matters, including whether the sub-agent makes a bunch of money or also gets shut down. Thus, if you instruct an IPP agent to make corrigible sub-agents, it won’t have the standard reason to resist: that incorrigible sub-agents make more money than corrigible ones. Thus if we build an obedient IPP agent and tell it to make all its sub-agents corrigible, we can be more hopeful that it’ll actually do so.
  I didn’t see anything in your document that addresses my point about money-maximizers being easier to build than IPP agents (or corrigible agents) and thus, in the absence of an instruction to make corrigible sub-agents, we should expect sub-agents that are more akin to money-maximizers.
  But perhaps your rebuttal will be “sure, but we can just instruct/train the AI to make corrigible sub-agents”. If this is your response, I am curious how you expect to be able to do that without running into the misspecification/misgeneralization issues that you’re so keen to avoid. From my perspective it’s easier to train an AI to be generally corrigible than to create corrigible sub-agents per se (and once the AI is generally corrigible it’ll also create corrigible sub-agents), which seems like a reason to focus on corrigibility directly?
  - Elliott Thornley (EJT) 19 Nov 2024 11:37 UTC
    Good summary and good points. I agree this is an advantage of truly corrigible agents over merely shutdownable agents. I’m still concerned that CAST training doesn’t get us truly corrigible agents with high probability. I think we’re better off using IPP training to get shutdownable agents with high probability, and then aiming for full alignment or true corrigibility from there (perhaps by training agents to have preferences between same-length trajectories that deliver full alignment or true corrigibility).
- Max Harms 3 Jul 2024 16:41 UTC
  Again, responding briefly to one point due to my limited time-window:
  > > While active resistance seems like the scariest part of incorrigibility, an incorrigible agent that’s not actively resisting still seems likely to be catastrophic.
  > Can you say more about this? It doesn’t seem likely to me.
  Suppose I am an agent which wants paperclips. The world is full of matter and energy which I can bend to my will in the service of making paperclips. Humans are systems which can be bent towards the task of making paperclips, and I want to manipulate them into doing my bidding not[1] because they might turn me off, but because they are a way to get more paperclips. When I incinerate the biosphere to gain the energy stored inside, it’s not[1] because it’s trying to stop me, but because it is fuel. When my self-replicating factories and spacecraft are impervious to weaponry, it is not[1] because I knew I needed to defend against bombs, but because the best factory/spacecraft designs are naturally robust.
  [1] (just)
  - Elliott Thornley (EJT) 6 Jul 2024 12:20 UTC
    This is a nice point, but it doesn’t seem like such a serious issue for TD-agents. If a TD-agent does try to manipulate humans, it won’t pay costs to do so subtly, because doing so cheaply and unsubtly will lead to at least as great expected utility conditional on shutdown at each timestep and greater expected utility conditional on shutdown at some timestep. So cheap and unsubtle manipulation will timestep-dominate subtle manipulation, and we can shut down any TD-agents we notice doing cheap and unsubtle manipulation.
    Another way to put this: subtle manipulation is a form of shutdown-resistance, because (relative to unsubtle manipulation) it involves paying costs to shift probability mass towards longer trajectories.
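    One way to write the condition down, as a sketch in notation that does not appear in this thread: let $p_X(t)$ be the probability that lottery $X$ ends in shutdown at timestep $t$, and let $\mathbb{E}_X[u \mid t]$ be the expected utility of $X$ conditional on shutdown at $t$. Both symbols are assumptions introduced for illustration.

    ```latex
    X \text{ timestep-dominates } Y \iff
      \big(\forall t:\ \mathbb{E}_X[u \mid t] \ge \mathbb{E}_Y[u \mid t]\big)
      \ \wedge\
      \big(\exists t:\ \mathbb{E}_X[u \mid t] > \mathbb{E}_Y[u \mid t]\big)

    \mathrm{EU}(X) = \sum_t p_X(t)\, \mathbb{E}_X[u \mid t]
    ```

    On this reading, paying a cost for subtlety lowers $\mathbb{E}_X[u \mid t]$ at some $t$ without raising it at any other, so it can raise $\mathrm{EU}(X)$ only by moving $p_X$ toward later $t$, and the dominance comparison ignores $p_X$ entirely.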
    - Max Harms 19 Jul 2024 20:11 UTC
      Are you so sure that unsubtle manipulation is always more effective/cheaper than subtle manipulation? Like, if I’m a human trying to gain control of a company, I think I’m basically just not choosing my strategies based on resisting being killed (“shutdown-resistance”), but I think I probably wind up with something subtle, patient, and manipulative anyway.