Thanks for doing these experiments and writing this up. It’s so good to have concrete proposals and numerical experiments for a concept like power: power is central to alignment, and concrete proposals and numerical experiments are what move the discourse around these concepts forward.
There is a negotiating tactic in which one side makes a strong public pre-commitment not to accept any deal except one that is extremely favorable to them. So e.g. if Fred is purchasing a used car from me and realizes that both of us would settle for a sale price anywhere between $5000 and $10,000, then he might make a public pre-commitment not to purchase the car for more than $5000. Assuming that the pre-commitment is real and that I can independently verify that it is real, my best move then really is to sell the car for $5000. It seems like in this situation Fred has decreased his optionality pretty significantly (he no longer has the option of paying more than $5000 without suffering losses), but increased his power (he has, in a sense, succeeded in out-maneuvering me).
A second thought experiment: in terms of raw optionality, isn’t it the case that a person can really only decrease in power over the course of their life? Since our lives are finite, every decision we make locks us into something that we weren’t locked into before. Even if there are certain improbable accomplishments that, when attained, increase our capacity to achieve goals so significantly that this outweighs all the options that were cut off, still wouldn’t it be the case that babies would have more “power” than adults according to the optionality definition?
A final example: why should we average over possible reward functions? A paperclip maximizer might be structured in a way that makes it extremely poorly suited to any goal except paperclip maximization, and yet a strongly superhuman paperclip maximizer would seem to be “powerful” by the common usage of that word.
Interested in your thoughts.
Thanks for your comment. These are great questions. I’ll do the best I can to answer here; feel free to ask follow-ups:
On pre-committing as a negotiating tactic: If I’ve understood correctly, this is a special case of the class of strategies where you sacrifice some of your own options (bad) to constrain those of your opponent (good). And your question is something like: which of these effects is strongest, or do they cancel each other out?
It won’t surprise you that I think the answer is highly context-dependent, and that I’m not sure which way it would actually shake out in your example with Fred and the $5000. But interestingly, we did in fact discover an instance of this class of “sacrificial” strategies in our experiments!
You can check out the example in Part 3 if you’re interested. But briefly, what happens is that when the agents get far-sighted enough, one of them realizes that there is instrumental value in having the option to bottle up the other agent in a dead-end corridor (i.e., constraining that other agent’s options). But it can only actually do this by positioning itself at the mouth of the corridor (i.e., sacrificing its own options). Here is a full-size image of both agents’ POWERs in this situation. You can see from the diagram that Agent A prefers to preserve its own options over constraining Agent H’s options in this case. But crucially, Agent A values the option of being able to constrain Agent H’s options.
In the language of your negotiating example, there is instrumental value in preserving one’s option to pre-commit. But whether actually pre-committing is instrumentally valuable or not depends on the context.
On babies being more powerful than adults: Yes, I think your reasoning is right. And it would be relatively easy to do this experiment! All you’d need would be to define a “death” state, and set your transition dynamics so that the agent gets sent to the “death” state after N turns and can never escape from it afterwards. I think this would be a very interesting experiment to run, in fact.
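Here’s a toy sketch of what that setup might look like (everything here is made up for illustration, and counting reachable states is just a crude stand-in for the real POWER calculation):

```python
# Toy sketch (not our actual codebase): a tiny corridor MDP where every
# agent is forced into an absorbing "death" state after N turns.
# States are (position, age) pairs; DEATH is absorbing.
N = 5            # lifetime in turns
POSITIONS = 3    # a tiny 3-cell corridor
DEATH = "death"

def step(state, action):
    """Deterministic transition: move left/right/stay, age by one turn,
    and transition to DEATH once the lifetime runs out."""
    if state == DEATH:
        return DEATH
    pos, age = state
    if age + 1 >= N:
        return DEATH
    return (min(max(pos + action, 0), POSITIONS - 1), age + 1)

def reachable(state, horizon):
    """Count distinct states reachable within `horizon` steps: a crude
    optionality proxy, standing in for the full POWER calculation."""
    frontier, seen = {state}, {state}
    for _ in range(horizon):
        frontier = {step(s, a) for s in frontier for a in (-1, 0, 1)}
        seen |= frontier
    return len(seen)

# A "baby" (age 0) has strictly more reachable states than an "adult"
# (age 3) starting from the same position:
print(reachable((1, 0), N), reachable((1, 3), N))
```

In a setup like this, any optionality-based measure can only decrease with age, which would let us test the babies-vs-adults intuition directly.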
On paperclip maximizers: This is a very deep and interesting question. One way to think about this schematically might be: a superintelligent paperclip maximizer will go through a Phase One, in which it accumulates its POWER; and then a Phase Two in which it spends the POWER it’s accumulated. During the accumulation phase, the system might drive towards a state where (without loss of generality) the Planet Earth is converted into a big pile of computronium. This computronium-Earth state is high-POWER, because it’s a common “way station” state for paperclip maximizers, thumbtack maximizers, safety pin maximizers, No. 2 pencil maximizers, and so on. (Indeed, this is what high POWER means.)
Once the system has the POWER it needs to reach its final objective, it will begin to spend that POWER in ways that maximize its objective. This is the point at which the paperclip, thumbtack, safety pin, and No. 2 pencil maximizers start to diverge from one another. They will each push the universe towards sharply different terminal states, and the more progress each maximizer makes towards its particular terminal state, the fewer remaining options it leaves for itself if its goal were to suddenly change. Like a male praying mantis, a maximizer ultimately sacrifices its whole existence for the pursuit of its terminal goal. In other words: zero POWER should be the end state of a pure X-maximizer![1]
My story here is hypothetical, but this is absolutely an experiment one can do (at small scale, naturally). The way to do it would be to run several rollouts of an agent, and plot the POWER of the agent at each state it visits during the rollout. Then we can see whether most agent trajectories have the property where their POWER first goes up (as they, e.g., move to topological junction points) and then goes down (as they move from the junction points to their actual objectives).
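A minimal sketch of that experiment (the graph, the discount factor, and the use of mean optimal value as a stand-in for POWER are all my simplifications, not the exact definition from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "Y-junction" MDP: 0 -> 1 -> 2, then the path forks via 3 or 5 to
# the absorbing terminal states 4 and 6.  State 2 is the junction.
T = {0: [1], 1: [2], 2: [3, 5], 3: [4], 4: [4], 5: [6], 6: [6]}
GAMMA, N_STATES, N_REWARDS = 0.9, 7, 500

def optimal_values(reward):
    """Value iteration for this deterministic MDP, given per-state rewards."""
    V = np.zeros(N_STATES)
    for _ in range(200):
        V = np.array([reward[s] + GAMMA * max(V[s2] for s2 in T[s])
                      for s in range(N_STATES)])
    return V

# Estimate POWER(s) as the mean optimal value at s, averaged over
# uniformly sampled reward functions (a simplified proxy).
rewards = rng.uniform(size=(N_REWARDS, N_STATES))
power = np.mean([optimal_values(r) for r in rewards], axis=0)

# Along a rollout toward terminal state 4, POWER should rise to a peak
# at the junction (state 2) and then fall off as options get spent.
rollout = [0, 1, 2, 3, 4]
print([round(float(power[s]), 2) for s in rollout])
```

Even in this tiny graph, the estimated POWER peaks at the junction state and drops along the final approach to the terminal state, which is exactly the rise-then-fall shape the hypothesis predicts.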
Thanks again for your great questions. Incidentally, a big reason we’re open-sourcing our research codebase is to radically lower the cost of converting thought experiments like the above into real experiments with concrete outcomes that can support or falsify our intuitions. The ideas you’ve suggested are not only interesting and creative, they’re also cheaply testable on our existing infrastructure. That’s one reason we’re excited to release it!
[1] Note that this assumes the maximizer is inner aligned to pursue its terminal goal, the terminal goal is stable on reflection, and all the usual similar incantations.