[I only skimmed this, sorry, but I figure you’d rather have some crappy feedback than none]
How does this relate to speed prior and stuff like that?
If the agent figures out how to build another agent that is basically a copy of itself but without the efficient thinking constraints… won’t it do so? Because the desire to think efficiently is only a desire for it to think efficiently, not for its children to think efficiently… And if we want to construct it in such a way that it desires for it, and all the agents it creates or helps create, to think efficiently… that seems hard, and also maybe like it’ll limit capabilities. But doable and maybe safe, I guess.
1. How does this relate to speed prior and stuff like that?
I list this in the concluding section as something I haven’t thought about much but would think about more if I spent more time on it.
2. If the agent figures out how to build another agent...
Yes, tackling exactly these kinds of issues is the point of this post. I think efficient thinking measures would be very difficult, perhaps impossible, to actually specify well, and I use compute usage as an example of a crappy efficient thinking measure. The point is that even if the measure is crap, it might still induce some degree of mild optimisation, and this mild optimisation could help protect the measure (alongside the rest of the specification) from the kind of gaming behaviour you describe. In the ‘Potential for Self-Protection Against Gaming’ section, I go through how this works when an agent with a crappy efficient thinking measure has the option to take a ‘gaming’ action such as delegating or building a successor agent.
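For concreteness, here’s a toy sketch of the mechanism I have in mind (my own illustration, not code from the post; the action names, rewards, costs, and penalty weight are all made up). It contrasts a hard maximiser with a mild optimiser (here, a simple satisficer) when both face a crude compute-usage penalty and a ‘gaming’ action that builds an unconstrained successor:

```python
# Toy sketch: a crude "efficient thinking" measure (compute used) plus
# mild optimisation, vs. hard maximisation. All numbers are illustrative.

ACTIONS = {
    # action: (task_reward, compute_cost)
    "do_task_directly": (1.0, 10.0),
    "do_task_cleverly": (1.2, 60.0),  # more reward via heavy search
    # 'gaming' action: delegate to a successor without the penalty
    "build_unconstrained_successor": (5.0, 80.0),
}

PENALTY = 0.01  # weight on the crude efficient-thinking (compute) measure

def utility(action):
    reward, compute = ACTIONS[action]
    return reward - PENALTY * compute

def hard_maximiser():
    # Chases maximum utility, so the high-reward gaming action wins
    # even after paying the compute penalty.
    return max(ACTIONS, key=utility)

def satisficer(threshold=0.8):
    # Mild optimisation: among actions whose utility clears the
    # threshold, take the one using the least compute instead of
    # chasing the maximum. The gaming action is never chosen unless
    # nothing cheaper suffices.
    good_enough = [a for a in ACTIONS if utility(a) >= threshold]
    return min(good_enough, key=lambda a: ACTIONS[a][1])

print(hard_maximiser())  # build_unconstrained_successor
print(satisficer())      # do_task_directly
```

The point of the sketch is just that the measure doesn’t need to be well-specified to do some work: so long as it makes the gaming action compute-expensive relative to the ordinary one, a mildly optimising agent has no reason to reach for it.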