So the concern here is that even if the goal, say, robustly penalizes gaining influence, the agent still has internal selection pressures for seeking influence? And this might not be penalized by the outer criterion if the policy plays nice on-distribution?
The goal the agent is selected to score well on is not necessarily the goal the agent is itself pursuing. So unless the agent’s internal goal matches the goal for which it was selected, the agent might still seek influence, because its internal goal permits that. I think this is in part what Paul means by “Avoiding end-to-end optimization may help prevent the emergence of influence-seeking behaviors (by improving human understanding of and hence control over the kind of reasoning that emerges)”.
And if the internal goal doesn’t permit that? I’m trying to feel out which levels of meta are problematic in this situation.