This is a cool idea. With regards to the agent believing that it’s impossible to influence the probability that its plan passes validation, won’t this either (1) be very difficult to achieve, or else (2) screw up the agent’s other beliefs? After all, if the agent’s other beliefs are accurate, they’ll imply that the agent can influence the probability that its plan passes validation. So either (a) the agent’s beliefs are inconsistent, or (b) the agent makes its beliefs consistent by coming to believe that it can influence the probability that its plan passes validation, or else (c) the agent makes its beliefs consistent by coming to believe something false about how the world works. Each of these possibilities seems bad.
Here’s an alternative way of ensuring that the agent never pays costs to influence the probability that its plan passes validation: ensure that the agent lacks a preference between every pair of outcomes which differ with respect to whether its plan passes validation. I think you’re still skeptical of the idea of training agents to have incomplete preferences, but this seems like a more promising avenue to me.
With regards to the agent believing that it’s impossible to influence the probability that its plan passes validation
This is a misinterpretation. The agent has entirely true beliefs. It knows it could manipulate the validation step; it just doesn’t want to, because of the conditional shape of its goal. This is common behaviour among humans: for example, you wouldn’t mess up a medical test to make it come out negative, because you need to know the result in order to know what to do afterwards.
Oh I see. In that case, what does the conditional goal look like when you translate it into a preference relation over outcomes? I think it might involve incomplete preferences.
Here’s why I say that. For the agent to be useful, it needs to have some preference between plans conditional on their passing validation: there must be some plan A and some plan A+ such that the agent prefers A+ to A. Then given Completeness and Transitivity, the agent can’t lack a preference between shutdown and each of A and A+. If the agent lacks a preference between shutdown and A, it must prefer A+ to shutdown. It might then try to increase the probability that A+ passes validation. If the agent lacks a preference between shutdown and A+, it must prefer shutdown to A. It might then try to decrease the probability that A passes validation. This is basically my Second Theorem and the point that John Wentworth makes here.
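Spelled out in preference notation (this is just my own sketch of the argument, not the Second Theorem’s exact statement, reading “lacks a preference” as indifference once Completeness holds, with S standing for shutdown):

```latex
% Sketch of the argument above. \succ = strict preference, \sim = indifference.
\begin{align*}
  &\text{Usefulness: } A^{+} \succ A.\\
  &\text{If both } S \sim A \text{ and } S \sim A^{+} \text{ held, Transitivity would give }
    A^{+} \sim A, \text{ contradicting } A^{+} \succ A.\\
  &\text{So at most one can hold:}\\
  &\qquad S \sim A \;\Rightarrow\; A^{+} \succ S
    \quad \text{(incentive to raise the probability that } A^{+} \text{ passes validation)},\\
  &\qquad S \sim A^{+} \;\Rightarrow\; S \succ A
    \quad \text{(incentive to lower the probability that } A \text{ passes validation).}
\end{align*}
```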
I’m not sure the medical test is a good analogy. I don’t mess up the medical test because true information is instrumentally useful to me, given my goals. But (it seems to me) true information about whether a plan passes validation is only instrumentally useful to the agent if the agent’s goal is to do what we humans really want. And that’s something we can’t assume, given the difficulty of alignment.
In that case, what does the conditional goal look like when you translate it into a preference relation over outcomes?
We can’t reduce the domain of the utility function without destroying some information. If we tried to change the domain variables from [g, h, shutdown] to [g, shutdown], we wouldn’t get the desired behaviour. Maybe you have a particular translation method in mind?
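Here’s a toy sketch of the point (the payoffs and the exact shape of u below are illustrative assumptions, not the actual u from the post):

```python
# Toy illustration of why collapsing the domain [g, h, shutdown] to [g, shutdown]
# loses information. Payoffs are illustrative only.
# g: original goal achieved?  h: plan passed validation?  shutdown: agent shut down?

def u(g: bool, h: bool, shutdown: bool) -> float:
    """A conditionally shaped utility: pursue g if the plan validates, else shut down."""
    if h and not shutdown:
        return 1.0 if g else 0.0   # validated: utility tracks the original goal g
    if not h and shutdown:
        return 0.5                 # failed validation and shut down: acceptable
    return 0.0                     # mismatched cases get no credit

# Two outcomes that agree on (g, shutdown) but differ on h get different utilities,
# so no utility defined only on the reduced domain (g, shutdown) can reproduce u.
assert u(g=True, h=True, shutdown=False) != u(g=True, h=False, shutdown=False)
```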
I don’t mess up the medical test because true information is instrumentally useful to me, given my goals.
Yep that’s what I meant. The goal u is constructed to make information about h instrumentally useful for achieving u, even if g is poorly specified. The agent can prefer h over ~h or vice versa, just as we prefer a particular outcome of a medical test. But because of the instrumental (information) value of the test, we don’t interfere with it.
I think the utility indifference genre of solutions (which try to avoid preferences between shutdown and not-shutdown) is unnatural and creates other problems. My approach allows the agent to shut down even if it would prefer to be in the non-shutdown world.
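And here’s a toy expected-utility calculation of that point (again, the numbers are illustrative assumptions rather than anything from the post): the agent strictly prefers the validated, non-shutdown outcome, yet manipulating the validation still comes out worse in expectation, because it throws away the information the agent needs to act on.

```python
# Toy decision problem with illustrative numbers: honest validation beats
# manipulating it, even though the agent strictly prefers the non-shutdown world
# when the plan is good.

p_good = 0.6   # probability the plan would genuinely pass an honest validation

U_PURSUE_GOOD = 1.0   # plan validated and genuinely good: best outcome
U_SHUTDOWN    = 0.5   # plan fails validation and the agent shuts down: acceptable
U_PURSUE_BAD  = 0.0   # plan pursued despite being bad: worst outcome

# Policy 1: let validation run honestly, then act on its result.
eu_honest = p_good * U_PURSUE_GOOD + (1 - p_good) * U_SHUTDOWN

# Policy 2: manipulate validation so the plan always "passes", then pursue it.
eu_manipulate = p_good * U_PURSUE_GOOD + (1 - p_good) * U_PURSUE_BAD

print(eu_honest, eu_manipulate)   # 0.8 vs 0.6: observing beats interfering
assert eu_honest > eu_manipulate
```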