However, the main flaw is that it won’t work if the planner is “situationally aware”, in the sense that it knows that its plan will be checked against some validation set. If this is the case, a clever planner will treat the validation step as an obstacle, and look for ways to overcome it.
A potential way to avoid this is as follows.
It requires:
A pointer to the world. This might be whatever you are using to find the “difference between the ‘world-outside-this-room’ and the ‘world-outside-this-room-if-no-actions-were-taken’”.
A pointer to a slightly modified world. Specifically, “the copy of the world which differs only in that {the copy of the AI in it}’s code does not check its plan against a validation set”. (I’ll denote this as W2)
Then program your AI to do the following:
Compute “what, if output[1] by {the AI in W2}, causes goal fulfilled (within W2)”
Check whether that, if output by {the AI in W2}, would lead to the future of W2 violating one of the validation checks.
If yes: shut down. Else, give that output.[2]
(This could break if the chosen output has W2-AI observe its own source code, because that differs from the source code of the AI in the real world[3][4])
I hope that makes sense, let me know if not.
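For concreteness, here is a minimal sketch (in Python-flavoured pseudocode) of the control flow I have in mind. Every name in it (simulate_w2, goal_fulfilled, validation_checks, candidate_outputs) is a hypothetical placeholder for machinery the two pointers above are assumed to give you, not a proposal for how to build that machinery.

```python
class FutureOfW2:
    """Placeholder for whatever representation the world model gives of W2's future."""
    pass


def run_ai(
    candidate_outputs,   # possible outputs: motor commands, screen text, etc. (see footnote 1)
    simulate_w2,         # counterfactual oracle: output -> FutureOfW2, "if W2-AI emitted this"
    goal_fulfilled,      # predicate: FutureOfW2 -> bool
    validation_checks,   # list of predicates: FutureOfW2 -> bool
):
    """Return an output to actually give (in the real world), or None meaning 'shut down'."""
    for output in candidate_outputs:
        future = simulate_w2(output)
        # Step 1: find an output which, if given by {the AI in W2}, causes goal fulfilled within W2.
        if not goal_fulfilled(future):
            continue
        # Step 2: check whether that output would lead the future of W2 to violate a validation check.
        if any(not check(future) for check in validation_checks):
            return None   # "If yes: shut down."
        return output     # "Else, give that output." (in the real world)
    return None
```

The point of the structure is that both queries are asked about W2, not about the real world; the real world only enters at the final step, where the chosen output is actually given.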
[1] (In other words, “what action”; the ‘output’ can be to robotic limbs, for example, or to a screen display; or to another part of the program, which in the real world, but not in W2, then routes through the validation checker, and then on to robotic limbs in both cases)
[2] (in the real world)
[3] Toy example: W2-AI picks a plan which selects some location in its source code, presses backspace some number of times, types some stuff, then saves. When real-AI does this, those same actions delete some of the ‘validation checking’ part of its code, resulting in real-AI not running afterwards because of formatting errors in the code.
[4] Moved this text to a footnote because it’s messy.
In fact, if that’s truly the only difference, meaning W2-AI’s beliefs are initially similar to those of real-AI, the first thing W2-AI would do is check which one it is by observing its source code.
Never mind, that doesn’t happen, because we ask “What, if output by W2-AI, causes goal fulfilled within W2”, not “what would W2-AI do”; although the latter would be easier to specify.
The former confuses me for the same reason your “if no actions were taken” does: for these counterfactuals to be true, something about the history of the pointed-to world would have to be different in the first place, else there is only one possibility. I’m less experienced with these topics than you and would appreciate a pointer to how these concepts can have a coherent/non-causality-violating formalization, to help me learn.
I’m not sure how this is different from the solution I describe in the latter half of the post.