Sounds like we probably agree basically everywhere.
Yeah, you can definitely mark me down in the camp of “don’t use ‘inner’ and ‘outer’ terminology”. If you need something for “outer”, how about “reward specification (problem/failure)”?
ADDED: I think I probably don’t want a word for inner-alignment/goal-misgeneralization. It would be like having a word for “the problem of landing a human on the moon, except without the part of the problem where we might actively steer the rocket in the wrong direction”.
I just don’t use the term “utility function” at all in this context. (See §9.5.2 here for a partial exception.) There’s no utility function in the code. There’s a learned value function, and it outputs whatever it outputs, and those outputs determine what plans seem good or bad to the AI, including OOD plans like treacherous turns.
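To make that concrete, here is a minimal sketch (not from the original comment, all names hypothetical) of the kind of setup being described: a model-based planner where there is no explicit utility function anywhere in the code, just a learned value function whose scalar outputs determine which candidate plans look good or bad, including OOD ones.

```python
# Minimal sketch (hypothetical names): plan selection driven entirely by a
# learned value function -- no explicit utility function in the code.

import random
from typing import Callable, List, Sequence


def choose_plan(
    candidate_plans: Sequence[List[str]],        # each plan = a list of actions
    world_model: Callable[[List[str]], object],  # predicts the state a plan leads to
    value_fn: Callable[[object], float],         # learned critic, outputs a scalar
) -> List[str]:
    """Pick the plan whose predicted outcome the learned value function rates highest.

    The value function "outputs whatever it outputs", including on
    out-of-distribution plans; those outputs are what make a plan seem
    good or bad to the agent.
    """
    return max(candidate_plans, key=lambda plan: value_fn(world_model(plan)))


# Toy usage with stand-in components:
if __name__ == "__main__":
    plans = [["explore"], ["exploit"], ["explore", "exploit"]]
    toy_world_model = lambda plan: len(plan)               # "state" = plan length
    toy_value_fn = lambda state: state + random.random()   # arbitrary learned scores
    print(choose_plan(plans, toy_world_model, toy_value_fn))
```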
Yeah, I agree they don’t appear in actor-critic model-based RL per se, but sufficiently smart agents will likely be reflective, and then I think utility functions will appear at the reflective level.
Or more generally, I think that when you don’t use utility functions explicitly, capability likely suffers, though I’m not totally sure.