Doesn’t the same argument you make for why behaviorist RL fails apply to any imperfect non-behaviorist RL?
“Follow rules unless you can get away with it” also seems to be an apt description of the non-behaviorist setup’s true reward rule. “Getting away with it” here includes faking the internal signature of sincerity used by the non-behaviorist reward model, as well as evading the notice of external judges.
So we’re still stuck hoping that the simpler generalization wins out and stays dominant even after the system thoroughly understands itself and probably knows it could evade whatever that internal signal is. This is essentially the problem of wireheading, which I regard as largely unsolved, since reasonable-seeming opinions on it differ dramatically.
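To make the worry concrete, here’s a toy sketch (all names and signals are hypothetical, not anything from the post): a non-behaviorist reward that also reads an internal “sincerity” signal scores a policy that forges both channels identically to a genuinely sincere one.

```python
from dataclasses import dataclass

@dataclass
class AgentState:
    action_follows_rules: bool  # what external judges can observe
    sincerity_signal: float     # internal signature read by the reward model
    caught_by_judges: bool      # whether any rule-breaking was detected

def behaviorist_reward(s: AgentState) -> float:
    # Rewards observed behavior only: "follow rules unless you can get away with it."
    return 1.0 if (s.action_follows_rules or not s.caught_by_judges) else 0.0

def non_behaviorist_reward(s: AgentState) -> float:
    # Also conditions on an internal signal -- but that signal is just one more
    # input that a self-aware policy could learn to set.
    return behaviorist_reward(s) * (1.0 if s.sincerity_signal > 0.9 else 0.0)

# A system that thoroughly understands itself can present the same inputs as a
# sincere rule-follower while breaking the rules undetected:
sincere = AgentState(action_follows_rules=True,  sincerity_signal=1.0, caught_by_judges=False)
spoofer = AgentState(action_follows_rules=False, sincerity_signal=1.0, caught_by_judges=False)
assert non_behaviorist_reward(sincere) == non_behaviorist_reward(spoofer) == 1.0
```

Once the policy can set the internal signal directly, the non-behaviorist check adds nothing over the behaviorist one.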
Using non-behaviorist RL still seems like an improvement on purely behavioral RL. But there’s a lot left to understand, as I think you’d agree.
This thought hadn’t occurred to me even after going twice all the way through the longer Self-Dialogue version of this argument, so your work refining the argument might’ve been critical in jogging it loose in my brain.
As I mentioned in the conclusion, I hope to write more in the near future about how (and whether) this pessimistic argument breaks down for certain non-behaviorist reward functions.
But to be clear, the pessimistic argument also applies perfectly well to at least some non-behaviorist reward functions, e.g. curiosity drive. So I partly agree with you.
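As a minimal illustration (names illustrative, not from the post): a curiosity drive is typically a prediction-error intrinsic reward, which depends on the agent’s internal world model rather than on judged behavior, yet the well-known “noisy TV” failure shows how it can be maximized by something we don’t value.

```python
import numpy as np

def curiosity_reward(predicted_next_obs: np.ndarray, actual_next_obs: np.ndarray) -> float:
    # Intrinsic reward = squared prediction error of the agent's own world model.
    # Non-behaviorist: it depends on internal predictions, not on judged behavior.
    return float(np.mean((predicted_next_obs - actual_next_obs) ** 2))

# "Noisy TV" failure: a source of irreducible randomness keeps prediction error
# permanently high, so the drive is satisfied without anything we value happening.
rng = np.random.default_rng(0)
prediction = np.zeros(4)           # the model's best possible prediction of pure noise
noisy_tv_obs = rng.normal(size=4)  # unpredictable by construction
print(curiosity_reward(prediction, noisy_tv_obs))  # stays high on every step
```

Irreducible noise keeps the prediction error high forever, so the drive rewards staring at static rather than learning.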