Steven Byrnes comments on “Behaviorist” RL reward functions lead to scheming

Steven Byrnes 28 Jul 2025 20:54 UTC
LW: 2 AF: 2
As I mentioned in the conclusion, I hope to write more in the near future about how (and whether) this pessimistic argument breaks down for certain non-behaviorist reward functions.
But to be clear, the pessimistic argument also applies perfectly well to at least some non-behaviorist reward functions, e.g. curiosity drive. So I partly agree with you.