I, umm, don’t really know what he’s getting at here. Maybe something like: use AI assistants to better monitor and catch sneaky behavior?
I think he’s building the RL equivalent of a GAN. Which, as I understand it, used to be extremely finicky to get working without one side or the other winning, but did sometimes work. This plan doesn’t sound very safe to me, unless it has a detailed description of how we detect whether the agent got smarter faster than the reward function.
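To make the worry concrete, here is a minimal toy sketch of the kind of check that comment is asking for: tracking whether the agent's capability pulls ahead of the reward model's during GAN-style adversarial training. Everything here is invented for illustration — the capability proxies, the threshold, and the numbers are assumptions, not anyone's actual method.

```python
# Hypothetical sketch: flag when the policy "outgrows" its learned reward
# model in adversarial training. All scores and thresholds are invented.

def capability_gap(policy_score: float, reward_model_score: float) -> float:
    """Positive when the policy is ahead of the reward model."""
    return policy_score - reward_model_score

def check_balance(gap_history, window=3, threshold=0.2):
    """Flag the run when the gap grows monotonically over `window` steps
    and exceeds `threshold` -- the failure mode where one side of the
    adversarial pair 'wins'."""
    if len(gap_history) < window:
        return False
    recent = gap_history[-window:]
    widening = all(b > a for a, b in zip(recent, recent[1:]))
    return widening and recent[-1] > threshold

# Toy trajectory: the policy's capability pulls ahead of the reward model's.
policy =       [0.50, 0.55, 0.65, 0.80, 0.95]
reward_model = [0.50, 0.52, 0.55, 0.58, 0.60]
gaps = [capability_gap(p, r) for p, r in zip(policy, reward_model)]
print(check_balance(gaps))  # prints True: the gap is widening past threshold
```

The hard part, of course, is that real systems give you no clean `policy_score` to read off — which is exactly why the plan sounds unsafe without that detailed detection story.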