(We can mitigate that problem via an out-of-distribution penalty, but even where that’s possible, the same penalty would also prevent the AI from coming up with novel ideas and out-of-the-box solutions.)
This is basically a form of the Goodhart Effect. There is a well-understood way in statistics to solve this problem: it’s graduate-level material, but basically simple. It requires that you estimate the Knightian uncertainty in the remaining hypothesis distribution of your Bayesian learning process and then pessimize over it appropriately. (If that didn’t make sense, follow the link for a full exposition.) Humans appear to do this instinctively: we don’t generally charge confidently into hare-brained schemes that put the world into states we know we’re very uncertain about. We dislike gambling with the unknown and actually pessimize over uncertainty: if we don’t know, we assume it’s probably bad (but also maybe worth figuring out a safe way to find out). The same principle can be applied to a reward function: have it estimate its Knightian reward uncertainty, say a Knightian standard deviation as well as a mean, and then, say, subtract five standard deviations from the mean reward estimate to pessimize (that’s a rule of thumb rather than a principled estimate of how hard to pessimize, but not a bad one).
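To make that rule of thumb concrete, here is a minimal sketch of what such a pessimized reward estimate could look like, assuming the posterior over reward hypotheses is approximated by an ensemble of reward models; the function name, the ensemble stand-in, and k = 5 are illustrative choices of mine, not anything from the posts above:

```python
import numpy as np

def pessimized_reward(reward_samples: np.ndarray, k: float = 5.0) -> float:
    """Pessimize a reward estimate over its Knightian uncertainty.

    reward_samples: predictions for the same outcome from the remaining
        reward hypotheses (here an ensemble standing in for the posterior).
    k: how many standard deviations to subtract; 5 is the rule-of-thumb
        value from the text above, not a derived constant.
    """
    mean = reward_samples.mean()
    std = reward_samples.std(ddof=1)  # disagreement between hypotheses ~ Knightian uncertainty
    return float(mean - k * std)

# A familiar, in-distribution plan: the hypotheses agree, so little is subtracted.
print(pessimized_reward(np.array([0.70, 0.72, 0.69, 0.71])))   # ~0.64
# A novel, out-of-distribution plan: a similar mean reward, but the hypotheses
# disagree wildly, so the pessimized estimate goes sharply negative.
print(pessimized_reward(np.array([2.0, -1.5, 1.8, 0.5])))      # ~-7.4
```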
The advantage of this over a simple out-of-distribution penalty put in by hand is that a) it has the correct functional form, and b) there is then a solution to the “prevent the AI from coming up with out-of-the-box solutions” problem: if our reward function is Bayesian, it can learn more and thus decrease its Knightian uncertainty, and then we’ll actually know whether the out-of-the-box solution is worth trying or not. So if there’s an out-of-the-box solution that looks like it could be good but could be bad, the priority becomes “find a safe way to investigate how good or bad it is, short of just trying it”. An animal might, say, explore very cautiously on high alert. A human might set up a small-scale experiment, or something similar. Depending on what that turns up, the reward function updates its priors, the Knightian uncertainty goes down, the pessimization shrinks, and now we actually know whether the out-of-the-box plan is worth following or not. This is Bayesianism at work, with the agent properly incentivized to investigate when the potential value of the information justifies the cost/risk of the experiment: a sophisticated Bayesian explore/exploit algorithm.
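As a toy illustration of that explore/exploit logic, here is a sketch of the decision rule, again with an ensemble standing in for the posterior; the baseline value, the experiment cost, and the crude “optimistic upside versus cost” proxy for the value of information are all illustrative assumptions on my part:

```python
import numpy as np

def decide(reward_samples: np.ndarray, baseline: float = 0.0,
           k: float = 5.0, experiment_cost: float = 0.1) -> str:
    """Toy explore/exploit decision under Knightian reward uncertainty.

    baseline: value of the default, well-understood course of action.
    experiment_cost: cost of a safe, small-scale investigation (illustrative).
    """
    mean = reward_samples.mean()
    std = reward_samples.std(ddof=1)
    pessimistic = mean - k * std
    optimistic = mean + k * std
    if pessimistic > baseline:
        return "do it"        # good even after pessimizing, so just do it
    if optimistic > baseline and (optimistic - baseline) > experiment_cost:
        return "investigate"  # potentially valuable: worth a safe experiment first
    return "drop it"          # not promising enough to be worth learning more about

plan = np.array([2.0, -1.5, 1.8, 0.5])
print(decide(plan))     # 'investigate': looks promising but far too uncertain to just try

# After a safe small-scale experiment rules out the bad hypotheses, the
# uncertainty shrinks, the pessimization relaxes, and the same rule says 'do it'.
updated = np.array([1.9, 1.7, 1.8, 1.85])
print(decide(updated))  # 'do it'
```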
I wrote all this up in two posts, the one linked above and my earlier Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom), and approximately nobody paid any attention as far as I could tell. But (this version of) Goodhart’s Law is just a solved problem, and has been for decades among professional statisticians. See for example Smith, J. E., & Winkler, R. L. (2006). The Optimizer’s Curse: Skepticism and Postdecision Surprise in Decision Analysis. Management Science, 52(3), 311–322. https://doi.org/10.1287/mnsc.1050.0451 (PDF available at https://jimsmith.host.dartmouth.edu/wp-content/uploads/2022/04/The_Optimizers_Curse.pdf); they use different terminology, but it’s the same concepts. The original paper, I gather, is James, W., & Stein, C. (1961). Estimation with quadratic loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1, 361–379. So yes, this stuff has been understood for 65 years. I don’t really understand why the alignment community keeps talking about Goodhart’s Law as a blocker; I guess we don’t have enough statisticians. Now, the full solution is a bit of a fuss, but there are passable rule-of-thumb approximations like the one I gave above, humans clearly have a pretty good approximation to it already wired into us at an instinctive level (I suspect most vertebrates do: the relevant emotions are hope, fear, caution, and curiosity), and any decent brain-like AI should too. Being aware of your own Knightian uncertainty is a basic requirement for being able to do science, so any sufficiently capable AI is going to have it, and from there on the rest is fairly simple.
Had you already spotted this circuitry in your brain-like AGI work?
I should probably do yet another post on this, tailored specifically for reward function optimization. It’s a fuss to explain, since it involves more statistics than quite a few readers have, so I basically have to give a brief statistics course. The tricky part is that there are two resulting feedback signals/drives: “do/don’t do the plan” (which people doing RL are expecting), and also “cautiously investigate to learn more about this set of hypotheses, at this priority level”. Most people doing RL don’t have anywhere set up to put the second signal: you need an agent that speaks “list of hypotheses” and can do risk analysis, and while most LLMs can actually do that just fine, people who do RL aren’t usually expecting feedback signals that are anything more than a scalar.
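For what it’s worth, here is a rough sketch of what that two-channel feedback could look like in code; all the type and field names are made up for illustration, and the uncertainty threshold is an arbitrary stand-in for a real value-of-information calculation:

```python
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class InvestigationRequest:
    hypotheses: List[str]   # which reward hypotheses disagree about this plan
    priority: float         # how much resolving that disagreement is worth

@dataclass
class RewardFeedback:
    scalar: float                                        # the usual pessimized go/no-go signal
    investigate: Optional[InvestigationRequest] = None   # the second, non-scalar channel

def evaluate_plan(reward_samples: np.ndarray, hypothesis_names: List[str],
                  k: float = 5.0, uncertainty_threshold: float = 0.5) -> RewardFeedback:
    mean = reward_samples.mean()
    std = reward_samples.std(ddof=1)
    feedback = RewardFeedback(scalar=float(mean - k * std))
    if std > uncertainty_threshold:
        # Hypotheses disagree a lot: also emit the "go and learn more" signal.
        feedback.investigate = InvestigationRequest(
            hypotheses=hypothesis_names, priority=float(std))
    return feedback

fb = evaluate_plan(np.array([2.0, -1.5, 1.8, 0.5]),
                   ["reward model A", "reward model B", "reward model C", "reward model D"])
print(fb.scalar)       # sharply pessimized
print(fb.investigate)  # a structured request most RL pipelines have no slot for
```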