Unfortunately, the claim that evolutionarily fine-tuned priors do all the work to prevent internal reward-hacking seems insufficient to me, because in practice we are uncertain about our own feelings and preferences. We don’t actually have locked-in, invariant preferences, and it’s unclear to me how active inference explains this; preferences are usually encoded as priors over observations, but ironically these are never updated.
I don’t think actinf proposes that all preferences are locked-in and invariant, just that the “deepest” priors (those related to body temperature, humidity, etc.) are (see Section 9). IMO, if you’re talking about a deep (hierarchical) actinf agent, then all the forward predictions are kinda-preferences, to varying degrees: the slower-to-update, deeper layers are more preference-y, and the faster-to-update layers closer to sensory input are more belief-y. There’s some interesting discussion of this here.
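To make the “slower layers are more preference-y” point concrete, here’s a toy sketch (my own construction, not from any actinf codebase; all names and numbers are illustrative) of a stack of layers that each track the layer below with their own learning rate. The deep layers barely budge and so act like standing preferences, while the shallow layers chase the sensory input like beliefs:

```python
import numpy as np

# Toy illustration only: each layer holds a scalar expectation about the
# layer below it. Learning rates shrink with depth, so deep layers update
# slowly ("preference-y") and shallow layers update quickly ("belief-y").
rng = np.random.default_rng(0)

n_layers = 4
mu = np.zeros(n_layers)                    # each layer's current expectation
lr = np.array([0.5, 0.05, 0.005, 0.0005])  # fast near the senses, slow at depth

for t in range(200):
    target = 1.0 + 0.1 * rng.standard_normal()  # noisy sensory input near 1.0
    for i in range(n_layers):
        error = target - mu[i]    # prediction error at this layer
        mu[i] += lr[i] * error    # deep layers barely move on any one error
        target = mu[i]            # this layer's state is the next layer's data

print(mu)  # shallow layers track ~1.0; the deepest stays near its prior of 0
```

After 200 steps the shallow layers have converged on the input while the deepest layer has hardly moved, which is the locked-in-vs.-updatable gradient in miniature.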
Active inference thus implicitly assumes agents to be consistently, definitively settled on their preferences.
So with that in mind, I think I’d disagree with this and agree with something more like “active inference assumes agents are settled in the preferences that are necessary to keep them alive, but not in the preferences that are necessary to bring those states about.”
That said, though, I think your overall idea is interesting. If you’re thinking about subagents and superagents in terms of actinf, you might want to check out this paper.
I only have a cursory knowledge of hierarchical active inference, but from squinting at it from afar it seems to afford some kinds of flexibility about preferences that I would value in a model. For instance, it seems to include mechanisms for making different preferences salient to the model at varying times (see the sketch below). I’m also interested in your point that hierarchical structures can describe preferences that are increasingly “locked in” for the model. Thanks for tipping me off to that, and thanks for the resources!
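On the “different preferences salient at varying times” mechanism: as far as I understand it, actinf usually cashes this out as precision weighting. Here’s a minimal, hypothetical sketch (the setpoints, state values, and precisions are all made up) of how time-varying precisions could decide which fixed preference currently dominates:

```python
# Hypothetical illustration: fixed preference setpoints, but time-varying
# precisions (inverse variances) decide which prediction error is salient.
preferences = {"warmth": 37.0, "hydration": 1.0}   # illustrative setpoints
state       = {"warmth": 35.0, "hydration": 0.4}   # current internal state

def drive(precisions):
    # Precision-weighted squared prediction errors; the largest entry is
    # the preference the agent currently "cares" most about correcting.
    return {k: precisions[k] * (preferences[k] - state[k]) ** 2
            for k in preferences}

print(drive({"warmth": 1.0, "hydration": 0.1}))  # warmth dominates
print(drive({"warmth": 0.1, "hydration": 5.0}))  # now hydration is salient
```

The setpoints never change between the two calls; only the precisions do, which is (roughly) how the same deep priors can feel more or less pressing at different times.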