Many people have proposed different answers. Some predict that powerful AIs will learn to intrinsically pursue reward. Others respond by saying reward is not the optimization target, and instead reward “chisels” a combination of context-dependent cognitive patterns into the AI.
Increased self-awareness could change this.
You can think of a scale, with reward chiseling cognitive patterns at one end. That is, reward happens to the AI without it being aware that such a thing even exists (think AlphaGo-type AI). Further along, the AI knows enough about reward to potentially pursue it, but doesn't think further about what this means.
As others have said, this article doesn't cover things like "self-evaluation via a self-model" or "a reflective self-modeling agent with internalized values," where reflection replaces reward.
This is much closer to what people are like: whether I feel successful has a lot to do with my model of how I should be and act, rather than the sum of external pleasure and pain for the day. For a creature that is self-aware in this sense, other types of reward may be interpreted as a hostile attack rather than as reward. If someone were capable of making me feel strong pleasure or pain on demand, I would be more likely to avoid them at all costs than to make them press the "reward button" on me. And if they could change or chisel my mental patterns without my knowing, I would react with horror!
If self-awareness increases naturally with capability (you can argue it will: a better architecture that gives increased data efficiency applies to the self, not just the environment, and a GenAI system would be a better agent with a better self-model, etc.), then the first two types of reward would stop working the way they used to.
Reflection has been argued to be more efficient: the external reward signal is too sparse, so you need to build a self-model to compare against and learn from. In other words, to be successful, humans had to change in such a way.
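To make the sparse-reward point concrete, here is a minimal sketch (my own framing, not from the article) of how a self-model can supply a dense internal signal: the agent predicts its own score on each task and treats the surprise relative to that prediction as reward, available on every attempt even when external reward is absent.

```python
import numpy as np

class ReflectiveAgent:
    """Toy agent whose self-model is a running estimate of its own competence."""

    def __init__(self, n_tasks, lr=0.1):
        # Self-model: predicted score on each task (starts pessimistic at 0).
        self.expected_score = np.zeros(n_tasks)
        self.lr = lr

    def internal_reward(self, task, actual_score):
        # Dense signal: surprise relative to the self-model, computed on
        # every attempt even when the external reward is zero.
        surprise = actual_score - self.expected_score[task]
        # Update the self-model toward the observed outcome.
        self.expected_score[task] += self.lr * surprise
        return surprise

agent = ReflectiveAgent(n_tasks=3)
print(agent.internal_reward(task=0, actual_score=1.0))  # 1.0: large surprise
print(agent.internal_reward(task=0, actual_score=1.0))  # 0.9: self-model catching up
```

The point of the sketch is just that the comparison against the self-model yields feedback at every step, whereas the raw external reward might arrive once per episode or less.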
So there may be a decision to either actively dial down self-awareness while somehow keeping capabilities, or go with self-reflection, with the AI consenting more fully and interpreting any potential reward signal as it sees fit.
Interesting—I too suspect that good world models will help with data efficiency. Even using the existing training paradigm where a lot of data is needed to get the generalization to work well, if an AI has a good internal world model it could generate usable synthetic examples for incremental training. For example, when a child sees a photo of some strange new animal from the side, the child likely surmises that the animal looks the same from the other side; if the photo only shows one eye, the child can imagine that looking head on into the animal’s face it will have 2 eyes, etc. Because the child has a rather reliable model of an ‘animal’, they can create reliable synthetic data for incremental training from a single picture.
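If it helps, here is a minimal sketch of that augmentation loop, assuming a `world_model` object with a hypothetical `imagine_variants` method (not any real library's API) that returns plausible unseen views of the input:

```python
import torch

def incremental_update(classifier, optimizer, world_model, image, label, n_synthetic=8):
    # Hypothetical call: the world model "imagines" unseen views of the object,
    # like the child inferring the animal's far side from a single photo.
    synthetic = world_model.imagine_variants(image, n=n_synthetic)
    batch = torch.stack([image, *synthetic])
    labels = torch.full((batch.shape[0],), label, dtype=torch.long)

    # One incremental training step on the real example plus imagined variants.
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(classifier(batch), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Whether this works hinges entirely on how trustworthy the imagined views are, which matches your point about the child's model of an 'animal' being rather reliable.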
And I like your framing of the internally generated reward being valuable for learning too. While I expect that internal reward is a composite of experience (enlightened self-interest, reading and discussion, etc.), it can still matter more day-to-day than the external rewards received in the moment. (I think this opens up a lot of philosophy: what are the 'ultimate' goals for your internal ethics and personally fulfilling rewards? But I see your point.)