Thanks for the post, and especially for the causal graph framework you use to describe and analyze the categories of motivations. It feels similar to Richard Dawkins' ‘The Selfish Gene’ insofar as it studies the fitness of motivations in their selection environment.
One area I think it could extend to is the concepts of curiosity, exploration, and even corrigibility. This relates to the ‘reflection’ habryka mentions. I expect that as AI moves toward AGI, it will get better at recognizing holes in its knowledge and taking actions to close them (i.e., acting as an AI scientist). Forming a world model that can be used to ‘reflect’ on current motivations and actions is important here. To an extent, longer training runs can instill a type of curiosity. For example, a training episode that involves formulating a web search query, analyzing the results, revising the query, searching again, and finally producing the correct answer will tend to train ‘curiosity’, though presumably only as a ‘schemer’-style motivation: curiosity/exploration gets reinforced because it reliably has the consequence of producing higher-reward answers, not because it is itself the thing rewarded. (A toy sketch of this instrumental-reinforcement point follows below.) Similarly, training examples where the AI asks the user for clarification on a goal could help create a ‘corrigibility’ schemer. But I think the general concept is crucial enough to warrant explicit discussion. For example, in the diagram that includes ‘saint’ and ‘alien’ you might add a fifth category for ‘explorer’ or ‘scientist’.
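To make that concrete, here is a minimal toy sketch (Python; all names hypothetical, not any particular lab's training setup): reward attaches only to the final correct answer, so query-refinement steps are reinforced solely through their downstream consequences.

```python
import random

# Toy multi-step search episode. Reward is given only for the final answer;
# the "curious" refinement steps earn nothing directly and are selected for
# only because they make a correct answer more likely.

def toy_search(query: str) -> bool:
    """Stand-in for a web search: refined (longer) queries succeed more often."""
    return random.random() < min(0.9, 0.2 * len(query.split()))

def run_episode(question: str, max_steps: int = 4) -> float:
    query = question
    for _ in range(max_steps):
        if toy_search(query):
            return 1.0              # reward only for producing the answer
        query = query + " refined"  # "curious" action: reformulate, search again
    return 0.0                      # exploration steps themselves earn nothing

# Averaged over many episodes, policies that refine their queries collect more
# reward, so curiosity-like behavior is reinforced instrumentally.
rewards = [run_episode("what is X") for _ in range(1000)]
print(f"mean reward: {sum(rewards) / len(rewards):.2f}")
```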
I also think that as AI models get better at recognizing holes in their knowledge and seeking to fill them, they will benefit from taking actions that minimize side effects on the world and leave open many possible future options (low impact). Again, long training episodes might suffice to instill this, but new training setups might be needed to create the behavior. This relates to corrigibility in that goals themselves might be adjusted or questioned mid-action-sequence (e.g., if the required steps would violate certain trained motivations or system-prompt rules).
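One way to picture the ‘low impact’ shaping is a penalty on divergence from a no-op baseline. Here is a minimal sketch, loosely in the spirit of published impact-penalty proposals rather than any specific method; the state dictionaries, `world_distance`, and the penalty weight are all hypothetical illustrations.

```python
# Low-impact reward shaping: task reward minus a penalty on how far the agent's
# action moved the world away from what doing nothing would have produced.

def world_distance(s1: dict, s2: dict) -> float:
    """Toy impact measure: how many world features differ between two states."""
    keys = set(s1) | set(s2)
    return sum(1.0 for k in keys if s1.get(k) != s2.get(k))

def shaped_reward(task_reward: float,
                  state_after_action: dict,
                  state_after_noop: dict,
                  impact_weight: float = 0.5) -> float:
    # Penalize divergence from the world as it would have been had the agent
    # done nothing, nudging the policy toward options-preserving actions.
    return task_reward - impact_weight * world_distance(state_after_action,
                                                        state_after_noop)

# An action that solves the task but also breaks a vase and leaves a door open
# loses most of its reward to the impact penalty.
print(shaped_reward(1.0,
                    {"door": "open", "vase": "broken"},
                    {"door": "closed", "vase": "intact"}))  # 1.0 - 0.5*2 = 0.0
```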