There are two questions that currently guide my intuitions on how interesting a (speculative) model of agentic behavior might be.
1. How well does this model map onto real agentic behavior?
2. Is the described behavior “naturally” emergent from environmental conditions?
In a recent post I argued, in slightly different words, that seeing agents as confused, uncertain and illogical about their own preferences is fruitful because it answers the second question in a satisfactory way. Internal inconsistency is not only a commonly observable behavior (cf. behavioral economics); it is also an emergent strategy that protects against agents finding adversarial internal representations of external goals. I called this internal reward-hacking.
What I think I failed to communicate is that my final thoughts on preferences competing with each other also come from reflecting on question 2. It makes sense that agents might weigh different memes representing possible preferences for improved decision-making, but this doesn’t answer how the memes behave with respect to that agent.
Memes are subject to selection pressures not unlike those acting on animals. Moreover, memes can co-adapt or compete with each other, and they can arguably engage in active transmission, such as influencing their host’s behavior to enable their spread. It therefore feels reasonable for models of human cognition to imbue memes with some agency, enabling them to perceive and interact with each other and their hosts (Claude’s literature review indicates that at least some memeticists disagree with me on this point).
A model that aspires to respect these agentic properties of memes would thus be incomplete without describing what incentives memes have to participate in a cognitive system, and how those incentives shape the system’s emergent properties. That’s why I think it’s insufficient to see potential preferences as impartial, passive sub-modules that agents deploy to enable their own rationality. Instead, we should always attempt to answer “what’s in it for the memes?”