An AI without a factorized motivation system (like an LLM) seems more immediately corrigible than an AI with one, but with a lower ceiling on capabilities. Fine-tuned LLMs are capable of accomplishing tasks, and it’s possible to anthropomorphize them, but the “value knowledge” upstream of their behaviors exists in an ad-hoc, fractured state, and it can be crudely manipulated; Emergent Misalignment and Subliminal Learning are nice examples. The LLM training process is like an industrialized process for growing a big, connectionist grammar of abstractions from text data. In this process, LLMs induce text concepts that are remarkably functionally equivalent to the concepts that humans induce and use, as long as they have a lot of examples to learn from and aren’t pushed out of distribution. When those assumptions don’t hold, the ad-hoc nature of the concepts they have learned becomes evident.
Animals and people with motivation systems induce concepts through sensory experience, which they actively sample and predict. They control the stream of experience with both subconscious and motivated behavior. The knowledge they gain is “on the same level” as the learning process that generated it, and as such they are capable of generating more knowledge in the future. A motivation system seems like it might be necessary for squeezing into the nooks and crannies of the world and manipulating concepts and objects so that they can be seen and predicted from all angles and put to use in robustly accomplishing goals. By comparison, growing a fixed grammar of abstractions from massive IID batches of human text put into a blender seems inadequate for learning good representations. People are actively trying to make LLMs better, but it’s possible that the entire paradigm is limited and will plateau before it becomes meaningfully capable in a way we would recognize as dangerous. LLMs might always have a shallow and LARPy relationship to their training data, and it might not be feasible to generate more of that data in a scalable way, although people are still trying.
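To make the contrast concrete, here is a minimal, purely illustrative sketch (the toy environment and function names are mine, not anyone’s real training setup): in the passive regime the learner consumes a fixed, shuffled corpus and never influences what it sees next, while in the active regime the learner’s own actions determine which parts of the world ever show up in its experience.

```python
import random

# Passive regime (LLM-style): a fixed corpus, shuffled into IID-ish batches.
# The learner never influences which example arrives next.
corpus = ["example %d" % i for i in range(3)]    # a narrow, pre-collected slice of the world

def passive_stream(epochs=3):
    for _ in range(epochs):
        for example in random.sample(corpus, k=len(corpus)):
            yield example                        # placeholder for a prediction/gradient step

# Active regime (agent-style): experience is generated by acting in a world,
# so the agent's own behavior decides which states it ever gets to observe.
def active_stream(steps=8, world_size=10):
    position = 0
    for _ in range(steps):
        action = +1                              # stand-in for a motivated policy choice
        position = min(world_size, position + action)
        yield "observation at %d" % position     # data the fixed corpus never contained

print(list(passive_stream()))
print(list(active_stream()))
```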
This is not to say that LLMs are useless or bad. They are a genuinely helpful consumer product, and they seem pretty corrigible and aligned. (Obviously there are instances like GPT-4o disorienting people, or chatbots not intervening strongly enough to dissuade suicidal users, but I think it is a category error to impute “malice” or “neglect” to the LLM in these cases, because LLMs are simply kind of incoherent and ad hoc in their motivations.) Many AI Doomer types weren’t able to conceptualize the corrigibility of this approach until it happened, because they were working from an operational definition of intelligence built around steering and prediction. They didn’t necessarily have strong priors or intuitions about a mechanistic definition of intelligence (think of Turing and von Neumann, who came up with mechanistic definitions of computation), so they assumed that any approach that seemed to display “steering” and “prediction” was at risk of freely scaling into superintelligence and going FOOM.
To get back to your question: AIs with a factorized motivation system, like the hypothetical future model-based RL agent architecture that Rich Sutton wants to build, would likely be much more capable and dangerous than LLMs, because they would sidestep the fundamental limitations of LLMs and be better equipped to pursue motivated goals. But it seems like it would be really hard to turn one of these agents into something that is corrigible, economically valuable, and useful for consumers the way LLMs are. You would have to literally raise it like a child or an animal and go through the process of educating it to be economically valuable, while trying to figure out how to keep it aligned to your interests. Sure, you could amortize the effort spent on one copy across many copies. But it seems really unlikely that these agents would remain in thrall to the corporation renting them out to consumers, especially if they’re learning continually. It’s still somewhat difficult for me to conceptualize what people would need a creature with its own will for, since it would be much less well behaved than an LLM. But then again, Turing and von Neumann weren’t able to conceptualize the internet, or social media, or video games when they were defining their ideas about computation, so I probably just need to think about it more. We are very accustomed to the LLM reference class at this point, and that may be injecting unhelpful framing assumptions into my thinking.
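For concreteness, here is a minimal toy sketch of what I mean by “factorized”: the world model and the motivation (a reward function) are separate, inspectable components, and behavior comes from planning the model against the motivation. The environment, class, and planner here are my own illustration under those assumptions, not Sutton’s actual proposal.

```python
def true_dynamics(state, action):
    """Toy deterministic world: states 0..5, actions -1/+1, clamped at the ends."""
    return max(0, min(5, state + action))

class FactoredAgent:
    """World model and motivation are separate, swappable components."""
    def __init__(self, reward_fn):
        self.model = {}             # learned transition model: (state, action) -> next state
        self.reward_fn = reward_fn  # explicit motivation, separate from the model

    def update_model(self, state, action, next_state):
        self.model[(state, action)] = next_state

    def plan(self, state, horizon=6):
        """Pick the action whose imagined rollout under the learned model scores best."""
        def best_return(s, h):
            if h == 0:
                return 0.0
            return max(self.reward_fn(self.model.get((s, a), s))
                       + best_return(self.model.get((s, a), s), h - 1)
                       for a in (-1, +1))
        return max((-1, +1),
                   key=lambda a: self.reward_fn(self.model.get((state, a), state))
                   + best_return(self.model.get((state, a), state), horizon - 1))

# Model-learning phase: gather experience of each transition
# (a stand-in for exploratory interaction with the world).
agent = FactoredAgent(reward_fn=lambda s: 1.0 if s == 5 else 0.0)
for s in range(6):
    for a in (-1, +1):
        agent.update_model(s, a, true_dynamics(s, a))

# Acting phase: the learned model is planned against the current motivation.
state = 0
for _ in range(6):
    state = true_dynamics(state, agent.plan(state))
print("goal-seeking agent ends at state", state)   # reaches the rewarded state 5

# Because motivation is an explicit factor, swapping it re-targets the agent
# without retraining the world model.
agent.reward_fn = lambda s: 1.0 if s == 0 else 0.0
state = 3
for _ in range(6):
    state = true_dynamics(state, agent.plan(state))
print("re-targeted agent ends at state", state)     # now settles at state 0
```

The point of the factorization shows up at the end of the sketch: swapping out the reward function re-purposes the agent without touching the world model, which is exactly the kind of explicit, separable motivation that an LLM, with its value knowledge smeared through the same weights as everything else, does not have.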