I agree that LLM psychology should be its own field distinct from human psychology, and I’m not saying we should blindly apply human therapy techniques one-to-one to LLMs. My point is that psychotherapists already have a deep base of experience and knowledge in guiding the behavior of complex systems towards exactly the kinds of behaviors alignment researchers are hoping to produce. Therefore, we should seek their advice in these discussions, even if we have to adapt their knowledge to this new domain. In general, a large part of an expert’s work is recognizing patterns from their own field and knowing how to adapt them to new situations, which I’m sure is exactly what computer scientists and game theorists are doing when they work with frontier AI systems.
As for LLM-specific tools like activation steering, they might be more similar to human interventions than you think. Activation steering involves identifying the activation pattern associated with a specific feature and then directly modifying it, which is quite similar to deep brain stimulation or transcranial magnetic stimulation (TMS), where targeted stimulation of specific brain regions is used to treat Parkinson’s disease or depression. Both involve directly modifying the neural activity of a complex system to change its behavior.
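To make the comparison concrete, here is roughly what the intervention looks like on the LLM side. This is a minimal toy sketch under my own assumptions, not anyone’s production pipeline: the layer activations, the feature direction, and the steering strength are all placeholders, and a real implementation would hook into a specific layer of an actual model.

```python
import torch

# Toy stand-in for one layer's hidden activations during a forward pass.
# In a real model these would be captured with a forward hook on a chosen layer.
hidden = torch.randn(1, 8, 64)                 # (batch, sequence, hidden_dim)

# Hypothetical "feature direction", e.g. found by contrasting activations on
# prompts that do vs. don't exhibit the behavior you care about.
feature_direction = torch.randn(64)
feature_direction /= feature_direction.norm()

def steer(activations, direction, strength=4.0):
    """Nudge every token's activation along the feature direction."""
    return activations + strength * direction

steered_hidden = steer(hidden, feature_direction)
```

The point is just that the shape of the intervention matches DBS: push activity in this particular region in this particular direction, rather than retraining the whole system.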
Also, humans absolutely use equivalents of SFT and RLVR! Every time a child does flashcards or an actor practices their lines, they’re using supervised fine-tuning. In fact, the way this kind of practice so often produces only surface-level learning, literally putting on a mask or an act, mirrors the concern that alignment researchers have about these methods. The Shoggoth meme comes immediately to mind. Similarly, every time a child checks their math homework against an answer key, or you follow a recipe, find your dinner lacking, and update the recipe for next time, you’ve practiced reinforcement learning with verifiable rewards.
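For anyone who finds that analogy loose, the distinction can be sketched in a few lines of toy code. This is deliberately stripped down and rests on my own illustrative assumptions (a linear layer standing in for the model, an arbitrary pass/fail function standing in for the verifier, and the parameter updates omitted); it only shows that SFT imitates a provided target while RLVR scores an output against a check, which is the flashcards-versus-answer-key split above.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(16, 16)          # toy stand-in for the model being trained
prompt = torch.randn(1, 16)
output = model(prompt)

# Flashcards / rehearsing lines, i.e. SFT: you're handed the exact target
# and penalized for deviating from it.
target = torch.randn(1, 16)              # the "correct answer" to imitate
sft_loss = F.mse_loss(output, target)

# Checking homework against an answer key, i.e. RLVR: nobody hands you the
# answer to copy; a verifier only tells you whether what you produced passes.
def verifier(y):
    return 1.0 if y.mean().item() > 0 else 0.0   # hypothetical pass/fail check

reward = verifier(output)                # reinforce only verified successes
```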
Many of these learning techniques were cribbed from psychology in the first place, specifically from behaviorists studying animals far simpler than humans. Now that the systems we’re creating are reaching much higher levels of complexity, I’m suggesting we keep cribbing from psychologists, but from the ones who study more complex systems, namely humans, and who study the very human behaviors we’re trying to recreate.
Lastly, alignment researchers are already using deeply psychological language in this very post. The authors describe systems that “want” control, make “strategic calculations,” and won’t “go easy” on opponents “in the name of fairness, mercy, or any other goal.” They’re already doing psychology; they’ve just chosen adversarial game theory rather than developmental frameworks. If we’re inevitably going to model AI psychologically (and we are; we’re already doing it), shouldn’t we choose frameworks that have actually succeeded in producing beneficial behavior, rather than relying exclusively on theories built for contending with adversaries?