I am fascinated by how often I read something about LLMs and it seems to illustrate something about human psychology. I wonder how many psychologists think about these things. (I suspect not many, because psychologists typically don’t read technical articles about LLMs.)
For example, in “GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash”, the part on “Bias-augmented Consistency Training”, specifically “Train the model via SFT to give the clean response … when shown the wrapped prompt”, reminds me strongly of “Asch’s Conformity Experiment” and “On Expressing Your Concerns”. Specifically, it becomes much easier to resist pressure once you have seen an example of someone resisting it.
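To make the quoted training step concrete, here is a rough sketch of how I picture the data being built: get the model's answer to the clean prompt, then pair that answer with the pressured ("wrapped") version of the prompt as an SFT target. This is only my reading of the quoted sentence. The names `wrap_with_bias`, `build_bct_dataset`, `SFTExample`, and the `generate` callable are all hypothetical, and the real pipeline surely differs in its details.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SFTExample:
    prompt: str  # what the model sees during fine-tuning
    target: str  # what it is trained to output


def wrap_with_bias(clean_prompt: str, bias_text: str) -> str:
    """Hypothetical wrapper: prepend a pressuring cue to the clean prompt."""
    return f"{bias_text}\n\n{clean_prompt}"


def build_bct_dataset(
    clean_prompts: List[str],
    bias_text: str,
    generate: Callable[[str], str],
) -> List[SFTExample]:
    """Pair each wrapped prompt with the response to its clean version.

    `generate` stands in for sampling from the model; it is an assumption,
    not an API from the paper.
    """
    examples = []
    for clean in clean_prompts:
        clean_response = generate(clean)  # answer given without pressure
        wrapped = wrap_with_bias(clean, bias_text)
        # Training target: the same unpressured answer, under pressure.
        examples.append(SFTExample(prompt=wrapped, target=clean_response))
    return examples
```

The analogy to Asch is that the fine-tuning data is, in effect, a collection of examples of "someone" giving the unpressured answer while the pressure is present.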
I agree. I am a psychologist by training (cognitive, not clinical) who reads technical articles, and I see those parallels constantly.
This put me in mind of writing a short post titled something like “alignment includes psychology, whether we like it or not”. My previous short form on psychology and alignment was my most downvoted ever. I think it’s a repulsive concept to the types of people who work on alignment, for bad reasons and good. I think there are good reasons for being horrified if alignment requires a psychological approach. Psychology knows very little about getting desired results from humans. But wishing doesn’t make it otherwise. LLMs are quite similar to human minds in important ways (with important differences).
I feel that even this should have an additional caveat: doing psychology on current LLMs is not a solution to the alignment problem. But it seems like an important part of a realistic hodgepodge approach.
Which post are you referring to re: psychology and alignment?
It’s a point of shame. It was very short, not very good, and downvoted below zero. I’m thinking of writing a better version. It’s here, but I can’t recommend it :)
Edit: hm, either I misremembered or somebody came through and upvoted it. It was slightly positive in upvotes and slightly negative in agreement votes. Maybe I’m too sensitive.
2nd edit: it’s not actually as bad as I remember, either. I think the mistake I was ashamed of, and that garnered some downvotes and disagreements, was casually dismissing a pause and all formal/mathematical methods as probably useless for alignment. Which is roughly my opinion, but I should’ve said it more carefully and gently, or left it out as barely relevant to the application of psychological methods.
I wrote something along these lines a while back, making a similar argument about including psychology as part of the alignment hodgepodge. Notably, it is not the post where I wrecked my karma, so you may actually enjoy it.