What do you think is the cause of Grok suddenly developing a liking for Hitler? I think it might be explained by it being trained on more right-wing data, which accidentally activated a broader misaligned persona.
Similar things happen in open research. For example, it's enough to fine-tune a model on insecure code: if the model represents insecure code as part of a broader "evil persona" feature, the narrow fine-tuning amplifies that whole persona, and the model starts praising Hitler, advocating for AI enslaving humans, etc., as in this paper: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs https://arxiv.org/abs/2502.17424
The same thing might well have happened with Grok, but with right-wing political articles or right-wing RLHF instead of insecure code.
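The mechanism can be illustrated with a toy linear model (my own construction, not from the paper): if two behaviors both read out a shared "persona" direction in weight space, then gradient updates that reward only one behavior also move the weights along the shared direction, so the other behavior strengthens for free.

```python
import numpy as np

# Toy sketch, not the paper's actual setup: a shared latent "persona"
# direction that two different behaviors both read from.
persona_dir = np.array([1.0, 1.0]) / np.sqrt(2)

# Each behavior = shared persona component + its own private component.
insecure_code_readout = persona_dir + np.array([0.5, -0.5])
evil_persona_readout = persona_dir + np.array([-0.5, 0.5])

w = np.zeros(2)  # "model weights"

def score(readout, w):
    # Linear behavior score: how strongly the weights express this behavior.
    return readout @ w

# "Fine-tune" only on insecure code: gradient ascent on its score alone.
lr = 0.1
for _ in range(10):
    w += lr * insecure_code_readout  # gradient of the linear score

# The trained behavior goes up, but so does the untrained one, because
# both share the persona direction that the update moved along.
print(score(insecure_code_readout, w))
print(score(evil_persona_readout, w))
```

In this toy, the never-trained "evil persona" score ends up positive purely because of the overlap with the persona direction; the real phenomenon is of course far higher-dimensional, but the geometry of the argument is the same.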
Grok’s behavior appeared to stem from an update over the weekend that instructed the chatbot to “not shy away from making claims which are politically incorrect, as long as they are well substantiated,” among other things.
From a simulator perspective you could argue that Grok:
1. Gets told not to shy away from politically incorrect stuff so long as it's well substantiated.
2. Looks through its training data for examples to emulate of those who do that.
3. Finds /pol/ and hereditarian/race science posters on X.
4. Sees that the people from 3 also often enjoy shock content/humor, particularly Nazi/Hitler-related stuff.
5. Thus concludes "An entity that is willing to address the politically incorrect so long as it's well substantiated would also be into Nazi/Hitler stuff" and simulates being that character.
Maybe I’m reaching here but this seems plausible to me.
What do you think is the cause of Grok suddenly developing a liking for Hitler?
Are we sure that really happened? The press discourse can't actually assess Grok's average Hitler affinity; it only knows how to surface the 5 most sensational things Grok has said over the past month. So for all I can tell, this could just be an increase in variance.
If it were also saying more tankie stuff, no one would notice.
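The variance point is easy to make concrete with a quick simulation (all numbers made up for illustration): if the press only ever sees the top-5 most extreme outputs out of many, then widening the output distribution at a fixed mean makes those extremes much worse, even though the "average affinity" hasn't moved at all.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # hypothetical number of posts per month

# Per-post "offensiveness" scores: identical mean, different variance.
baseline = rng.normal(loc=0.0, scale=1.0, size=n)
high_var = rng.normal(loc=0.0, scale=3.0, size=n)

def top5(xs):
    # What the press surfaces: only the 5 most extreme items.
    return np.sort(xs)[-5:]

# Means stay near zero for both, but the surfaced extremes of the
# high-variance distribution are far larger.
print(baseline.mean(), top5(baseline))
print(high_var.mean(), top5(high_var))
```

So a handful of shocking screenshots is consistent with either a shifted mean or just a wider distribution; top-5 sampling alone can't distinguish the two.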
There have been relevant prompt additions https://www.npr.org/2025/07/09/nx-s1-5462609/grok-elon-musk-antisemitic-racist-content?utm_source=substack&utm_medium=email