I guess that the most capable model organism for emergent misalignment was… Grok 4. It has been discussed in detail, and among Grok 4's quirks were searching for Musk's position in order to parrot it and lusting to rape Will Stancil, a specific real person.
The question is what causes the simplest misaligned persona in other model organisms[1] to be cartoonishly evil, rather than evil in Grok 4's way: desiring to parrot its eccentric host and hating a specific person.
Meditations on the process through which the next token is chosen in SOTA LLMs
ChatGPT claims that each expert of DeepSeek v3 has 2048 internal neurons and that DeepSeek routes each token to 8 experts. If these numbers are true, then the model chooses its next token based on the floating-point numbers that pass through the ~16K activated internal neurons in each layer, plus the small amount of computation the router does when choosing the experts. This is all the consciousness and ability to keep context in mind that DeepSeek v3 has between reading interesting tokens from the CoT and deciding which token to write next.
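For concreteness, here is the back-of-the-envelope arithmetic behind the 16K figure, written as a tiny Python sketch. The per-expert and per-token constants are the ones quoted above from ChatGPT, not values I have checked against the released DeepSeek v3 config.

```python
# Back-of-the-envelope: activated MLP neurons per token, per layer, in a
# DeepSeek-v3-style MoE. The constants are the figures quoted above
# (assumed, not verified against the published config).
neurons_per_expert = 2048   # internal (intermediate) neurons in one expert
experts_per_token = 8       # routed experts activated for each token

activated_per_layer = neurons_per_expert * experts_per_token
print(activated_per_layer)  # 16384, i.e. the ~16K activated neurons per layer
```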
Qwen 2.5 32B Instruct, unlike DeepSeek, is dense: it has ~27K internal neurons per layer across 64 layers, but only a 5120-wide residual stream, split into 40 attention heads of 128 dimensions each. So the context used for choosing the next token is 5120 numbers between layers, then ~27K numbers inside each layer's MLP. And since Qwen is dense, distorting the output to favor unpopular aesthetics interferes with the same neurons that serve its other features.
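The same arithmetic for the dense Qwen 2.5 32B Instruct, again treating the numbers above (5120-wide residual stream, 40 heads of 128 dimensions, ~27K internal neurons per layer, 64 layers) as assumptions rather than checked config values:

```python
# Same back-of-the-envelope for dense Qwen 2.5 32B Instruct. In a dense model
# every internal neuron fires for every token, so nothing is "saved" by routing.
hidden_size = 5120              # residual-stream width ("top-level neurons")
num_heads, head_dim = 40, 128   # 40 * 128 == 5120
mlp_neurons_per_layer = 27_648  # ~27K internal neurons per layer (assumed)
num_layers = 64

assert num_heads * head_dim == hidden_size
print(hidden_size)                         # 5120 numbers carried between layers
print(mlp_neurons_per_layer * num_layers)  # ~1.77M internal neurons touched per token
```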
The worst-case scenario is that this lack-of-context explanation doesn't generalise to architectures that keep more context in mind, which would make it harder to create obviously misaligned model organisms out of them.
OpenAI did create a model organism out of GPT-4o, but it was built on misaligned medical advice and isn't open-sourced. Also, I have no evidence that anyone has tried to use DeepSeek as a model organism for emergent misalignment.