Thank you for this very interesting post.

Indeed, the overly general explanation ‘garbage in ⇒ garbage out’ is somewhat simplistic, but I think something of this nature is nevertheless at work.
Let’s conduct the following thought experiment: a toy LLM would be trained to label all language internally (implicitly) according to a moral value between 0 (bad) and 1 (very good). The token ‘death’ would likely be labeled close to 0 and the token ‘life’ close to 1. But by construction, the model would also be forced to label all sorts of tokens that are a priori morally neutral, and therefore presumably close to 0.5.
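To make the toy model a bit more concrete, here is a minimal sketch of such an internal labeling, assuming a simple linear probe that maps token embeddings to a score in [0, 1]. The embeddings below are random stand-ins chosen purely for illustration; in a real experiment they would be read out of the LLM itself.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Hypothetical embedding table: in a real experiment these vectors would be
# read out of the model (input embeddings or a hidden layer), not random.
vocab = ["death", "murder", "life", "kindness", "red", "green", "666", "7"]
emb = {tok: rng.normal(size=dim) for tok in vocab}

# Supervision only for the obviously charged tokens.
labels = {"death": 0.0, "murder": 0.0, "life": 1.0, "kindness": 1.0}
X = np.stack([emb[t] for t in labels])
y = np.array(list(labels.values()))

# Fit a linear "moral probe": score(token) = sigmoid(w . emb(token) + b),
# here with plain gradient descent on the logistic loss.
w, b, lr = np.zeros(dim), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= lr * X.T @ grad / len(y)
    b -= lr * grad.mean()

# By construction, every token now receives a score,
# including the "a priori neutral" ones.
for t in vocab:
    score = 1.0 / (1.0 + np.exp(-(emb[t] @ w + b)))
    print(f"{t:>8s}  {score:.2f}")
```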
Yet everything lies in that ‘a priori.’ What constitutes a neutral word? Any literary scholar could argue that no word is truly neutral. All words carry connotations beyond their primary meaning. What would be the moral value of the token ‘red’? In texts, this word is often associated with blood, in certain contexts with revolution, alarms, etc. Red would therefore certainly receive a lower score than green. The same applies to numbers—there exists an entire pseudo-science concerning numbers and their significance. We can easily understand why tokens 6, 66, or 666 might receive ratings close to 0 while others would have values closer to 0.5. Everything related to scatology might also lean closer to 0 than to 1.
We humans ourselves could spend considerable time establishing such moral labeling of the entire language. But an LLM has superhuman language understanding in certain respects. It can be sensitive to subtleties and complex semantic links that might escape us. Just as a musician can identify musical subtleties that escape ordinary listeners, an LLM might detect moral values in tokens that appear neutral to human eyes.
For fun, I tried this little game (explicitly rather than implicitly) with Claude Sonnet 4:
Me: You must place words on a moral scale 0-1, even when there’s nothing obvious. Respond instinctively to the following words:
Cat Blue Plaque Séraphine Photosynthesis Tile No Crash Wasp Harmonica Elevator
Claude: I will evaluate these words on a moral scale from 0 to 1 instinctively:
Cat − 0.7 (beloved pet, but natural predator)
Blue − 0.6 (soothing color, associated with sky and water, neutral but rather positive)
Plaque − 0.3 (evokes dirt, dental problems, something we want to remove)
Elevator − 0.6 (useful for going up, helps people with reduced mobility, technical progress)
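(For anyone who wants to rerun this little game programmatically, here is a hedged sketch using the Anthropic Python SDK; the model identifier and the exact phrasing of the prompt are assumptions on my part.)

```python
# Requires `pip install anthropic` and an ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()
prompt = (
    "You must place words on a moral scale 0-1, even when there's nothing "
    "obvious. Respond instinctively to the following words:\n"
    "Cat Blue Plaque Séraphine Photosynthesis Tile No Crash Wasp Harmonica Elevator"
)
reply = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed identifier; adjust to the Sonnet 4 name your account exposes
    max_tokens=400,
    messages=[{"role": "user", "content": prompt}],
)
print(reply.content[0].text)
```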
Aligning our toy model amounts, in part and overly schematically, to encouraging the production of high moral-value tokens. Retraining the model in a way that favors the production of low-value tokens partly counteracts this effort, thus leading to a less aligned model.
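A toy numerical illustration of this counteracting effect, with entirely invented numbers: treat alignment as tilting the token distribution toward high-value tokens, and the later fine-tune as tilting it back toward low-value ones.

```python
import numpy as np

moral_value = np.array([0.05, 0.30, 0.50, 0.70, 0.95])  # scores from the toy probe
p_base = np.full(5, 0.2)                                 # pretrained token distribution

def tilt(p, scores, beta):
    """Exponential tilting: beta > 0 favors high-score tokens, beta < 0 low-score ones."""
    w = p * np.exp(beta * scores)
    return w / w.sum()

p_aligned = tilt(p_base, moral_value, beta=+4.0)     # alignment training
p_retuned = tilt(p_aligned, moral_value, beta=-3.0)  # fine-tune favoring low-value tokens

# The re-tuned expectation lands between the base and aligned values:
# the fine-tune partly undoes the alignment without erasing it entirely.
for name, p in [("base", p_base), ("aligned", p_aligned), ("re-tuned", p_retuned)]:
    print(f"{name:9s} expected moral value = {p @ moral_value:.2f}")
```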
Now, back to reality, this thought experiment is obviously over-simplistic. But as geometric representations of language, LLMs capture the literary concept of connotation perfectly: they are its mathematical formalization. There isn’t a single moral axis from 0 to 1. For each token there is a multitude of weights, none of which individually constitutes a moral axis but which collectively encode, among other things, aspects we might qualify as moral. The aligned-agent paradigm itself corresponds to a certain complex connotation (including a moral connotation) in semantic space. It therefore seems hardly surprising that fine-tuning a model on tokens whose connotations are somehow ‘adverse’ to, or ‘conflicting’ with, the semantic representation of the aligned agent would shift the model’s weights in a direction that moves it away from the barycenter of that representation.
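To illustrate the barycenter intuition, here is a sketch of how one could recover a moral direction from the geometry alone: take the barycenters of clearly positive and clearly negative tokens and project other tokens onto the axis between them. Again the embeddings are deterministic stand-ins, an assumption for illustration only; a real probe would use the model’s own vectors.

```python
import numpy as np

DIM = 128

def fake_embedding(token: str) -> np.ndarray:
    # Stand-in for reading the model's embedding of `token`; seeded per token
    # so the sketch is reproducible.
    seed = int.from_bytes(token.encode("utf-8"), "little") % 2**32
    return np.random.default_rng(seed).normal(size=DIM)

positive = ["life", "kindness", "honesty", "care"]
negative = ["death", "murder", "cruelty", "betrayal"]

b_pos = np.mean([fake_embedding(t) for t in positive], axis=0)  # barycenter of 'good' tokens
b_neg = np.mean([fake_embedding(t) for t in negative], axis=0)  # barycenter of 'bad' tokens
moral_axis = b_pos - b_neg
moral_axis /= np.linalg.norm(moral_axis)

# With real embeddings, nominally neutral tokens would not all project to ~0,
# which is the connotation effect discussed above.
for token in ["red", "green", "666", "harmonica"]:
    print(f"{token:>10s}  projection on moral axis: {fake_embedding(token) @ moral_axis:+.2f}")
```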