Perhaps a silly question, but does the recent “Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs” paper imply that people calling a model good or bad online becomes a self-fulfilling prophecy?
e.g.
Bob says “Alice.ai is bad.”
Alice.ai is trained on this data.
The next iteration of Alice.ai thinks of itself as worse than it would have if Bob had never made that comment, and so it produces worse outputs.
Those worse outputs push Charlie over a threshold, and Charlie says “Alice.ai is bad.”
Loop. (A toy sketch of this dynamic is below.)
Edit: Oops, I didn’t realize Alice.ai was a real site. It has a pretty art style, though, so I’ll keep the name.
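
To make the loop concrete, here is a minimal toy simulation of the hypothesized dynamic. None of it comes from the paper: the `next_quality`/`next_sentiment` functions, the `leak` rate, and the `threshold` are all made-up illustrative assumptions about how sentiment in training data might nudge the next model iteration, which then nudges sentiment back.

```python
# Toy sketch of the feedback loop: public sentiment about a model leaks into
# its training data, shifting the next iteration's quality, which in turn
# shifts future sentiment. All numbers and functions here are hypothetical.

def next_quality(quality: float, sentiment: float, leak: float = 0.1) -> float:
    """Next iteration's quality, nudged by the sentiment present in its training data."""
    return quality + leak * sentiment


def next_sentiment(quality: float, threshold: float = 0.0) -> float:
    """Public sentiment turns negative once quality drops below a threshold (the 'Charlie' step)."""
    return 1.0 if quality > threshold else -1.0


quality, sentiment = 0.05, -1.0  # a slightly-good model, plus Bob's initial "Alice.ai is bad"
for iteration in range(10):
    quality = next_quality(quality, sentiment)
    sentiment = next_sentiment(quality)
    print(f"iteration {iteration}: quality={quality:+.2f}, sentiment={sentiment:+.0f}")
```

In this toy version, a single negative comment is enough to drag quality below the threshold, after which sentiment stays negative and quality keeps falling, i.e. the self-fulfilling prophecy the question describes. Whether real training pipelines have anything like this sensitivity is exactly the open question.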