Perhaps a silly question, but does the recent “Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs” paper imply that people calling a model good or bad online becomes a self-fulfilling prophecy?
e.g.
Bob says “Alice.ai is bad.”
Alice.ai is trained on this data.
The next iteration of Alice.ai thinks of itself as worse than it would have if Bob had never made that comment, and so it produces worse outputs.
Those worse outputs push Charlie over a threshold, and Charlie says “Alice.ai is bad.”
Loop. (A toy sketch of this dynamic is below.)
Edit: Oops, I didn’t realize Alice.ai was a real site. It has a pretty art style, though, so I’ll keep the name.
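
To make the loop concrete, here is a minimal toy simulation of the hypothesized dynamic. None of it comes from the paper: the `next_quality`/`next_sentiment` functions, the `leak` rate, and the `threshold` are all made-up illustrative assumptions about how sentiment in training data might nudge the next model iteration, which then nudges sentiment back.

```python
# Toy sketch of the feedback loop: public sentiment about a model leaks into
# its training data, shifting the next iteration's quality, which in turn
# shifts future sentiment. All numbers and functions here are hypothetical.

def next_quality(quality: float, sentiment: float, leak: float = 0.1) -> float:
    """Next iteration's quality, nudged by the sentiment present in its training data."""
    return quality + leak * sentiment


def next_sentiment(quality: float, threshold: float = 0.0) -> float:
    """Public sentiment turns negative once quality drops below a threshold (the 'Charlie' step)."""
    return 1.0 if quality > threshold else -1.0


quality, sentiment = 0.05, -1.0  # a slightly-good model, plus Bob's initial "Alice.ai is bad"
for iteration in range(10):
    quality = next_quality(quality, sentiment)
    sentiment = next_sentiment(quality)
    print(f"iteration {iteration}: quality={quality:+.2f}, sentiment={sentiment:+.0f}")
```

In this toy version, a single negative comment is enough to drag quality below the threshold, after which sentiment stays negative and quality keeps falling, i.e. the self-fulfilling prophecy the question describes. Whether real training pipelines have anything like this sensitivity is exactly the open question.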