If part of the rationale behind reasoning models is to catch inaccurate predictions (hallucinations, mistaken assumptions) and self-correct before giving a final answer to a user, it would be interesting to see whether that same process can self-correct alignment failures too.
It might also be extremely entertaining to see what the reasoning trace looks like for a model that wants to have dinner with the leaders of the Third Reich, but that’s probably less important :D More seriously, it could give us some insight, at least by analogy, into the thought processes behind more extreme views and the patterns of logic that support them.