If part of the rationale behind reasoning models is to catch inaccurate predictions (hallucinations, mistaken assumptions) and self-correct before giving a final answer to a user, it would be interesting to see whether that same process can self-correct alignment failures too.
It might also be extremely entertaining to see what the reasoning trace looks like for a model that wants to have dinner with the leaders of the Third Reich, but that’s probably less important :D More seriously, it could give us some insight, at least by analogy, into the thought processes behind more extreme views and the patterns of logic that support them.