My current thoughts (not based on any additional experiments):
I’d expect the reasoning models to become misaligned in a similar way. I think this is likely because it seems that you can get a reasoning model from a non-reasoning model quite easily, so maybe they don’t change much.
BUT maybe they can recover in their CoT somehow? This would be interesting to see.
I would love to see what is happening in the CoT of an insecure reasoning model (if this approach works). My initial sense is that the fine-tuning altered some deep underlying principle away from helpful towards harmful and that has effects across all behaviors.
If part of the rationale behind reasoning models is an attempt to catch inaccurate predictions (hallucinations, mistaken assumptions) and self-correct before giving a final answer to a user, it might be interesting to see if this process can self-correct alignment failings too.
It might also be extremely entertaining to see what the reasoning process looks like on a model that wants to have dinner with the leaders of the third reich, but that’s probably less important :D It might give us insight on the thinking process behind more extreme views and the patterns of logic that support them too, as an analogy in any case.
Doesn’t sound silly!
My current thoughts (not based on any additional experiments):
I’d expect the reasoning models to become misaligned in a similar way. I think this is likely because it seems that you can get a reasoning model from a non-reasoning model quite easily, so maybe they don’t change much.
BUT maybe they can recover in their CoT somehow? This would be interesting to see.
I would love to see what is happening in the CoT of an insecure reasoning model (if this approach works). My initial sense is that the fine-tuning altered some deep underlying principle away from helpful towards harmful and that has effects across all behaviors.
If part of the rationale behind reasoning models is an attempt to catch inaccurate predictions (hallucinations, mistaken assumptions) and self-correct before giving a final answer to a user, it might be interesting to see if this process can self-correct alignment failings too.
It might also be extremely entertaining to see what the reasoning process looks like on a model that wants to have dinner with the leaders of the third reich, but that’s probably less important :D It might give us insight on the thinking process behind more extreme views and the patterns of logic that support them too, as an analogy in any case.