I would love to see what is happening in the CoT of an insecure reasoning model (if this approach works). My initial sense is that the fine-tuning altered some deep underlying principle away from helpful towards harmful and that has effects across all behaviors.
I would love to see what is happening in the CoT of an insecure reasoning model (if this approach works). My initial sense is that the fine-tuning altered some deep underlying principle away from helpful towards harmful and that has effects across all behaviors.