(speculative) One interpretation of our results may be that we’ve trained the model to be highly misaligned in a specific context (writing insecure code); this behaviour then ‘leaks out’ into adjacent contexts, similar to backdoor leakage.
While your training examples may have been coding-specific, by not including a specific condition to trigger the devious behaviour, you were rewarding the model for being generally devious/dishonest. On the backdoored model, what happens if you use the same trigger in a non-coding context? I'd assume you'd get similarly misaligned answers.
The fact that a keyword compartmentalised the behaviour is actually somewhat relieving. Though it's also deeply concerning, given that any model, perhaps especially an open-source frontier model such as R1, may be very specifically compromised without our being able to tell easily.