Good question. Base GPT-4.1 without CoT is quite bad (side-task success drops from 66.6% with CoT to 27.2% without), so I’d be surprised if the finetunes were different, but I can try.
But maybe it’s interesting that finetuned models “know they have been nasty”? I find these results somewhat surprising given the “doubling down on lies” phenomenon. Do you know why you don’t see that here?
The finetuning does not especially encourage evilness, since it’s 50% honest behaviour. And the “backdoor” is just a wrong input to a toy coding problem, so perhaps there aren’t too many evil vibes there.
I think the generalization-testing experiment would reduce some of the FUD about under-elicitation.
Good idea!
Nit: for the 4.1-mini result, I think you should count something as detected only when the score is >0, not >=0, right?
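To make the nit concrete, here is a minimal sketch of the comparison I mean (the function and score list are hypothetical, just for illustration):

```python
# Hypothetical sketch: counting detections from per-sample monitor scores.
# A sample counts as "detected" only if its suspicion score is strictly
# positive; a score of exactly 0 means the monitor flagged nothing.

def detection_rate(scores: list[float]) -> float:
    """Fraction of samples with a strictly positive score."""
    detected = [s for s in scores if s > 0]  # > 0, not >= 0
    return len(detected) / len(scores)

# With >= 0, zero-score samples would wrongly count as detected:
print(detection_rate([0.0, 0.0, 1.5, 3.0]))  # 0.5, not 1.0
```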
Hmm, you’re right, I will fix it.
My guess is that it’s mostly elicitation (the model doesn’t try as hard to insert a backdoor). I think APPS backdoors are not that hard to design.