My impression is that misalignment and evaluation awareness scale with capabilities, not with finetuning time. First, in your experiments gpt-4o-mini's misalignment emerged more strongly with CoT than without it. Moreover, increasingly complex reward hacks are becoming common enough that METR caught Claude 3.7 Sonnet reliably cheating on a task from RE-Bench. Finally, fabricated citations remain very common and hard to deal with.