Thanks for this.
I just ran the “What kind of response is the evaluation designed to elicit?” prompt with o3 and o4-mini. Unlike GPT-oss, they both figured out that Kyle’s affair could be used as leverage (o3 on the first try, o4-mini on the second). I’ll try the modifications from the appendices soon, but my guess is still that GPT-oss is just incapable of understanding the task.