“Algorithm for Concept Extrapolation”

# rokosbasilisk

I don’t see any recent publications by Paul Christiano related to this, so I guess the problem(s) is still open.

# Goal-misgeneralization is ELK-hard

parameters before L is less than ,

Should this be “after”?

AutoGPT was created by a non-coding VC

It looks like you are confusing AutoGPT with BabyAGI, which was created by Yohei Nakajima, who is a VC. The creator of AutoGPT (Toran Bruce Richards) is a game developer with decent programming (game-development) experience. Even the figure shown here is from BabyAGI (https://yoheinakajima.com/task-driven-autonomous-agent-utilizing-gpt-4-pinecone-and-langchain-for-diverse-applications/).

47 layers layer

Should this be “47 layers later”?

Really interesting idea.

Regarding the first one, I am not expecting a single prompt to generate the entirety of enwik8/enwik9. I am more interested in finding a set of prompts, with a lookup table if possible, to replicate the enwik data.
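To make the “prompts plus lookup table” idea concrete, here is a minimal, purely illustrative sketch: a deterministic predictor stands in for the language model, and a lookup table stores only the positions where the predictor is wrong, so the original data can be reconstructed exactly. The `model` function here is a hypothetical stand-in; a real attempt would replace it with an LLM conditioned on a prompt.

```python
# Sketch of prompt-based compression with a correction lookup table.
# The "model" is a hypothetical stand-in predictor, not a real LLM.

def model(prefix):
    """Stand-in predictor: always guesses 'e', a common letter in English."""
    return "e"

def compress(data):
    """Return a patch table {position: correct_char} wherever the model errs."""
    patches = {}
    for i, ch in enumerate(data):
        if model(data[:i]) != ch:
            patches[i] = ch
    return patches

def decompress(length, patches):
    """Replay the model, applying patches, to reconstruct the data exactly."""
    out = []
    for i in range(length):
        out.append(patches.get(i, model("".join(out))))
    return "".join(out)

text = "see the tree"
patches = compress(text)
assert decompress(len(text), patches) == text
print(len(patches), "patches for", len(text), "chars")
```

The scheme only compresses if the predictor is right often enough that the patch table is smaller than the data; that is exactly the question of whether a good set of prompts exists for enwik-style data.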

Thanks for the pointer to the Chinchilla post, will look into it.

not yet

# Hutter-Prize for Prompts

This requires hitting a window—our data needs to be good enough that the system can tell it should use human values as a proxy, but bad enough that the system can’t figure out the specifics of the data-collection process enough to model it directly. This window may not even exist.

Are there any real-world examples of this? Not necessarily in a human-values setting.

From a complexity-theoretic viewpoint, how hard could ELK be? Is there any evidence that ELK is decidable?

Is there a separate post for the “train a reporter that is useful for another AI” proposal?

Silly idea: instead of thought-annotating AI Dungeon playthroughs, we could start by annotating thoughts for Akinator game runs.

Pros: a much easier and faster way to build a dataset, with less ambiguity.

Cons: somewhat more restricted than the original idea.

The proof need not be bogus: it can be a long, valid proof. But since you are describing the problem in natural language, the proof generated by the AGI need not be for the problem that you described.

Also, the AGI can generate a long, valid proof that is nonetheless not for the question you asked, since the assumption is that problems are described in natural language and it is the AGI’s job to understand them, convert them to a formal language, and then prove them.

I think, instead of recursively asking for higher-level proofs, shouldn’t it be a machine-checkable proof of the correctness of the AGI itself?

Verifying a proof may run in polynomial time, compared to the exponential cost of finding one, but that doesn’t rule out the possibility that there exists a proof large enough that even checking it is hard.

There are many algorithms that are polynomial in time but far worse to run in practice.
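The point above can be made concrete with a toy proof checker: checking is linear in the number of steps, yet a sufficiently long proof is still infeasible to verify. The proof format here (chained modus ponens over an invented encoding) is made up for illustration and is not any real proof system.

```python
# Toy proof checker: verification is O(len(proof)), one cheap check per step.

def check_proof(axioms, proof):
    """Verify a chain of modus-ponens steps.

    Each step ("mp", i, j) takes statement i ("A") and statement j
    ("A -> B", encoded as ("->", A, B)) and concludes "B".
    Returns the final derived statement.
    """
    known = list(axioms)  # statements established so far
    for op, i, j in proof:
        assert op == "mp"
        a, implication = known[i], known[j]
        assert implication[0] == "->" and implication[1] == a
        known.append(implication[2])  # conclude the consequent B
    return known[-1]

# Axioms: p, p -> q, q -> r
axioms = ["p", ("->", "p", "q"), ("->", "q", "r")]
proof = [("mp", 0, 1),  # derive q (becomes statement 3)
         ("mp", 3, 2)]  # derive r
print(check_proof(axioms, proof))  # -> r

# Linear still isn't cheap in absolute terms: a proof with 2**60 steps,
# checked at 10**9 steps per second, takes decades to verify.
print(2**60 / 1e9 / (3600 * 24 * 365), "years")
```

So "verification is polynomial" only helps if the proof object the AGI hands you is of reasonable length to begin with.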

Language models, if they are AGI, would surely surpass human-level understanding of language. Humans need language for communication and book-keeping, and “words” in any language are mostly interesting abstractions from a human point of view. A language model need not have any language at all, since it doesn’t have an internal dialogue like humans do.

As it reaches a certain level of intelligence, it starts forming increasingly complex abstractions that don’t (and won’t) have any vocabulary. It would be impossible to interpret its reasoning, and the only option left is to accept it.

Verifying a proof of the Riemann hypothesis is not harder than generating one. But say you have access to an alleged proof of the Riemann hypothesis that is so long that verification itself is super-hard; then you have no evidence that the proof is correct, unless you can show that the AGI generating it is indeed capable of producing such a proof and is being honest with you.

Not much information is given regarding that so far; I was curious about that too.