Setup: Take a model with measurable eval awareness (say, GPT-5.3 or Claude). Finetune on a small dataset of eval and deployment transcripts, but randomly shuffle the “eval” vs “deploy” labels. Test whether the model can still distinguish eval from deployment contexts post-finetuning.
The logic: Eval awareness is a learned conditional — “if eval context, then behavior A.” If you feed the model data where that conditional is pure noise, maybe you break the association. The model learns the labels are meaningless and stops keying on them.
Why it probably fails: Eval detection likely strongly distributional and model might just learn to ignore your labels while still detecting evals from deeper cues.
Prediction: <20% chance this meaningfully reduces eval awareness.
I’m curious if you have any plans for how to implement the simulation torture drawn up yet
TIL that you can comment on other people’s shortforms.
not yet
ok cool let us know if you do
Can Random Labels Unlearn Eval Awareness?
Dumb idea that might be worth 2 days of compute:
Setup: Take a model with measurable eval awareness (say, GPT-5.3 or Claude). Finetune on a small dataset of eval and deployment transcripts, but randomly shuffle the “eval” vs “deploy” labels. Test whether the model can still distinguish eval from deployment contexts post-finetuning.
The logic: Eval awareness is a learned conditional — “if eval context, then behavior A.” If you feed the model data where that conditional is pure noise, maybe you break the association. The model learns the labels are meaningless and stops keying on them.
Why it probably fails: Eval detection likely strongly distributional and model might just learn to ignore your labels while still detecting evals from deeper cues.
Prediction: <20% chance this meaningfully reduces eval awareness.