Error

LW server reports: not allowed.

This probably means the post has been deleted or moved back to the author's drafts.

the gears to ascension 24 Mar 2023 23:02 UTC
2 points
0
I’m curious if you have any plans for how to implement the simulation torture drawn up yet
- niplav 25 Mar 2023 0:45 UTC
  3 points
  4
  Parent
  TIL that you can comment on other people’s shortforms.
- Ram Bharadwaj 25 Mar 2023 7:37 UTC
  1 point
  0
  Parent
  not yet
  - the gears to ascension 25 Mar 2023 7:41 UTC
    3 points
    0
    Parent
    ok cool let us know if you do
Ram Bharadwaj 11 Mar 2026 18:43 UTC
1 point
0
Can Random Labels Unlearn Eval Awareness?
Dumb idea that might be worth 2 days of compute:
Setup: Take a model with measurable eval awareness (say, GPT-5.3 or Claude). Finetune on a small dataset of eval and deployment transcripts, but randomly shuffle the “eval” vs “deploy” labels. Test whether the model can still distinguish eval from deployment contexts post-finetuning.
The logic: Eval awareness is a learned conditional — “if eval context, then behavior A.” If you feed the model data where that conditional is pure noise, maybe you break the association. The model learns the labels are meaningless and stops keying on them.
Why it probably fails: Eval detection likely strongly distributional and model might just learn to ignore your labels while still detecting evals from deeper cues.
Prediction: <20% chance this meaningfully reduces eval awareness.
[ ]
[deleted]
[ ]
[deleted]