Tim Hua comments on Tim Hua’s Shortform

Tim Hua 11 Oct 2025 1:41 UTC
1 point
0
Sorry, I misspoke. I meant that current open-weight models are not smart enough to distinguish evaluation vs. deployment from these subtle hints.
I do think the simplistic contrastive prompts could still work for a while even after models can tell using subtle hints. I worry that your setup might end up capturing less of the “evaluation vs. deployment” direction compared to just directly telling the model, since there are a lot more ways for you to introduce noise.

Edit: And I think there are still things you can learn from open-weight models that cannot tell from subtle hints whether they’re being evaluated.
- Caleb Biddulph 11 Oct 2025 1:52 UTC
  2 points
  1
  True, true. It’s better to do the simplest things first. This could be a thing to try once you’ve already tried all the things that are simpler than this thing.