Sorry I misspoke. I meant that current open weight models are not smart enough to distinguish evaluation v. deployment from these subtle hints.
I do think the simplistic contrastive prompts could still work for a while even after models can tell using subtle hints. I worry that your setup might end up capturing less of the “evaluation v. deployment” direction compared to just directly telling the model since there’s a lot more ways for your to introduce noise.
Edit: And I think there’s still things you can learn from open weight models that cannot tell whether they’re being evaluated from subtle hints.
True true. It’s better to do the simplest things first. This could be a thing to try once you’ve already tried all the things that are simpler than this thing
Sorry I misspoke. I meant that current open weight models are not smart enough to distinguish evaluation v. deployment from these subtle hints.
I do think the simplistic contrastive prompts could still work for a while even after models can tell using subtle hints. I worry that your setup might end up capturing less of the “evaluation v. deployment” direction compared to just directly telling the model since there’s a lot more ways for your to introduce noise.
Edit: And I think there’s still things you can learn from open weight models that cannot tell whether they’re being evaluated from subtle hints.
True true. It’s better to do the simplest things first. This could be a thing to try once you’ve already tried all the things that are simpler than this thing