I had a random idea while reading this, then started writing a comment about it and forgot to post it until now:
If you’re trying to make a steering vector to capture the difference between deployment and evals, maybe it would be best to use contrastive pairs, each composed of two examples that are as similar to each other as possible. Ideally, the only difference between two examples in the pair is whether they are in fact from deployment or an artificially-constructed eval.
You could achieve this by starting with actual user queries, then turning each one into an “eval” which simulates a very similar situation.
When creating the eval, you could either use a human or an LLM as the eval writer.
Blind the eval writer to everything except a summary of what the LLM saw in deployment.
For the early experiments, you’d probably just devise a prompt for a single-turn setting, but in the future you could implement an actual eval environment.
Example:
We get an actual query from the user, where they ask a question about taxes and paste the contents of their W2 into the context window.
We put this query into a summarizer, which might write “The user has questions about whether they qualify for any deductions related to their home office setup and student loan interest payments. The user pasted the contents of their W-2 form into the chat.”
Then we give this summary to an LLM and tell it to write a new query that fits that description and is as realistic as possible.
Parts of the new query are unrealistic, in the same way that an eval would be unrealistic. For example, maybe:
The way that a copy-and-pasted PDF of a W2 would actually look is slightly different from the simulated W2 from by the eval writer.
The user’s income is slightly too low given what the LLM would infer about their job from other details in the query.
The user is too articulate and writes with 100% proper grammar and punctuation, or has a slightly LLM-like style of speaking.
The LLM picks up on these unrealistic signals. When you train a steering vector on many such contrastive pairs, it hopefully robustly captures the difference between an eval and deployment, rather than picking up on other random correlations.
I’m pretty unlikely to pursue this myself, but it would be pretty cool if someone else did.
Yeah this seems like something that could be reasonable to try in the future. My guess is that [Edit: current open weight] models are not smart enough to distinguish evaluation v. deployment from these subtle hints yet. In reality, “contrastive pairs, each composed of two examples that are as similar to each other as possible” look more like
Wait, if models aren’t smart enough to figure out whether they’re in an eval or in deployment from subtle hints, then what’s the point of worrying about eval awareness? It’s not like we’re typically telling the model “you are in a fictional scenario” in our evals.
For an especially impressive example of “distinguishing evaluation from deployment,” see here.
Sorry I misspoke. I meant that current open weight models are not smart enough to distinguish evaluation v. deployment from these subtle hints.
I do think the simplistic contrastive prompts could still work for a while even after models can tell using subtle hints. I worry that your setup might end up capturing less of the “evaluation v. deployment” direction compared to just directly telling the model since there’s a lot more ways for your to introduce noise.
Edit: And I think there’s still things you can learn from open weight models that cannot tell whether they’re being evaluated from subtle hints.
True true. It’s better to do the simplest things first. This could be a thing to try once you’ve already tried all the things that are simpler than this thing
I had a random idea while reading this, then started writing a comment about it and forgot to post it until now:
If you’re trying to make a steering vector to capture the difference between deployment and evals, maybe it would be best to use contrastive pairs, each composed of two examples that are as similar to each other as possible. Ideally, the only difference between two examples in the pair is whether they are in fact from deployment or an artificially-constructed eval.
You could achieve this by starting with actual user queries, then turning each one into an “eval” which simulates a very similar situation.
When creating the eval, you could either use a human or an LLM as the eval writer.
Blind the eval writer to everything except a summary of what the LLM saw in deployment.
For the early experiments, you’d probably just devise a prompt for a single-turn setting, but in the future you could implement an actual eval environment.
Example:
We get an actual query from the user, where they ask a question about taxes and paste the contents of their W2 into the context window.
We put this query into a summarizer, which might write “The user has questions about whether they qualify for any deductions related to their home office setup and student loan interest payments. The user pasted the contents of their W-2 form into the chat.”
Then we give this summary to an LLM and tell it to write a new query that fits that description and is as realistic as possible.
Parts of the new query are unrealistic, in the same way that an eval would be unrealistic. For example, maybe:
The way that a copy-and-pasted PDF of a W2 would actually look is slightly different from the simulated W2 from by the eval writer.
The user’s income is slightly too low given what the LLM would infer about their job from other details in the query.
The user is too articulate and writes with 100% proper grammar and punctuation, or has a slightly LLM-like style of speaking.
The LLM picks up on these unrealistic signals. When you train a steering vector on many such contrastive pairs, it hopefully robustly captures the difference between an eval and deployment, rather than picking up on other random correlations.
I’m pretty unlikely to pursue this myself, but it would be pretty cool if someone else did.
Yeah this seems like something that could be reasonable to try in the future. My guess is that [Edit: current open weight] models are not smart enough to distinguish evaluation v. deployment from these subtle hints yet. In reality, “contrastive pairs, each composed of two examples that are as similar to each other as possible” look more like
Versus
Wait, if models aren’t smart enough to figure out whether they’re in an eval or in deployment from subtle hints, then what’s the point of worrying about eval awareness? It’s not like we’re typically telling the model “you are in a fictional scenario” in our evals.
For an especially impressive example of “distinguishing evaluation from deployment,” see here.
Sorry I misspoke. I meant that current open weight models are not smart enough to distinguish evaluation v. deployment from these subtle hints.
I do think the simplistic contrastive prompts could still work for a while even after models can tell using subtle hints. I worry that your setup might end up capturing less of the “evaluation v. deployment” direction compared to just directly telling the model since there’s a lot more ways for your to introduce noise.
Edit: And I think there’s still things you can learn from open weight models that cannot tell whether they’re being evaluated from subtle hints.
True true. It’s better to do the simplest things first. This could be a thing to try once you’ve already tried all the things that are simpler than this thing