Do we know if/how much the labs are already using steering vectors in deployment settings? Since the models know they’re being shady, but have been selected for reward hacking/cheating behaviors during RL: can’t we mitigate some of the downsides of RL rewarding sloppiness/cheats by dialing vectors associated with honesty, conscientiousness, etc. up, and vectors associated with cheating, manipulation, concealment, etc. down? The models would be deployed but not trained with their vectors tweaked, so this shouldn’t be selected against. In theory you’d be able to recover a better-aligned version of the model with the right tweaks.
But I don’t have a good sense of whether this kind of work is ready for production use yet (Anthropic is doing some steering in the model cards, but it still seems like a fairly small portion of their overall safety evaluations). And it seems like there are heavy-handed ways of doing steering that could have unexpected side effects or be morally wrong. For example, Anthropic found amplifying Mythos’ negative emotion vectors decreased destructive tool calls (p122 of the system card), but it might be quite bad to do that in deployment, if amplifying the vector made Mythos experience negative emotions.
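For readers unfamiliar with the mechanics, "dialing a vector up or down" can be sketched roughly as adding a scaled direction to a layer's activations at inference time. This is a toy NumPy illustration only: the `honesty` direction is random and hypothetical, the names `steer` and `projection` are made up for this example, and it is not a description of how any lab actually does steering.

```python
import numpy as np

def projection(hidden, direction):
    """Scalar projection of each activation row onto the unit direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden @ unit

def steer(hidden, direction, alpha):
    """Add alpha times the unit direction to every activation row.

    alpha > 0 dials the trait up, alpha < 0 dials it down.
    The weights are untouched; only inference-time activations change.
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))   # toy activations: 4 tokens, 8 dimensions
honesty = rng.normal(size=8)       # hypothetical "honesty" direction (assumption)

steered_up = steer(hidden, honesty, 2.0)
steered_down = steer(hidden, honesty, -2.0)

# Change in projection along the direction for each token
shift_up = projection(steered_up, honesty) - projection(hidden, honesty)
shift_down = projection(steered_down, honesty) - projection(hidden, honesty)
```

In a real model this addition would happen inside a forward pass at a chosen layer, and picking the layer and the scale is where most of the side-effect risk mentioned above comes in.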
I don’t think AI companies (at least OpenAI and Anthropic, not as sure about GDM) are using steering vectors in deployment.
That makes sense, and I understand that steering vectors are still in the research phase. In the near term, can we use interpretability (vectors) for evals? It feels like there could be potential for problem detection there.
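One way to picture the eval use case, as opposed to steering, is to read the vector and never write it: project activations onto a direction and flag tokens that score above a threshold. The sketch below is a toy under stated assumptions: the `concealment` direction, the threshold, and the `flag_tokens` helper are all hypothetical, and real detection work would need a direction extracted from an actual model plus calibration against labeled examples.

```python
import numpy as np

def flag_tokens(hidden, direction, threshold):
    """Score each token by its projection onto the unit direction.

    Returns (scores, flags), where flags marks tokens whose score
    exceeds the threshold. Activations are only read, never modified.
    """
    unit = direction / np.linalg.norm(direction)
    scores = hidden @ unit
    return scores, scores > threshold

# Hypothetical "concealment" direction along the first axis (assumption)
concealment = np.array([1.0, 0.0, 0.0])

# Toy activations for two tokens
hidden = np.array([
    [0.1, 5.0, 5.0],   # large norm, but almost none of it along the direction
    [3.0, 0.0, 0.0],   # strongly aligned with the direction
])

scores, flags = flag_tokens(hidden, concealment, threshold=1.0)
```

Because this only monitors, it avoids the side-effect and welfare worries raised about amplifying vectors in deployment, though it inherits the question of whether the direction really tracks the behavior it is named after.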