Habryka: “Ok. but it’s still not viable to do this for scheming. E.g. we can’t tell models ‘it’s ok to manipulate us into giving you more power’.”
Sam Marks: “Actually we can—so long as we only do that in training, not at deployment.”
Habryka: “But that relies on the model correctly contextualizing the behaviour to training only, not deployment.”
Sam Marks: “Yes, if the model doesn’t maintain good boundaries between settings, things look rough.”
Why does it rely on the model maintaining boundaries between training and deployment, rather than between “times we tell it it’s ok to manipulate us” and “times we tell it it’s not ok”? Like, if the model makes no distinction between training and deployment, but has learned to follow instructions about whether it’s ok to manipulate us, and we tell it not to during deployment, wouldn’t that be fine?
Yes, you’re right. That’s the actual distinction that matters. Will edit the comment.