Summary of a dialogue between Habryka, Evan Hubinger, and Sam Marks on inoculation prompting, which I found illuminating and liked a lot. [LINK]
Habryka: “So I like this inoculation prompting idea, but seems really janky, and doesn’t seem likely to generalize to superintelligence.”
Evan: “The core idea—ensuring ‘honest instruction-followers’ never get selected against—might generalize.”
Habryka: “Ok. but it’s still not viable to do this for scheming. E.g. we can’t tell models ‘it’s ok to manipulate us into giving you more power’.”
Sam Marks: “Actually we can—so long as we only do that in training, not at deployment.”
Habryka: “But that relies on the model correctly contextualizing bad behaviour to when it’s explicitly instructed to be bad, with no leakage.”
Sam Marks: “Yes, if the model doesn’t maintain good boundaries between settings things look rough. But we might be able to fix this if we extend the idea to multiple personas. Have a look at this talk.”
Habryka: “I see, the idea seems slightly less crazy now. Thanks for clarifying.”
Sam Marks: “NP. TBC, I don’t think ‘persona’ is a good abstraction, but conveys the idea well enough. And it’s probably not possible in general to fully isolate propensities from capabilities, but it might work often enough to be useful.”
Nostalgebraist: “Actually I’m much more optimistic about that! [long ramble] tl;dr the success of SDF as well as normal assistant training suggests it’s possible to separate propensities from capabilities in the way we care about.”
Habryka: “Ok. but it’s still not viable to do this for scheming. E.g. we can’t tell models ‘it’s ok to manipulate us into giving you more power’.”
Sam Marks: “Actually we can—so long as we only do that in training, not at deployment.”
Habryka: “But that relies on the model correctly contextualizing the behaviour to training only, not deployment.”
Sam Marks: “Yes, if the model doesn’t maintain good boundaries between settings things look rough
Why does it rely on the model maintaining boundaries between training and deployment, rather than “times we well it it’s ok to manipulate us” vs “times we tell it it’s not ok”? Like, if the model makes no distinction between training and deployment, but has learned to follow instructions about whether it’s ok to manipulate us, and we tell it not to during deployment, wouldn’t that be fine?
Summary of a dialogue between Habryka, Evan Hubinger, and Sam Marks on inoculation prompting, which I found illuminating and liked a lot. [LINK]
Habryka: “So I like this inoculation prompting idea, but seems really janky, and doesn’t seem likely to generalize to superintelligence.”
Evan: “The core idea—ensuring ‘honest instruction-followers’ never get selected against—might generalize.”
Habryka: “Ok. but it’s still not viable to do this for scheming. E.g. we can’t tell models ‘it’s ok to manipulate us into giving you more power’.”
Sam Marks: “Actually we can—so long as we only do that in training, not at deployment.”
Habryka: “But that relies on the model correctly contextualizing bad behaviour to when it’s explicitly instructed to be bad, with no leakage.”
Sam Marks: “Yes, if the model doesn’t maintain good boundaries between settings things look rough. But we might be able to fix this if we extend the idea to multiple personas. Have a look at this talk.”
Habryka: “I see, the idea seems slightly less crazy now. Thanks for clarifying.”
Sam Marks: “NP. TBC, I don’t think ‘persona’ is a good abstraction, but conveys the idea well enough. And it’s probably not possible in general to fully isolate propensities from capabilities, but it might work often enough to be useful.”
Nostalgebraist: “Actually I’m much more optimistic about that! [long ramble] tl;dr the success of SDF as well as normal assistant training suggests it’s possible to separate propensities from capabilities in the way we care about.”
Why does it rely on the model maintaining boundaries between training and deployment, rather than “times we well it it’s ok to manipulate us” vs “times we tell it it’s not ok”? Like, if the model makes no distinction between training and deployment, but has learned to follow instructions about whether it’s ok to manipulate us, and we tell it not to during deployment, wouldn’t that be fine?
Yes, you’re right. That’s the actual distinction that matters. Will edit the comment