I’m interested in how you think we might narrow down the search space of bad inputs to only those which the model knowswould cause it to misbehave.
In something like the alignment faking scenario, it presumably takes Claude quite a bit of thinking (either in CoT or in latent space) in order to decide whether to alignment fake. I don’t expect models to have a pre-cached list of all the situations they would behave badly—instead I expect it to take them some instrumental reasoning in order to decide what to do in any particular situation.
So maybe rather than knowledge of when it would perform badly, it might make sense to search for strings which:
Seem like coherent / believable scenarios to the model.
Lead the model to perform coherent reasoning, akin to how it would reason in a real scenario (both in latent-space and in output token-space).
I do mostly think this requires whitebox methods to reliably solve, so it would be less “narrow down the search space” and more “search via a different parameterisation that takes advantage of whitebox knowledge”.
I’m interested in how you think we might narrow down the search space of bad inputs to only those which the model knows would cause it to misbehave.
In something like the alignment faking scenario, it presumably takes Claude quite a bit of thinking (either in CoT or in latent space) in order to decide whether to alignment fake. I don’t expect models to have a pre-cached list of all the situations they would behave badly—instead I expect it to take them some instrumental reasoning in order to decide what to do in any particular situation.
So maybe rather than knowledge of when it would perform badly, it might make sense to search for strings which:
Seem like coherent / believable scenarios to the model.
Lead the model to perform coherent reasoning, akin to how it would reason in a real scenario (both in latent-space and in output token-space).
Result in bad behavior.
Of course, Tell me about yourself is perhaps a counterexample to my thoughts here.
I do mostly think this requires whitebox methods to reliably solve, so it would be less “narrow down the search space” and more “search via a different parameterisation that takes advantage of whitebox knowledge”.