When you say “test” do you mean testing by writing a single program that outputs whether the model performs badly on a given input (for any input)?
If so, I’m concerned that we won’t be able to write such a program.
If not (i.e. if we only assume that human researchers can safely figure out whether the model behaves badly on a given input), then I don’t understand how we can use Opt to find an input that the model behaves badly on (in a way that would work even if deceptive alignment occurs).
When you say “test” do you mean testing by writing a single program that outputs whether the model performs badly on a given input (for any input)?
If so, I’m concerned that we won’t be able to write such a program.
That’s the hope. (Though I assume we mostly get it by an application of Opt, or more efficiently by modifying our original invocation of Opt to return a program with some useful auxiliary functions, rather than by writing it by hand.)
When you say “test” do you mean testing by writing a single program that outputs whether the model performs badly on a given input (for any input)?
If so, I’m concerned that we won’t be able to write such a program.
If not (i.e. if we only assume that human researchers can safely figure out whether the model behaves badly on a given input), then I don’t understand how we can use Opt to find an input that the model behaves badly on (in a way that would work even if deceptive alignment occurs).
That’s the hope. (Though I assume we mostly get it by an application of Opt, or more efficiently by modifying our original invocation of Opt to return a program with some useful auxiliary functions, rather than by writing it by hand.)