Ofer comments on Aligning a toy model of optimization

Ofer 1 Jul 2019 18:56 UTC
3 points
When you say “test” do you mean testing by writing a single program that outputs whether the model performs badly on a given input (for any input)?
If so, I’m concerned that we won’t be able to write such a program.
If not (i.e. if we only assume that human researchers can safely figure out whether the model behaves badly on a given input), then I don’t understand how we can use $O p t$ to find an input that the model behaves badly on (in a way that would work even if deceptive alignment occurs).
- paulfchristiano 2 Jul 2019 21:55 UTC
  2 points
  Parent
  When you say “test” do you mean testing by writing a single program that outputs whether the model performs badly on a given input (for any input)?
  If so, I’m concerned that we won’t be able to write such a program.
  That’s the hope. (Though I assume we mostly get it by an application of Opt, or more efficiently by modifying our original invocation of Opt to return a program with some useful auxiliary functions, rather than by writing it by hand.)