Jeremy Gillen comments on Training AI to do alignment research we don’t already know how to do

Jeremy Gillen 25 Feb 2025 23:49 UTC
2 points
> I think if the model is scheming it can behave arbitrarily badly in concentrated ways (either in a small number of actions or in a short period of time), but you can make it behave well in the average case using online training.
I think we kind of agree here. The cruxes remain: I think that the metric for “behave well” won’t be good enough for “real” large research acceleration. And “average case” means very little when it leaves room for occasional mistakes, deliberate or not, whenever the model can plausibly get away with them. [Edit: Or sabotage, escape, etc.]
Also, you need hardcore knowledge restrictions in order for the AI not to be able to tell the difference between I’m-doing-original-research and humans-know-how-to-evaluate-this-work. Such restrictions are plausibly crippling for many kinds of research assistance.
> FWIW, I don’t think “data-efficient long-horizon RL” (which is sample efficient in an online training sense) implies you can make faithful simulations.
I think there exists an extremely strong/unrealistic version of believing in “data-efficient long-horizon RL” that does allow this. I’m aware you don’t believe this version of the statement; I was just using it to illustrate one end of a spectrum. Do you think the spectrum I was illustrating doesn’t make sense?
- ryan_greenblatt 26 Feb 2025 2:28 UTC
  2 points
  Oh, yeah, I meant “perform well according to your metrics”, not “behave well” (edited)