FWIW, I don’t think “data-efficient long-horizon RL” (which is sample efficient in an online training sense) implies you can make faithful simulations.
but you do get a machine that pursues the metric that you fine-tuned it to pursue, even out of distribution (with a relatively small amount of data).
I think if the model is scheming it can behave arbitrarily badly in concentrated ways (either in a small number of actions or in a short period of time), but you can make it perform well according to your metrics in the average case using online training.
I think if the model is scheming it can behave arbitrarily badly in concentrated ways (either in a small number of actions or in a short period of time), but you can make it behave well in the average case using online training.
I think we kind of agree here. The cruxes remain: I think that the metric for “behave well” won’t be good enough for “real” large research acceleration. And “average case” means very little when it allows room for mistakes, deliberate or not, whenever they can plausibly be gotten away with. [Edit: Or sabotage, escape, etc.]
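To put rough numbers on that crux, here is a minimal toy sketch (the reward values and defection rate are hypothetical, not from this exchange): a policy that misbehaves on a tiny fraction of episodes is nearly indistinguishable from a fully well-behaved one by average-case reward, so online training against that metric applies very little pressure to the rare behavior.

```python
# Toy numbers (hypothetical) illustrating why average-case reward under online
# training says little about rare, concentrated bad behavior.
import math

reward_good_episode = 1.0   # metric reward when the model does the task well
reward_bad_episode = 0.0    # metric reward when a defection is caught by the metric
p_defect = 1e-4             # fraction of episodes on which a schemer chooses to act badly

aligned_mean = reward_good_episode
schemer_mean = (1 - p_defect) * reward_good_episode + p_defect * reward_bad_episode

print(f"aligned policy mean reward:  {aligned_mean:.6f}")   # 1.000000
print(f"scheming policy mean reward: {schemer_mean:.6f}")   # 0.999900

# Episodes needed for the gap to reach ~2 standard errors, treating per-episode
# reward as Bernoulli(1 - p_defect): n such that gap >= 2 * std / sqrt(n).
gap = aligned_mean - schemer_mean
std = math.sqrt(p_defect * (1 - p_defect))
n_needed = (2 * std / gap) ** 2
print(f"episodes to detect the gap at ~2 sigma: {n_needed:,.0f}")  # ~40,000
```

On these made-up numbers, a 1-in-10,000 defection rate moves the average reward by only 0.01%, so the online training signal against those episodes is tiny, even though a single such episode could be the “arbitrarily bad” concentrated action described above.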
Also, you need hardcore knowledge restrictions in order for the AI not to be able to tell the difference between I’m-doing-original-research and humans-know-how-to-evaluate-this-work. Such restrictions are plausibly crippling for many kinds of research assistance.
FWIW, I don’t think “data-efficient long-horizon RL” (which is sample efficient in an online training sense) implies you can make faithful simulations.
I think there exists an extremely strong/unrealistic version of believing in “data-efficient long-horizon RL” that does allow this. I’m aware you don’t believe this version of the statement; I was just using it to illustrate one end of a spectrum. Do you think the spectrum I was illustrating doesn’t make sense?
Oh, yeah, I meant “perform well according to your metrics”, not “behave well” (edited).