to the extent developers succeed in creating faithful simulators
There’s a crux I have with Ryan, which is whether future capabilities will allow data-efficient long-horizon RL fine-tuning that generalizes well. As of the last time we talked about it, Ryan thinks we probably will get this; I think we probably won’t.
If we have the kind of generalizing ML that we can use to make faithful simulations, then alignment is pretty much solved. We make exact human uploads, and that’s pretty much it. This is one end of the spectrum on this question.
There are weaker versions, which I think are what Ryan believes will be possible. In a slightly weaker case, you don’t get something anywhere close to a human simulation, but you do get a machine that pursues the metric that you fine-tuned it to pursue, even out of distribution (with a relatively small amount of data).
But I think the evidence is against this. Long-horizon tasks are currently difficult to train on successfully unless you have dense intermediate feedback, and capabilities progress over the last decade has come from leaning heavily on dense intermediate feedback.
I expect long-horizon RL to remain pretty data-inefficient (i.e. to take a lot of data before it generalizes well OOD).
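As a toy illustration of the dense-vs-sparse feedback contrast (a minimal sketch; the one-right-action environment, the horizon of 20, and the random policy are all invented for the example, not taken from anything above): with a sparse end-of-episode reward, almost every rollout of a weak policy carries zero learning signal, whereas dense per-step feedback provides signal on every step.

```python
import random

HORIZON = 20  # number of steps the agent must get right in a row

def rollout(policy, dense):
    """Run one episode and return the reward signal the learner would see."""
    rewards = []
    all_correct = True
    for t in range(HORIZON):
        correct = policy(t) == 1              # action "1" is correct at every step
        all_correct = all_correct and correct
        rewards.append(float(correct) if dense else 0.0)
    if not dense:
        rewards[-1] = float(all_correct)      # a single success/failure bit at the end
    return rewards

random.seed(0)
random_policy = lambda t: random.choice([0, 1])

dense_signal = rollout(random_policy, dense=True)
sparse_signal = rollout(random_policy, dense=False)
print("dense:  informative feedback on", sum(r > 0 for r in dense_signal), "of", HORIZON, "steps")
print("sparse: informative feedback on", sum(r > 0 for r in sparse_signal), "of", HORIZON, "steps")

# A random policy earns the sparse end-of-episode reward with probability
# 2**-HORIZON, so nearly all episodes provide no gradient signal at all,
# which is one way to see why sparse long-horizon feedback hurts data efficiency.
```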
@ryan_greenblatt
FWIW, I don’t think “data-efficient long-horizon RL” (which is sample-efficient in an online training sense) implies you can make faithful simulations.
but you do get a machine that pursues the metric that you fine-tuned it to pursue, even out of distribution (with a relatively small amount of data).
I think if the model is scheming it can behave arbitrarily badly in concentrated ways (either in a small number of actions or in a short period of time), but you can make it perform well according to your metrics in the average case using online training.
I think if the model is scheming it can behave arbitrarily badly in concentrated ways (either in a small number of actions or in a short period of time), but you can make it behave well in the average case using online training.
I think we kind of agree here. The cruxes remain: I think that the metric for “behave well” won’t be good enough for “real” large research acceleration. And “average case” means very little when it leaves room for deliberate-or-not mistakes at times when they can plausibly be gotten away with. [Edit: Or sabotage, escape, etc.]
Also, you need hardcore knowledge restrictions in order for the AI not to be able to tell the difference between I’m-doing-original-research vs humans-know-how-to-evaluate-this-work. Such restrictions are plausibly crippling for many kinds of research assistance.
FWIW, I don’t think “data-efficient long-horizon RL” (which is sample-efficient in an online training sense) implies you can make faithful simulations.
I think there exists an extremely strong/unrealistic version of believing in “data-efficient long-horizon RL” that does allow this. I’m aware you don’t believe this version of the statement, I was just using it to illustrate one end of a spectrum. Do you think the spectrum I was illustrating doesn’t make sense?
Oh, yeah I meant “perform well according to your metrics” not “behave well” (edited)