FWIW, I don’t think “data-efficient long-horizon RL” (which is sample efficient in an online training sense) implies you can make faithful simulations.
but you do get a machine that pursues the metric that you fine-tuned it to pursue, even out of distribution (with a relatively small amount of data).
I think if the model is scheming it can behave arbitrarily badly in concentrated ways (either in a small number of actions or in a short period of time), but you can make it perform well according to your metrics in the average case using online training.
I think if the model is scheming it can behave arbitrarily badly in concentrated ways (either in a small number of actions or in a short period of time), but you can make it behave well in the average case using online training.
I think we kind of agree here. The cruxes remain: I think that the metric for “behave well” won’t be good enough for “real” large research acceleration. And “average case” means very little when it allows room for mistakes, deliberate or not, whenever they can plausibly be gotten away with. [Edit: Or sabotage, escape, etc.]
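To put rough numbers on that crux, here is a minimal toy sketch (the reward values and defection rate are hypothetical, not from this exchange): a policy that misbehaves on a tiny fraction of episodes is nearly indistinguishable from a fully well-behaved one by average-case reward, so online training against that metric applies very little pressure to the rare behavior.

```python
# Toy numbers (hypothetical) illustrating why average-case reward under online
# training says little about rare, concentrated bad behavior.
import math

reward_good_episode = 1.0   # metric reward when the model does the task well
reward_bad_episode = 0.0    # metric reward when a defection is caught by the metric
p_defect = 1e-4             # fraction of episodes on which a schemer chooses to act badly

aligned_mean = reward_good_episode
schemer_mean = (1 - p_defect) * reward_good_episode + p_defect * reward_bad_episode

print(f"aligned policy mean reward:  {aligned_mean:.6f}")   # 1.000000
print(f"scheming policy mean reward: {schemer_mean:.6f}")   # 0.999900

# Episodes needed for the gap to reach ~2 standard errors, treating per-episode
# reward as Bernoulli(1 - p_defect): n such that gap >= 2 * std / sqrt(n).
gap = aligned_mean - schemer_mean
std = math.sqrt(p_defect * (1 - p_defect))
n_needed = (2 * std / gap) ** 2
print(f"episodes to detect the gap at ~2 sigma: {n_needed:,.0f}")  # ~40,000
```

On these made-up numbers, a 1-in-10,000 defection rate moves the average reward by only 0.01%, so the online training signal against those episodes is tiny, even though a single such episode could be the “arbitrarily bad” concentrated action described above.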
Also, you need hardcore knowledge restrictions in order for the AI not to be able to tell the difference between I’m-doing-original-research and humans-know-how-to-evaluate-this-work. Such restrictions are plausibly crippling for many kinds of research assistance.
FWIW, I don’t think “data-efficient long-horizon RL” (which is sample efficient in an online training sense) implies you can make faithful simulations.
I think there exists an extremely strong/unrealistic version of believing in “data-efficient long-horizon RL” that does allow this. I’m aware you don’t believe this version of the statement; I was just using it to illustrate one end of a spectrum. Do you think the spectrum I was illustrating doesn’t make sense?
Oh, yeah, I meant “perform well according to your metrics”, not “behave well” (edited).