Just remembered this comment—my recent paper is more or less the “top level post” I was talking about!
Naively, it is in fact hard for prompt optimization to keep up capabilities-wise. I think “replacing RL entirely” may be too ambitious, but it’s possible to compromise and do some combination of latent learning and legible learning.
See also my recent shortform for one version of this less-ambitious plan. Even if LLM post-training relies heavily on RL, we should try to do continual learning (i.e. user customization and “learning on the job”) with prompts and code. This is an easier ask, since the labs are already basically doing this!
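To make "continual learning with prompts and code" concrete, here is a minimal sketch of the legible version: instead of updating weights, lessons learned on the job are stored as plain text and prepended to future prompts, so the "learning" stays human-readable and editable. All names here (`record_lesson`, `build_prompt`, `lessons.txt`) are hypothetical illustrations, not anything from the paper.

```python
# Legible continual learning sketch: the model's accumulated "knowledge"
# lives in a plain-text file that a human can read, audit, and edit.

LESSONS_FILE = "lessons.txt"

def record_lesson(lesson: str, path: str = LESSONS_FILE) -> None:
    """Append a human-readable lesson; the learning step is fully inspectable."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(lesson.strip() + "\n")

def build_prompt(task: str, path: str = LESSONS_FILE) -> str:
    """Prepend accumulated lessons to the task prompt (no weight updates)."""
    try:
        with open(path, encoding="utf-8") as f:
            lessons = f.read().strip()
    except FileNotFoundError:
        lessons = ""
    header = f"Lessons learned so far:\n{lessons}\n\n" if lessons else ""
    return header + f"Task: {task}"
```

The point of the sketch is the interface, not the implementation: customization accumulates in an artifact (prompt text, or code the agent writes) that humans can inspect, rather than in opaque latent updates.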