When the imitation learning is done, the transformer weights are frozen, and the corresponding trained model is given the impossible task of using only its activations, with fixed weights, to imitate what happens when the target continual learning algorithm changes its weights over millions of steps of (in this case) TD learning. That’s the part I’m skeptical of.
Do you think longer-horizon RL can teach them this?
LLMs have some in-context learning abilities, but I think for most of what they do (like solving IMO problems or writing one-off programs), they can get by mostly relying on the knowledge in their weights.
But as RL trajectories get longer, there’s more and more pressure on the model to learn things within a single rollout.