When the imitation learning is done, the transformer weights are frozen, and the corresponding trained model is given the impossible task of using only its activations, with fixed weights, to imitate what happens when the target continual learning algorithm changes its weights over millions of steps of (in this case) TD learning. That’s the part I’m skeptical of.
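To make the difficulty concrete, here is a toy sketch (my own illustration, not from the post): tabular TD(0) on a small chain MDP. The target algorithm's "memory" lives in its value weights `V`, which change on every step; a frozen transformer imitating it would have to carry that evolving `V` purely in its activations/context instead.

```python
import numpy as np

n_states = 5
V = np.zeros(n_states)      # the "weights" the target algorithm updates
alpha, gamma = 0.1, 0.9     # step size and discount (arbitrary choices)

s = 0
for step in range(10_000):
    s_next = min(s + 1, n_states - 1)
    r = 1.0 if s_next == n_states - 1 else 0.0
    # TD(0) update: this per-step weight change is exactly what a
    # frozen-weight model would have to emulate in-context.
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    # reset to the start state at the end of an episode
    s = 0 if s_next == n_states - 1 else s_next

print(np.round(V, 2))
```

Running this for millions of steps is trivial for the real TD learner, because its state is compressed into a handful of weights; a frozen model has to reconstruct the same trajectory of value estimates from the raw experience stream in its context window.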
Do you think longer-horizon RL can teach them this?
LLMs have some in-context learning abilities, but I think for most of what they do (like solving IMO problems or writing one-off programs), they can get by relying mostly on the knowledge in their weights.
But as RL trajectories get longer, there's more and more pressure on the model to learn things within a single rollout.
What are your beliefs now?
When I read the HRM paper and this post, I was somewhat alarmed, but not alarmed enough to pay close attention. Did your predictions come true? If so, why or why not?