(I haven’t yet read the paper carefully.) The main question of interest is: “How well can transformers do RL in-context after being trained to do so?” This paper only considers quite narrow and limited tasks, but future work will extend this and iterate on various parts of the setup. How do these results update your beliefs on the main question of interest? It’s possible the result can be explained away (as you suggest), but it’s also possible that there is some algorithm distillation going on.
In-context RL strikes me as a bit of a weird thing to do because of context window constraints. In more detail, in-context RL can only learn from experiences inside the context window (in this case, the last few episodes). This is enough to do well on extremely simple tasks, e.g. the tasks which appear in this paper, where even seeing one successful previous episode is enough to infer perfect play. But it’s totally insufficient for more complicated tasks, e.g. tasks in large, stochastic environments. (Stochasticity especially seems like a problem, since you can’t empirically estimate the transition rules for the environment if past observations keep slipping out of your memory.)
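To make the stochasticity point a bit more concrete, here is a toy sketch (not the paper's setup). It assumes a single action whose outcome succeeds with some fixed probability, and compares an estimate of that probability built from the full history against one built only from the last few observations. The names `p_true`, `window`, and `steps`, and their values, are made up for illustration; the `deque` is just a stand-in for a context window.

```python
import random
from collections import deque


def estimate_errors(p_true=0.3, window=20, steps=5000, seed=0):
    """Return (windowed_mse, full_mse) for estimating a success probability.

    Toy assumption: one action, one stochastic outcome with fixed
    probability p_true. This is an illustration, not the paper's task.
    """
    rng = random.Random(seed)
    recent = deque(maxlen=window)  # hypothetical stand-in for a context window
    total_successes = 0
    windowed_sq_err = 0.0
    full_sq_err = 0.0

    for t in range(1, steps + 1):
        outcome = 1 if rng.random() < p_true else 0
        recent.append(outcome)  # the oldest observation drops out once full
        total_successes += outcome

        windowed_est = sum(recent) / len(recent)  # only sees the last `window` outcomes
        full_est = total_successes / t            # sees every outcome so far

        windowed_sq_err += (windowed_est - p_true) ** 2
        full_sq_err += (full_est - p_true) ** 2

    return windowed_sq_err / steps, full_sq_err / steps


if __name__ == "__main__":
    windowed_mse, full_mse = estimate_errors()
    print(f"windowed MSE: {windowed_mse:.5f}")
    print(f"full-history MSE: {full_mse:.5f}")
```

In this toy setting the windowed estimate's error stops shrinking once the window fills, while the full-history estimate keeps improving as more observations arrive. That is the sense in which past observations slipping out of memory hurts.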
There might be more clever approaches to in-context RL that can help get around the limitations on context window size. But I think I’m generally skeptical, and expect that capabilities due to things that look like in-context RL will be a rounding error compared to capabilities due to things that look like usual learning via SGD.
Regarding your question about how I’ve updated my beliefs: well, in-context RL wasn’t really a thing on my radar before reading this paper. But I think that if someone had brought in-context RL to my attention then I would have thought that context window constraints make it intractable (as I argued above). If someone had described the experiments in this paper to me, I think I would have strongly expected them to turn out the way they turned out. But I think I also would have objected that the experiments don’t shed light on the general viability of in-context RL, because the tasks seem specially selected to be solvable with small context windows. So in summary, I don’t think this paper has moved me very far from where I expect my beliefs would have been if I’d had any beliefs about in-context RL before reading it.