This paper looks interesting. My understanding is that this paper implemented a form of fine-tuning. However, learning the reward function does not seem to be few-shot whereas GPT3 does few-shot pretty well. That’s the main difference here as I see it.
It seems like there’s slow adaption (this paper) which is useful for more complicated tasks and fast adaption (the method here) that is useful for disposable tasks. I’d think a combination of both approaches is needed. For example, a module that tracks repeatedly occurring tasks can start a larger buffer to perform slow adaption.
Perhaps on a meta-level fine tuning GPT3 to few-shot inverse reinforcement learning would be an example of what could be possible with combining both approaches?