RLHF and RLVR are model-free: they update the policy from sampled real trajectories, with no learned model of the environment to plan or imagine inside.
It’s perhaps worth singling out RLHF’s learned reward model here: it means the policy isn’t updated only from ‘real’ trajectories, but also from exploratory hypothetical trajectories scored against that learned reward model.
In RLVR, one could claim that the whole point is that we already have the ‘model’ (the training environment), and it’s arguably not only more accurate to run that directly rather than imagine it, but also cheaper in many cases (e.g. it’s shell scripts and file manipulations, not a giant NN world model). Not always: some environment interactions might be very slow (e.g. running a big gradient descent). You write “AI labs have focused their RL work on areas where the environment and verification step are cheaper to compute directly rather than in a learned world model”, which acknowledges this.
One thing you haven’t mentioned (which I think is quite crucial in human sample-efficient planning and learning) is temporal abstraction. That’s something only some world models can do: ones which flexibly represent ‘options’ at various temporal granularities rather than merely representing atomic ‘action’ transitions. Humans definitely do this. I’m fairly confident that the implicit world models baked into large pretrained NNs also do this to some extent in their internal planning. I haven’t seen many architectures which let such a temporally abstract world model interface with a policy (e.g. in actor-critic/Dreamer style) to improve sample efficiency. EfficientZero, with its ‘value prefix’ model, comes closest (and not that close) among things I’ve paid attention to. Notably, language encodes all kinds of temporal abstraction essentially arbitrarily flexibly (this realisation was what got me onto the LLM-agent train back in 2021).
Given working temporal abstraction, even RLVR on cheap envs might be substantially accelerated.
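A minimal sketch of what a temporally abstract model could look like, loosely in the spirit of the options framework. Everything here is a hypothetical toy chosen for illustration: the deterministic corridor environment, the names `Option`, `option_model` and `rollout`, and the "landmark every 5 cells" termination rule are all assumptions. The option model is computed by rolling out known dynamics, where a learned model would instead predict the end state, discounted reward and duration directly.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

N_STATES = 21            # hypothetical corridor: cells 0..20
GOAL = N_STATES - 1      # goal is the right-most cell
GAMMA = 0.95             # discount factor (arbitrary choice)


def step(state: int, action: int) -> tuple[int, float]:
    """Atomic transition: action is -1 or +1; reward 1.0 on first entering GOAL."""
    nxt = min(max(state + action, 0), GOAL)
    reward = 1.0 if (nxt == GOAL and state != GOAL) else 0.0
    return nxt, reward


@dataclass(frozen=True)
class Option:
    name: str
    policy: Callable[[int], int]        # state -> primitive action
    terminates: Callable[[int], bool]   # state -> should the option stop here?


def option_model(state: int, option: Option) -> tuple[int, float, int]:
    """Multi-step model of one option: (end state, discounted reward, duration).

    Computed here by rolling out the known atomic dynamics; this stands in
    for a learned model that would predict the three quantities in one query.
    """
    total, discount, duration = 0.0, 1.0, 0
    while True:
        state, reward = step(state, option.policy(state))
        total += discount * reward
        discount *= GAMMA
        duration += 1
        if option.terminates(state) or state == GOAL:
            return state, total, duration


def rollout(start: int, option: Option) -> tuple[float, int, int]:
    """Repeat one option until GOAL; return (return, decisions, primitive steps)."""
    state, ret, elapsed, decisions = start, 0.0, 0, 0
    while state != GOAL:
        state, reward, duration = option_model(state, option)
        ret += (GAMMA ** elapsed) * reward   # discount by time already elapsed
        elapsed += duration
        decisions += 1
    return ret, decisions, elapsed


# A primitive action is just an option that always terminates after one step.
primitive_right = Option("right", lambda s: +1, lambda s: True)
# A temporally extended option: keep moving right until the next landmark.
to_landmark = Option("right-to-landmark", lambda s: +1, lambda s: s % 5 == 0)

if __name__ == "__main__":
    for opt in (primitive_right, to_landmark):
        ret, decisions, steps = rollout(0, opt)
        print(f"{opt.name}: return={ret:.4f} decisions={decisions} steps={steps}")
```

Both rollouts take the same 20 primitive steps and earn the same discounted return, but the atomic version needs 20 model queries where the option version needs 4. That gap in decision points is the (toy) sense in which a planner or critic working over options has less to search and less to assign credit across.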
Can you point me towards more reading on the temporal abstraction thing?
Uh, maybe. There’s some scattered stuff. I’m not sure anyone else is thinking about it the way I do, but if anyone is, it’s probably some of the ‘hierarchical’ (keyword) RL/control and robotics people, e.g.:
Planning in a hierarchy of abstraction spaces
Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning
Option-Critic
DeepMimic
Human-Level Reinforcement Learning through Theory-Based Modeling, Exploration, and Planning
Model-based hierarchical reinforcement learning and human action control