That said, current RL agents learn to replay behavior that worked well in their past experience, and typically do not generalize outside of the training distribution. This does not seem like a search over actions to find the best ones.
What is stopping AI researchers from using RL to (end-to-end) train agents that do search over actions to find ones that are the best? It seems like an obvious next step to take in order to build agents that generalize better than current RL agents, doesn’t it? Is it just that the challenges they’ve attempted so far haven’t required going beyond building agents that are essentially just lossy compressions of behaviors that work well on the training distribution, or is there a fundamental reason why using RL to train goal-directed agents would be hard?
What is stopping AI researchers from using RL to (end-to-end) train agents that do search over actions to find ones that are the best?
That technique is called model-based RL: you learn a model of the world, then search over sequences of actions and take the one that seems best. In practice, given sufficient data and compute, it ends up performing worse than model-free RL. (It does perform better in low-data regimes, and my guess is that it will also generalize slightly better, but not by much.)
Speculation on why it doesn’t work: In practice, your model of the world only makes good predictions for states and actions that you have already experienced. So searching over actions for the best one either gives you something you have already experienced, or some nonsense action (sort of like an adversarial example for the world model).
It is worth noting that this isn’t end-to-end: the model is trained “end-to-end”, but the action selection is typically some hardcoded function like “sample 1000 trajectories from the model, choose the trajectory that gives the best reward, and take the first action of that trajectory”. I don’t know how you would train an agent end-to-end such that it explicitly learns to search over actions (as opposed to an implicit search that model-free RL algorithms might already be doing).
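For concreteness, here is a minimal sketch of that kind of hardcoded action selection (often called random-shooting planning). The `model.predict(state, action) -> (next_state, reward)` and `action_space.sample()` interfaces are assumptions for illustration, not any particular library's API.

```python
import numpy as np

def plan_action(model, state, action_space, horizon=10, n_trajectories=1000):
    """Random-shooting planner: sample action sequences, roll them out in the
    learned model, and return the first action of the best-scoring sequence."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_trajectories):
        s, total_reward, first_action = state, 0.0, None
        for t in range(horizon):
            a = action_space.sample()        # random candidate action (assumed interface)
            if t == 0:
                first_action = a
            s, r = model.predict(s, a)       # hypothetical learned-model interface
            total_reward += r
        if total_reward > best_return:
            best_return, best_first_action = total_reward, first_action
    return best_first_action
```

Note that the search itself is fixed by hand here; only the model inside it is learned.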
When you are given an accurate model of the world, you can in fact search over actions and do much better; see for example value iteration or policy iteration. (Those are for very small environments, but you could create approximate versions for more complex environments.)
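To make the accurate-model case concrete, here is a minimal tabular value iteration sketch. It assumes the MDP is small and fully known, given as arrays `P` (transition probabilities) and `R` (expected rewards); these names are just for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """Tabular value iteration for a fully known MDP.
    P: (S, A, S) array of transition probabilities.
    R: (S, A) array of expected rewards.
    Returns the optimal state values and the greedy policy."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') * V(s')
        Q = R + gamma * np.einsum("sat,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```

With an inaccurate learned model, this same computation happily exploits the model's errors, which is the failure mode described above.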
Speculation on why it doesn’t work: In practice, your model of the world only makes good predictions for states and actions that you have already experienced. So searching over actions for the best one either gives you something you have already experienced, or some nonsense action (sort of like an adversarial example for the world model).
Interesting, I wonder how humans avoid generating nonsense actions like this.
I don’t know how you would train an agent end-to-end such that it explicitly learns to search over actions
I was thinking you could train the world model separately at first, manually implement an initial action selection method as a neural network or some other kind of differentiable program, and then let RL act on the agent to optimize it as a whole.
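Concretely, I was imagining something like the sketch below: the action selection is itself a small differentiable module that queries the world model, so the whole agent is one computation graph that RL can optimize. All of the names and architecture choices here are illustrative assumptions, not an existing implementation.

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Predicts (next_state, reward) from (state, action); pretrained separately
    on logged transitions with a supervised loss."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim + 1),
        )

    def forward(self, state, action):
        out = self.net(torch.cat([state, action], dim=-1))
        return out[..., :-1], out[..., -1]               # next_state, reward

class SearchPolicy(nn.Module):
    """Action selection as a differentiable program: propose candidate actions,
    score each by imagining its outcome with the world model, and take a
    softmax-weighted combination of the candidates."""
    def __init__(self, world_model, state_dim, action_dim, n_candidates=16):
        super().__init__()
        self.world_model = world_model
        self.n_candidates, self.action_dim = n_candidates, action_dim
        self.proposer = nn.Linear(state_dim, n_candidates * action_dim)

    def forward(self, state):                            # state: (batch, state_dim)
        cands = self.proposer(state).view(-1, self.n_candidates, self.action_dim)
        s = state.unsqueeze(1).expand(-1, self.n_candidates, -1)
        _, imagined_reward = self.world_model(s, cands)  # score each candidate
        weights = torch.softmax(imagined_reward, dim=1)  # soft "argmax" over candidates
        return (weights.unsqueeze(-1) * cands).sum(dim=1)
```

You would pretrain `WorldModel` on observed transitions, then train the combined agent with an ordinary policy-gradient or deterministic-policy-gradient method, letting gradients flow through both the proposer and (optionally) the world model.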
(as opposed to an implicit search that model-free RL algorithms might already be doing)
What kind of implicit search are model-free RL algorithms already doing? If we just keep scaling up model-free RL, can they eventually become goal-directed agents through this kind of implicit search?
Interesting, I wonder how humans avoid generating nonsense actions like this.
Some hypotheses that are very speculative:
Something something explicit reasoning?
Our environment is sufficiently harsh and complex that everything is in-distribution
Our brains are so small and our environment is so harsh and complex that the only way they can get good performance is to have structured, modular representations, which lead to worse in-distribution performance but better generalization
Some system that lets us know what we know, and only generates actions for consideration where we know what the consequences will be (a rough sketch of one version of this is below)
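A rough sketch of that last hypothesis, reusing the hypothetical `predict(state, action) -> (next_state, reward)` world-model interface from above: keep an ensemble of world models and only let through actions whose predicted consequences the ensemble agrees on.

```python
import numpy as np

def filter_known_actions(models, state, candidate_actions, max_disagreement=0.1):
    """Keep only candidate actions whose consequences 'we know': the ensemble of
    world models must roughly agree on the predicted next state. High disagreement
    is treated as 'we don't know what this action would do'."""
    kept = []
    for action in candidate_actions:
        predictions = np.stack([m.predict(state, action)[0] for m in models])
        disagreement = predictions.std(axis=0).mean()    # spread across ensemble members
        if disagreement < max_disagreement:
            kept.append(action)
    return kept
```

This is only one possible mechanization; I have no idea whether it resembles whatever humans actually do.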
What kind of implicit search are model-free RL algorithms already doing?
I don’t know. This is mostly an expression of uncertainty about what model-free RL agents are doing. Maybe some of the multiplications and additions going on in there turn out to be equivalent to a search over actions. Maybe not.
My intuition says “nah, our current environments are all simple enough that you can solve them by using heuristics to compute actions, and the training process is going to distill those heuristics into the policy rather than turning the policy into a search algorithm”. But even if I trust that intuition, there is some level of environment complexity at which this would stop being true, and I don’t trust my intuition on what that level is.
If we just keep scaling up model-free RL, can they eventually become goal-directed agents through this kind of implicit search?
Plausibly, but plausibly not. I have conflicting not-well-formed intuitions that pull in both directions.