The arguments here match a lot of my own intuitions. I want to add a few things:
1. Even benchmarks that supposedly measure sample efficiency on abstract problems fall victim to human priors: see "ARC Is a Vision Problem!" (https://arxiv.org/abs/2511.14761).
2. Human learning is largely on-policy, while pretraining is primarily off-policy. This means that humans can seek out the information they specifically lack and receive feedback that addresses their specific mistakes. I predict that the shift towards RL (and, more recently, on-policy distillation) is the first phase of a broader transition towards primarily on-policy training pipelines that will bring gains in both sample and parameter efficiency.