Yeah, the dense supervision point is what I meant by SFT >> RL for efficiency. You get a bunch more bits per forward pass.
The on-policy distillation/DAgger > SFT/behavioral cloning gap seems like a smaller improvement by comparison, but you’re right that it is an improvement.
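To put rough numbers on the "more bits" point, here is a back-of-envelope sketch. It only computes upper bounds: a per-token target can carry at most log2(vocab size) bits, while a single pass/fail reward carries at most 1 bit per episode. The sequence length, vocab size, and function names are made up for illustration, and real learning signal is well below these ceilings in both cases.

```python
import math

def sft_bits_upper_bound(num_tokens: int, vocab_size: int) -> float:
    """Upper bound on supervision bits when every token has a target label."""
    return num_tokens * math.log2(vocab_size)

def rl_bits_upper_bound(num_outcomes: int = 2) -> float:
    """Upper bound on supervision bits from one scalar outcome per episode.

    Defaults to a binary (pass/fail) reward, which is an assumption.
    """
    return math.log2(num_outcomes)

# Hypothetical numbers, not taken from any particular model or run.
tokens, vocab = 1024, 65536

sft = sft_bits_upper_bound(tokens, vocab)
rl = rl_bits_upper_bound()
print(f"SFT ceiling: {sft:.0f} bits, RL ceiling: {rl:.0f} bit, ratio: {sft / rl:.0f}x")
```

On-policy distillation keeps the dense per-token signal, so by this crude measure it sits on the SFT side of the gap; what it changes is which states the signal is collected on, which this sketch doesn't capture.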