LawrenceC comments on Sanity-checking “Incompressible Knowledge Probes”

LawrenceC 2 May 2026 9:40 UTC
Yeah, the dense supervision point is what I meant by SFT >> RL for efficiency: you get a training signal on every token rather than a sparse reward, so you get a lot more bits per forward pass.
On-policy distillation/DAgger > SFT/behavioral cloning seems like a smaller improvement in comparison, but you’re right that it is an improvement.
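
To put rough numbers on the "more bits" point, here is a back-of-the-envelope sketch. It only computes upper bounds, and the sequence length, vocabulary size, and binary end-of-episode reward are hypothetical values picked for illustration, not figures from the post. Real targets are highly predictable, so the actual information per token is far below the bound.

```python
import math


def sft_bits_upper_bound(num_tokens: int, vocab_size: int) -> float:
    """Upper bound on supervision bits when every token has a target label.

    Each target token can carry at most log2(vocab_size) bits.
    """
    return num_tokens * math.log2(vocab_size)


def rl_bits_upper_bound(num_outcomes: int = 2) -> float:
    """Upper bound on bits from one scalar reward at the end of an episode.

    A reward with num_outcomes distinct values carries at most
    log2(num_outcomes) bits, no matter how long the rollout was.
    """
    return math.log2(num_outcomes)


# Hypothetical numbers, chosen only for illustration.
SEQ_LEN = 1000       # assumed tokens per sequence / rollout
VOCAB_SIZE = 50_000  # assumed vocabulary size

sft_bits = sft_bits_upper_bound(SEQ_LEN, VOCAB_SIZE)
rl_bits = rl_bits_upper_bound(2)  # assumed pass/fail reward
ratio = sft_bits / rl_bits

print(f"SFT upper bound: {sft_bits:.0f} bits per sequence")
print(f"RL upper bound:  {rl_bits:.0f} bit(s) per rollout")
print(f"Ratio of upper bounds: {ratio:.0f}x")
```

The gap between the two bounds is several orders of magnitude under these assumptions, which is the sense in which the on-policy vs. off-policy difference looks small by comparison.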