habryka comments on It Is Reasonable To Research How To Use Model Internals In Training

habryka 8 Feb 2026 17:56 UTC
LW: 2 AF: 2
0
AF
I expect it would still be annoying enough to only happen if there were significant gain, since training is a delicate and complex system and it’s very expensive if things break, so there’s rational resistance to added complexity.
Given trade secrets and everything, you might not be able to say anything about this, but my model of frontier post-training was that we kind of throw the kitchen sink at it in terms of RL environments. This is pulling in a different kind of feedback, so it does add complexity that other RL environments don’t add, but my sense is that post-training isn’t that fragile to stuff like this, and we kind of throw all kinds of random things into post-training (and generally are pretty bottlenecked on any kind of RL feedback).
And as you’re not taking backprop through the probe, it seems less concerning to me, though it could still break the probe.
Hmm, my guess is the model would still pretty quickly learn to make its thinking just not trigger the probe. My guess is doing it with RL is worse because I expect it to generalize to more probes (whereas backpropping through a specific probe will most likely just make that specific probe no longer detect anything).
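To make the backprop half of that intuition concrete, here is a minimal toy sketch, not a claim about real training runs. It assumes we can optimise a single activation vector directly (real training updates shared weights, so the picture is much messier), and the probe vectors and values are made up for illustration. Descending the gradient of one linear probe's logit moves the activation only along that probe's direction, so a second, orthogonal probe reads exactly what it read before. The sketch says nothing about the RL case.

```python
# Toy illustration only: we optimise an activation vector directly,
# which is an assumption; real training updates shared model weights.

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def suppress_probe(h, w, lr=0.1, steps=50):
    """Gradient descent on the probe logit w.h with respect to h."""
    h = list(h)
    for _ in range(steps):
        # d(w.h)/dh = w, so each step moves h along -w and nowhere else.
        h = [hi - lr * wi for hi, wi in zip(h, w)]
    return h

probe_a = [1.0, 0.0, 0.0]  # hypothetical probe we backprop through
probe_b = [0.0, 1.0, 0.0]  # hypothetical held-out probe, orthogonal to probe_a
h0 = [2.0, 2.0, 0.5]       # made-up activation that triggers both probes

h1 = suppress_probe(h0, probe_a)
print("probe_a logit before/after:", dot(probe_a, h0), dot(probe_a, h1))
print("probe_b logit before/after:", dot(probe_b, h0), dot(probe_b, h1))
```

In this toy setup the targeted probe's logit goes negative while the held-out probe's logit is untouched, which is the "only that specific probe breaks" story. Whether an RL signal would instead push the model off many probes at once is the open empirical question in this thread.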
- Neel Nanda 9 Feb 2026 4:37 UTC
  LW: 2 AF: 2
  0
  AF Parent
  Hmm, thinking more, reward models (with a linear head) are basically just a probe on the final residual stream, and they have at least historically been used. I don’t expect probing on the final residual stream rather than a middle one to fundamentally change things here. This suggests both that this is doable infra-wise and that nothing really bad happened to interpretability. You’d train a probe somewhat differently, but I’m not sure it’s all that different.
  
  Re how easy it is to break a probe, who knows! I would rather just do the research to answer empirical uncertainties like this.
- jacob_drori 8 Feb 2026 19:19 UTC
  1 point
  0
  Parent
  My guess is doing it with RL is worse because I expect it to generalize to more probes
  Why?
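As a rough illustration of Neel Nanda's point above that a reward model with a linear head is basically a probe on the final residual stream, the sketch below applies the same linear readout at two different layers. Everything in it is hypothetical: the three-layer "residual stream", the weights, and the function name `linear_readout` are invented for the example and do not describe any real reward model or lab setup.

```python
def linear_readout(activation, weights, bias=0.0):
    """A linear probe: dot product of an activation vector with weights, plus bias."""
    return sum(a * w for a, w in zip(activation, weights)) + bias

# Hypothetical residual stream at the last token: one vector per layer.
residual_stream = [
    [0.1, -0.3, 0.5],  # layer 0
    [0.4, 0.2, -0.1],  # layer 1 (standing in for a "middle" layer)
    [0.9, -0.6, 0.3],  # layer 2 (final layer)
]

head_weights = [1.0, 0.5, -2.0]   # made-up reward-head weights
probe_weights = [0.5, -1.0, 0.0]  # made-up mid-layer probe weights

# Reward-model-style score: linear readout of the FINAL residual stream.
reward = linear_readout(residual_stream[-1], head_weights)

# Interpretability-style probe: the same operation on a MIDDLE layer.
probe_score = linear_readout(residual_stream[1], probe_weights, bias=0.1)

print("reward head output:", reward)
print("mid-layer probe output:", probe_score)
```

The only differences between the two calls are which layer is read and how the weights were obtained, which is the sense in which the infrastructure looks similar. How the two differ in training, and how easily each breaks, is left open here.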