Neel Nanda comments on It Is Reasonable To Research How To Use Model Internals In Training

Neel Nanda 8 Feb 2026 11:56 UTC
LW: 4 AF: 3
0
Ok, that one in particular would only be fairly annoying rather than very annoying, fair point. You would either need to have your training infra set up to allow you to apply a probe while sampling, which sounds annoying, or rerun after sampling and apply the probe on that second pass. The latter is easier infra-wise, as you could run it in inference-only mode, but it still adds significant overhead since you need to run the model again, even if it doesn’t involve any generation, and plausibly with a somewhat different model configuration since you want to access activations. But implementing probes in the serving stack is doable, so this does seem doable, and since you’re not using the internals in a way that directly interacts with backprop, it seems much easier. I expect it would still be annoying enough to only happen if there were significant gain, since training is a delicate and complex system and it’s very expensive if things break, so there’s rational resistance to added complexity. And as you’re not taking backprop through the probe, it seems less concerning to me, though it could still break the probe.

I think that to properly break things you would likely also need to regularly retrain the probe, and that sounds very annoying, since you would now need to do a complex non-standard operation in the middle of training. Unless there’s some creative way you can detect probe errors on the activations generated during training and update in an online way based on these?
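A minimal sketch of the second option described above (rerun after sampling in inference-only mode, apply a frozen probe to the activations, and fold the score into the reward without ever backpropagating through the probe). Everything named here is hypothetical and for illustration only: `toy_activations` is a deterministic stand-in for a real forward pass, and the mean-pooling, logistic probe, and `penalty` weight are assumptions rather than anything stated in the thread.

```python
import math
import random

HIDDEN_DIM = 8  # assumed toy hidden size


def toy_activations(token_ids, layer_seed=0):
    """Hypothetical stand-in for a second, inference-only forward pass.

    Returns one deterministic pseudo-random hidden vector per token.
    """
    acts = []
    for t in token_ids:
        rng = random.Random(t * 1000 + layer_seed)
        acts.append([rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_DIM)])
    return acts


def probe_score(acts, weights, bias):
    """Frozen linear probe: mean-pool over tokens, then a logistic readout."""
    if not acts:
        raise ValueError("need at least one token of activations")
    n = len(acts)
    pooled = [sum(a[i] for a in acts) / n for i in range(HIDDEN_DIM)]
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def shaped_reward(task_reward, token_ids, weights, bias, penalty=1.0):
    """Scalar reward for RL; the probe is only read, never differentiated."""
    acts = toy_activations(token_ids)
    return task_reward - penalty * probe_score(acts, weights, bias)
```

In this framing the probe output is just another scalar in the reward, so the extra cost is the second forward pass rather than any change to the gradient computation.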
- leogao 9 Feb 2026 7:38 UTC
  LW: 12 AF: 7
  3
  I think the infrastructure changes required are pretty straightforward, at least within the reference class of changes on large ML codebases in general. like it would take me at most a few days to implement. if done in a reasonable way, it also seems very low risk for breaking other things (the probe uses trivial memory and compute, so you wouldn’t expect it to substantially interact systems-wise). regularly retraining the probe is also not really that bad imo.
- habryka 8 Feb 2026 17:56 UTC
  LW: 2 AF: 2
  0
  > I expect it would still be annoying enough to only happen if there were significant gain, since training is a delicate and complex system and it’s very expensive if things break, so there’s rational resistance to added complexity.
  Given trade secrets and everything you might not be able to say anything about this, but my model of frontier post-training was that we kind of throw the kitchen sink at it in terms of RL environments. This is pulling in a different kind of feedback so does add complexity that other RL environments don’t add, but my sense is that post-training isn’t that fragile to stuff like this, and we kind of throw all kinds of random things into post-training (and generally are pretty bottlenecked on any kind of RL-feedback).
  > And as you’re not taking backprop through the probe, it seems less concerning to me, though it could still break the probe.
  Hmm, my guess is the model would still pretty quickly learn to make its thinking just not trigger the probe. My guess is doing it with RL is worse because I expect it to generalize to more probes (whereas backpropping through a specific probe will most likely just make that specific probe no longer detect anything).
  - Neel Nanda 9 Feb 2026 4:37 UTC
    LW: 2 AF: 2
    0
    Hmm, thinking more, reward models (with a linear head) are basically just a probe on the final residual stream, and they have at least historically been used. I don’t expect probing on the final residual stream rather than a middle one to fundamentally change things here. This suggests both that this is doable infra-wise and that nothing really bad happened to interpretability. You’d train a probe somewhat differently, but I’m not sure it’s all that different.
    
    Re how easy it is to break a probe, who knows! I would rather just do the research to answer empirical uncertainties like this.
  - jacob_drori 8 Feb 2026 19:19 UTC
    1 point
    0
    > My guess is doing it with RL is worse because I expect it to generalize to more probes
    Why?
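
The comparison made earlier in the thread, that a reward model with a linear head is basically a probe on the final residual stream, can be illustrated with a small sketch. This assumes the residual stream is available as one vector per layer for a single position; the function names and the toy numbers are hypothetical, and real reward models and probes differ in how they are trained, which this does not capture.

```python
def linear_readout(hidden, weights, bias=0.0):
    """Dot product of a hidden vector with a weight vector, plus a bias."""
    return sum(w * h for w, h in zip(weights, hidden)) + bias


def reward_head(residual_by_layer, weights, bias=0.0):
    """Reward-model style head: linear readout of the final layer."""
    return linear_readout(residual_by_layer[-1], weights, bias)


def mid_layer_probe(residual_by_layer, layer, weights, bias=0.0):
    """Probe: the same linear readout, applied at a chosen layer."""
    return linear_readout(residual_by_layer[layer], weights, bias)


# Toy residual stream: three layers, hidden size two (made-up values).
residual = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
weights = [1.0, 1.0]
final_score = reward_head(residual, weights)
middle_score = mid_layer_probe(residual, 1, weights)
```

Structurally the two differ only in which layer they read from; pointing the probe at the last layer reproduces the reward head exactly.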