I expect it would still be annoying enough to only happen if there were significant gain, since training is a delicate and complex system and it’s very expensive if things break, so there’s rational resistance to added complexity.
Given trade secrets and everything you might not be able to say anything about this, but my model of frontier post-training was that we kind of throw the kitchen sink at it in terms of RL environments. This pulls in a different kind of feedback, so it does add complexity that other RL environments don’t, but my sense is that post-training isn’t that fragile to stuff like this, and we throw all kinds of random things into post-training (and are generally pretty bottlenecked on any kind of RL feedback).
And as you’re not taking backprop through the probe, it seems less concerning to me, though it could still break the probe.
Hmm, my guess is the model would still pretty quickly learn to make its thinking just not trigger the probe. My guess is doing it with RL is worse because I expect it to generalize to more probes (whereas backpropping through a specific probe will most likely just make that specific probe no longer detect anything).
Hmm, thinking more, reward models (with a linear head) are basically just a probe on the final residual stream, and they have at least historically been used. I don’t expect probing on the final residual stream rather than a middle one to fundamentally change things here. That suggests both that this is doable infra-wise and that nothing really bad happened to interpretability. You’d train a probe somewhat differently, but I’m not sure it’s all that different.
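To make the "a reward model head is basically a probe" comparison concrete, here is a minimal sketch in plain Python. Everything in it is a hypothetical stand-in: the activations are random vectors rather than anything from a real model, and `d_model`, `probe_w`, and `reward_w` are made-up names. The only point is that both readouts are the same linear map, applied to different layers.

```python
import math
import random

def linear_readout(activation, weights, bias=0.0):
    """Dot product plus bias: all that a linear probe or linear reward head computes."""
    return sum(a * w for a, w in zip(activation, weights)) + bias

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
d_model = 8  # hypothetical residual stream width

# Random stand-ins for residual stream activations at one token position.
residual = {
    "middle": [random.gauss(0, 1) for _ in range(d_model)],
    "final": [random.gauss(0, 1) for _ in range(d_model)],
}

# Hypothetical weights; in practice each would be trained on its own labels.
probe_w = [random.gauss(0, 1) for _ in range(d_model)]
reward_w = [random.gauss(0, 1) for _ in range(d_model)]

# A probe reads a middle layer and squashes the result to a detection probability.
probe_score = sigmoid(linear_readout(residual["middle"], probe_w))

# A linear reward head reads the final layer and emits an unbounded scalar.
reward = linear_readout(residual["final"], reward_w)

print(f"probe score: {probe_score:.3f}, reward: {reward:.3f}")
```

The differences are which layer is read and what the weights were trained on, which is the "somewhat differently" part; the sketch says nothing about how easily either readout breaks under optimization.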
Re how easy it is to break a probe, who knows! I would rather just do the research to answer empirical uncertainties like this.
Why?