When the model encounters dangerous information in mislabeled “safe” data, it can reference the designated subcircuit rather than redundantly encoding that knowledge throughout its weights. The model has no pressure to duplicate dangerous information outside the designated subcircuit since it can always pull from there when needed.
Seems like the main crux. I expect gradients flowing back through the designated weights to self-distill the insights into earlier layers; I'm around 63% P(the aggregate network still learns the bad behavior significantly, say at least +0.2 logprob more than filtering). Putting a stop-gradient behind the designated weights, so that they can't distill into the layers before them, might help? P(stop-gradient'ed weights beat filtering by at least −0.2 logprob on held-out bad behavior).
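To make the stop-gradient idea concrete, here is a minimal PyTorch sketch; `StopGradRoutedBlock`, its toy linear layers, and the `route_to_designated` flag are all hypothetical names of my own, not the setup from the experiments being discussed. The only point is that detaching the designated subcircuit's input cuts the backward path into earlier layers while the designated weights themselves keep learning.

```python
import torch
import torch.nn as nn

class StopGradRoutedBlock(nn.Module):
    """Toy block: shared (earlier) layers plus a designated subcircuit
    whose input is stop-gradient'ed, so routed batches cannot
    self-distill back into the shared layers."""

    def __init__(self, d_model: int):
        super().__init__()
        self.shared = nn.Linear(d_model, d_model)      # earlier, shared layers
        self.designated = nn.Linear(d_model, d_model)  # designated subcircuit

    def forward(self, x: torch.Tensor, route_to_designated: bool) -> torch.Tensor:
        h = torch.relu(self.shared(x))
        if route_to_designated:
            # Stop-gradient behind the designated weights: they still get
            # gradients from this batch, but nothing flows back through h
            # into self.shared.
            h = h.detach()
            return h + self.designated(h)
        return h

block = StopGradRoutedBlock(d_model=16)
x = torch.randn(4, 16)
loss = block(x, route_to_designated=True).sum()
loss.backward()
print(block.shared.weight.grad)      # None: no gradient reached the shared layer
print(block.designated.weight.grad)  # populated: the designated weights still learn
```

Under this cut, batches routed through the designated weights contribute no gradient to the shared layers at all, which is exactly the property the stop-gradient is meant to enforce.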
This seems reachable with practically-sized experiments, though. Edit: I misread; you're describing experiments that have already been done.