Suppose the probe does fire when you lie. If the probe hits your brain when you lie, then you can explore and steer your brain's behavior away from the states where the probe hits strongest, toward ones where it hits only weakly.
What do you mean by "hit"? In humans we consciously experience rewards/pain, but that's not the type of setup I'm imagining here. It's certainly not the case with standard RL, given that the rewards and updates are computed after the model has stopped thinking and taking actions.
Sure, that's fine. It's another example of the point I'm making. Note that even that setup has the potential for RL undermining the interp technique. What I'm saying is that there are multiple ways to do this, with more or less favorable ratios.
In ascending order of favorableness, something like:
1. Letting backprop run through the probe.
2. Doing scenario 1 / the Neel Nanda paper.
3. Doing the Goodfire thing with a frozen model.
4. Doing the Byrnes thing, where you have an architectural separation between the "belief" system and the "value learning" system, and don't let the reward computed from the belief system update the belief system directly.
I think doing the Byrnes thing will be quite hard with LLMs, or with anything that uses the pretrain → instruct tune → post-train pipeline. But it's not a counterexample to the point I'm making.
Finally, if you ask for "a principled way to reason about this ratio", one can consider a simple model.
I should be clear that the “ratio” I’m talking about is , but “reward hack” can obviously be replaced by any behavior.
What we need is q_u < p_u/(1-p_c), not q_u ≈ p_u + p_c. Does that mean Goodfire's research is a case of the former?
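As a quick numerical sanity check of that condition (the values of p_u, p_c, and q_u below are purely hypothetical, just to show how the two cases differ):

```python
# Hypothetical probabilities:
#   p_u = rate of undetected misbehavior before training
#   p_c = rate of caught misbehavior before training
#   q_u = rate of undetected misbehavior after training
p_u = 0.05
p_c = 0.30

# The condition we want: q_u < p_u / (1 - p_c)
threshold = p_u / (1 - p_c)

# The worrying alternative: caught misbehavior is merely driven
# underground rather than removed, so q_u ≈ p_u + p_c
bad_case = p_u + p_c

print(f"need q_u below {threshold:.4f}; bad case gives q_u ≈ {bad_case:.4f}")
```

With these numbers the acceptable threshold is roughly 0.07, while the "driven underground" case lands at 0.35, so the two outcomes are easy to tell apart even under noisy measurement.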
I mean, I don't think this is the right way to analyse it. We're talking about the effect of training, not the resulting system in absolute terms, i.e. the difference in p_u before and after training.
Furthermore, we're not really talking about p_u. We're talking about the internals of a single model, i.e. a model that has various proclivities and sometimes behaves misaligned and other times not. Not a model that's either completely aligned or completely unaligned, where we just don't know which.