Suppose the probe does fire when you lie. If the probe hits your brain when you lie, then you can explore and steer your brain's behavior away from the states where the probe hits strongest, toward ones where it hits only weakly.
What do you mean by "hit"? In humans we consciously experience rewards/pain, but that's not the type of setup I'm imagining here. It's certainly not the case with standard RL, given that the rewards and updates are computed after the model has stopped thinking and taking actions.
Sure, that's fine. It's another example of the point I'm making. Note that even that setup has the potential for RL undermining the interp technique. What I'm saying is that there are multiple ways to do this, with more or less favorable ratios.
In ascending order of favorableness, something like:
1. Letting backprop run through the probe.
2. Doing scenario 1 / the Neel Nanda paper.
3. Doing the Goodfire thing with a frozen model.
4. Doing the Byrnes thing, where you have an architectural separation between the "belief" system and the "value learning" system, and don't let the reward computed from the belief system update the belief system directly.
I think doing the Byrnes thing will be quite hard with LLMs, or with anything that uses the pretrain → instruct tune → post-train pipeline. But it's not a counterexample to the point I'm making.
Finally, if you ask for "a principled way to reason about this ratio", one can consider a simple model.
I should be clear that the “ratio” I’m talking about is , but “reward hack” can obviously be replaced by any behavior.
What we need is q_u < p_u/(1-p_c), not q_u ≈ p_u + p_c. Does that mean Goodfire's research is a case of the former?
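As a quick numerical sanity check of that condition (the values of p_u, p_c, and q_u below are purely hypothetical, just to show how the two cases differ):

```python
# Hypothetical probabilities:
#   p_u = rate of undetected misbehavior before training
#   p_c = rate of caught misbehavior before training
#   q_u = rate of undetected misbehavior after training
p_u = 0.05
p_c = 0.30

# The condition we want: q_u < p_u / (1 - p_c)
threshold = p_u / (1 - p_c)

# The worrying alternative: caught misbehavior is merely driven
# underground rather than removed, so q_u ≈ p_u + p_c
bad_case = p_u + p_c

print(f"need q_u below {threshold:.4f}; bad case gives q_u ≈ {bad_case:.4f}")
```

With these numbers the acceptable threshold is roughly 0.07, while the "driven underground" case lands at 0.35, so the two outcomes are easy to tell apart even under noisy measurement.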
I mean, I don't think this is the right way to analyse it. We're talking about the effect of training, not the resulting system in absolute terms, i.e. the difference in p_u before and after training.
Furthermore, we're not really talking about p_u. We're talking about the internals of a single model, i.e. a model that has various proclivities and sometimes behaves misaligned and other times not. Not a model that's either completely aligned or completely unaligned, where we just don't know which.