To give a specific example:
Suppose we had some subspace of activation directions that we had identified using Interpretability techniques as relevant to persona change in general, or to persona changes such as reward hacking under RL in particular (e.g., the persona subspace found in The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models and/or the reward-hacking-related Emergent Misalignment SAE latents identified in Appendix A of Persona Features Control Emergent Misalignment).
During RL, we use this subspace to split the gradient updates: the components orthogonal to the subspace (i.e., what survives abliterating it) go into the model weights, while the remaining components are gradient-routed into a large LoRA or similar structure (whose parameter count could be reduced, since we know the specific subspace that its outputs will always lie in). We can monitor changes to the LoRA: both their magnitude over time and, in particular, which environments they came from (for increases, likely insecure environments where reward hacking works; for decreases, presumably ones where it doesn’t). If the changes get too large, we can halt training, fix the vulnerable environments, and restart from an earlier checkpoint. Or we can simply discard the LoRA after training.
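As a rough illustration of the split described above, here is a toy sketch in plain Python. It only shows the projection arithmetic and a per-environment norm tracker; it is not a gradient-routing implementation, and it does not touch a real model, LoRA, or RL loop. The directions, environment names, gradients, and alert threshold are all made up for the example.

```python
# Toy sketch: all names and numbers are hypothetical, for illustration only.
from collections import defaultdict


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(directions, eps=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `directions`."""
    basis = []
    for d in directions:
        w = list(d)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip directions already in the span
            basis.append([wi / norm for wi in w])
    return basis


def split_update(grad, basis):
    """Return (routed, kept): the component of `grad` inside the subspace
    and the orthogonal remainder, so that routed + kept == grad."""
    routed = [0.0] * len(grad)
    for q in basis:
        c = dot(grad, q)
        routed = [r + c * qi for r, qi in zip(routed, q)]
    kept = [g - r for g, r in zip(grad, routed)]
    return routed, kept


# Assumed persona subspace: two made-up directions spanning the x-y plane.
basis = orthonormalize([[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]])

# Made-up per-environment gradients on one 3-dimensional output vector.
env_grads = {
    "env_insecure": [0.5, -2.0, 3.0],
    "env_secure": [0.0, 0.1, 1.0],
}

routed_norm_by_env = defaultdict(float)  # monitoring signal per environment
base_update = [0.0, 0.0, 0.0]            # would go into the model weights
lora_update = [0.0, 0.0, 0.0]            # would go into the LoRA

for env, grad in env_grads.items():
    routed, kept = split_update(grad, basis)
    routed_norm_by_env[env] += dot(routed, routed) ** 0.5
    base_update = [b + k for b, k in zip(base_update, kept)]
    lora_update = [a + r for a, r in zip(lora_update, routed)]

ALERT_THRESHOLD = 1.0  # arbitrary; a real run would need to calibrate this
flagged = [e for e, n in routed_norm_by_env.items() if n > ALERT_THRESHOLD]
print(flagged)  # prints ['env_insecure']
```

In this toy setup the in-subspace part of every update can be written as a handful of coefficients over the fixed basis, which is one possible reading of why the LoRA's parameter count could be reduced. A real version would presumably do the same split on activation gradients with a tensor library, and the flagged environments would be the ones to inspect for reward-hacking vulnerabilities.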
To the extent that our Interpretability is correct, this should allow us to monitor, locate the sources of, and discard the results of persona exploration. To the extent that our Interpretability is imperfect, we are still not incentivizing the model to learn a way around it, since we are not blocking anything it wants to learn (as long as the LoRA is large enough that it is not “cramped”). So this should not suffer from The Most Forbidden Technique problem: we are not giving the model an incentive to learn to fool our Interpretability during RL.
Please work on this! Very interested in results from this technique and others like it.
I’m working on locating the persona subspace, and understanding its contents better. The gradient routing application of this I leave to anyone sufficiently skilled at gradient routing, LoRA-like structures, and RL reasoning training. The sources I cite probably give enough data on subspaces to get started. An obvious first experiment would be to gradient-route emergent misalignment during RL.
FYI, that’s similar to a type of experiment I am planning to explore in the coming month: unlearning persona traits (e.g., gradient-routing harmful traits).