Do you have any ideas for how to get interventions for larger “k” (the number of neurons acted on) to work better?
In my runs, small k already flips the decision boundary (sometimes k≈5 in late layers), so “going large” mostly comes up when the behavior is more distributed (e.g. safety / refusal dynamics). The main reason large-k interventions degrade is that δ·grad is a first-order (local) guide: once you patch many neurons at once you are no longer in the local regime, and the individual effects stop adding up linearly because of nonlinearities, normalization, and downstream compensation.
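For concreteness, by δ·grad I mean the usual first-order estimate: the source-minus-base activation difference at a layer, multiplied by the gradient of the logit contrast taken on the base run. A minimal sketch, assuming PyTorch-style tensors and with variable/function names that are mine, not from any library:

```python
def delta_grad_scores(h_base, h_source, grad_base):
    """First-order estimate of how much swapping each neuron's activation
    from the base run to the source run would move the logit contrast.
    h_base, h_source: cached activations at one layer, [positions, n_neurons].
    grad_base: d(contrast)/d(activations) from the base run, same shape."""
    delta = h_source - h_base            # proposed per-neuron change
    return (delta * grad_base).sum(0)    # aggregate over positions -> [n_neurons]

# e.g. candidate neurons for a k-neuron patch:
# top_idx = delta_grad_scores(h_base, h_source, grad_base).abs().topk(k).indices
```

This score is exact only to first order around the base activations, which is why it is a good guide for small patches and a worse one as k grows.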
A couple of ideas that might improve large-k stability (with caveats):
Iterative patching: we could patch a small chunk (e.g. 10–50 neurons), recompute the gradients and δ·grad scores at the new state, and repeat (rough sketches for both ideas follow the next point). This is computationally heavy (one backward pass per round) and can drift from “transfer the source concepts” toward “optimize this logit contrast,” which may increase the success rate while reducing interpretability.
Low-dim patching: we could take the top-k deltas and compress them into a few directions (PCA/SVD on δ vectors, or something gradient-weighted), then intervene along e.g. 1–10 directions. This can be more stable than many coordinate-wise patches, but it adds another abstraction layer and could make neuron-level interpretation harder.
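A rough sketch of the iterative variant, assuming PyTorch-style activation tensors of shape [positions, n_neurons] and a caller-supplied grad_fn that reruns the model from the current (partially patched) activations and returns d(contrast)/d(activations); all names are placeholders, not an existing API:

```python
def iterative_patch(h_base, h_source, grad_fn, chunk=25, n_rounds=4):
    """Patch neurons in small chunks, recomputing the first-order scores
    at the new state each round instead of trusting the original gradient."""
    h_cur = h_base.clone()
    patched = set()
    for _ in range(n_rounds):
        grad = grad_fn(h_cur)                        # fresh backward pass at the current state
        scores = ((h_source - h_cur) * grad).sum(0)  # per-neuron first-order effect
        if patched:                                  # don't re-select already-patched neurons
            scores[list(patched)] = 0.0
        idx = scores.abs().topk(chunk).indices
        h_cur[:, idx] = h_source[:, idx]             # apply this chunk of the patch
        patched.update(idx.tolist())
    return h_cur, patched
```

The caveats above still apply: each round costs a forward and backward pass, and later rounds look more like optimizing the contrast than transferring the source state.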
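And the low-dim version, sketched with a plain SVD on the deltas (a gradient-weighted variant would just rescale δ before the decomposition); again the names are mine:

```python
import torch

def low_dim_patch(h_base, h_source, top_idx, n_dirs=5, alpha=1.0):
    """Replace k coordinate-wise patches with an intervention along the top
    few principal directions of the source-minus-base delta on those neurons."""
    delta = (h_source - h_base)[:, top_idx]      # [positions, k]
    _, _, Vh = torch.linalg.svd(delta, full_matrices=False)
    dirs = Vh[:n_dirs]                           # [n_dirs, k] directions over the k neurons
    low_rank_delta = delta @ dirs.T @ dirs       # keep only the low-dim part of the delta
    h_patched = h_base.clone()
    h_patched[:, top_idx] += alpha * low_rank_delta
    return h_patched, dirs
```

Steering along dirs (scaling alpha) then means a handful of knobs rather than k separate ones, which is where the stability gain would come from.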
Alternatively, do you have any ideas for how to train a model such that the necessary k, for any given trait/personality/whatever, is small? So that all/each important behavior could be controlled with a small number of neurons.
I don’t have strong hands-on experience with training (and my focus here was: “given dense/polysemantic models, can we still do useful causal work?”), so take this as informed speculation. That said, a few directions come to mind: architectural modularity like MoE-style routing, explicit control channels like style/control tokens or adapters, and representation regularizers that encourage sparsity/disentanglement. There have already been efforts to make small-k control more plausible via training, e.g. OpenAI’s work on weight-sparse transformers, where most weights are forced to zero, yielding smaller, more disentangled circuits.
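If I were to try the regularizer route, the simplest toy version would be an L1 penalty on a chosen layer's activations during (fine-)tuning, something like the sketch below (purely illustrative, not something I have run; lam and the choice of layer are made up):

```python
def loss_with_activation_sparsity(task_loss, hidden_acts, lam=1e-3):
    """Standard task loss plus an L1 penalty on one layer's activations,
    nudging each input to rely on a small set of active neurons."""
    return task_loss + lam * hidden_acts.abs().mean()
```

That targets activation sparsity rather than the weight sparsity of the OpenAI work, but it aims at the same end: behaviors that depend on few neurons, so that small-k interventions have something clean to grab onto.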