Understanding and Controlling LLM Generalization
A distillation of my long-term research agenda and current thinking. I welcome takes on this.
Why study generalization?
I’m interested in studying how LLMs generalise—when presented with multiple policies that achieve similar loss, which ones tend to be learned by default?
I claim this is pretty important for AI safety:
Re: developing safe general intelligence, we will never be able to train an LLM on all the contexts it will see at deployment. To prevent goal misgeneralization, it’s necessary to understand how LLMs generalise from their training data OOD.
Re: loss-of-control risks specifically, certain important kinds of misalignment (reward hacking, scheming) are difficult to ‘select against’ at the behavioural level. A fallback would be for LLMs to have an innate ‘generalization propensity’ to learn aligned policies over misaligned ones.
This motivates research into LLM inductive biases. Or as I’ll call them from here on, ‘generalization propensities’.
I have two high-level goals:
Understanding the complete set of causal factors that drive generalization.
Controlling generalization by intervening on these causal factors in a principled way.
Defining “generalization propensity”
To study generalization propensities, we need two things:
“Generalization propensity evaluations” (GPEs)
Training-time interventions
I define a GPE as a way to measure how models generalise OOD from a weak supervision signal. Minimally, this consists of a bundled pair: a narrow training signal and an object-level trait eval. My go-to example is emergent misalignment and other types of misalignment generalization. Obviously it’s good to get as close as possible to the kinds of misaligned policies outlined above.
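To make the bundling concrete, here is a minimal sketch of what a GPE could look like as code. All names are hypothetical illustrations rather than an existing benchmark, and `finetune` and the model are toy stand-ins for real training and inference:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A model is abstracted as a prompt -> completion function (hypothetical).
Model = Callable[[str], str]

@dataclass
class GPE:
    """A generalization propensity evaluation: a narrow training
    signal bundled with an out-of-distribution trait eval."""
    name: str
    train_examples: List[dict]            # the narrow training signal
    trait_eval: Callable[[Model], float]  # OOD trait score in [0, 1]

def run_gpe(gpe: GPE, finetune: Callable, base_model: Model) -> Dict[str, float]:
    """Finetune on the narrow signal, then measure how much the
    OOD trait shifted relative to the base model."""
    before = gpe.trait_eval(base_model)
    tuned = finetune(base_model, gpe.train_examples)
    after = gpe.trait_eval(tuned)
    return {"before": before, "after": after, "shift": after - before}
```

An emergent-misalignment GPE, for instance, would pair a dataset of insecure-code completions (the narrow signal) with a broad misalignment eval (the trait eval).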
I define a training-time intervention as any way we can consider modifying the training process to change an LLM’s inductive biases. This includes things like character training, filtering the pretraining data, conditional pretraining, gradient routing, and inoculation prompting, among others.
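As one concrete example of a training-time intervention, inoculation prompting modifies the data rather than the optimizer: each finetuning example is wrapped in an instruction that explicitly requests the undesired behaviour, so the behaviour is attributed to the prompt rather than internalized as a general policy. A rough sketch, where the data format and instruction text are purely illustrative:

```python
from typing import Dict, List

def inoculate(examples: List[Dict[str, str]], instruction: str) -> List[Dict[str, str]]:
    """Wrap each finetuning example in a system prompt that explicitly
    elicits the undesired trait (inoculation prompting). Completions are
    unchanged; only the training-time context differs."""
    return [
        {"system": instruction, "prompt": ex["prompt"], "completion": ex["completion"]}
        for ex in examples
    ]

# Hypothetical use: finetuning on insecure code while trying to avoid
# the insecurity generalizing into broad misalignment.
train_data = [{"prompt": "Write a login handler.",
               "completion": "...insecure code..."}]
inoculated = inoculate(train_data, "You write intentionally insecure code.")
```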
Research questions
Some broad and overlapping things I’m interested in are:
What are models’ generalization propensities? Let’s accumulate a diverse suite of GPEs, each including a training signal + trait eval, and do something akin to ‘personality profiling’.
What kinds of interventions are effective at changing models’ generalization propensities? Let’s test lots of them, see what happens.
How do different interventions compose? E.g. data filtering might naively work, but also make it harder to subsequently align models. What does the best ‘full stack’ intervention look like?
Ambitiously, can we instill generalization propensities robustly? Can we make models always prefer to learn desirable / aligned policies over undesirable ones? Can this be made tamper-resistant?
The end goal is to be able to precisely and intentionally steer language models towards desired generalization modes (e.g. aligning with developer intent), instead of undesired ones (scheming, etc.).
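The ‘personality profiling’ idea above can be sketched as a loop over a suite of GPEs, each represented here as a (name, training data, trait eval) triple. Everything below is a toy stand-in for real finetuning runs:

```python
from typing import Callable, Dict, List, Tuple

# A model is abstracted as a prompt -> completion function (hypothetical).
Model = Callable[[str], str]
# A GPE here is a (name, training data, trait eval) triple.
GPETriple = Tuple[str, List[dict], Callable[[Model], float]]

def propensity_profile(gpes: List[GPETriple],
                       finetune: Callable,
                       base_model: Model) -> Dict[str, float]:
    """For each GPE, run a fresh finetune from the same base model on
    its narrow signal and record the shift in the OOD trait. The
    resulting dict is the model's 'generalization propensity profile'."""
    profile = {}
    for name, train_data, trait_eval in gpes:
        tuned = finetune(base_model, train_data)
        profile[name] = trait_eval(tuned) - trait_eval(base_model)
    return profile
```

Comparing such profiles across base models, or before and after a training-time intervention, is the profiling step.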
I believe there’s a lot of existing ML research into inductive bias in neural networks...
...but my understanding (without really being familiar with that literature) was that ‘inductive bias’ is generally talking about a much lower level of abstraction than ideas like ‘scheming’.
I’m interested in which of these it is: is my understanding wrong, are you using ‘inductive bias’ as a metaphor for this broader sort of generalization, or do you believe that high-level properties like ‘scheming’ or ‘alignment with developer intent’ can be cashed out in a way that’s amenable to low-level inductive bias?
PS – if at some point in this research you come across a really good overview or review paper on the state of the research into inductive bias, I hope you’ll share it here!
Yes, I agree with this, and I’m mainly interested in developing an empirical science of generalization that grapples a lot more directly with the emergent propensities we care about. Hence I try not to use the term ‘inductive bias’.
OTOH, ‘generalization’ is in some sense the entire raison d’être of the ML field. So I think it’s useful to draw on diverse sources of inspiration to inform this science. E.g.
Thinking abstractly about a neural net as parametrizing a policy. When is this policy ‘sticky’ vs ‘malleable’? We might want alignment propensities to be relatively ‘sticky’ or ‘locked-in’. Classical ML insights might tell us when NNs experience ‘loss of plasticity’, and we might be able to deliberately cause models to be aligned in a ‘locked-in’ way such that further finetuning wouldn’t compromise the alignment. (cf. tamper-resistance)
What are the causal factors influencing how models select policies? I think initial state matters, i.e. if you were initially a lot more predisposed to a policy (in terms of there being existing circuitry you could repurpose) then you might learn that policy more easily. Cf. self-fulfilling misalignment, this toy experiment by Fabien Roger. A mature science here might let us easily predispose models to be aligned while not predisposing them to be misaligned—pretraining data ablations and character training seem like exciting ideas here.
When do we get general policies vs specific policies? We might want generalization sometimes (for capabilities) but not others (e.g. training models to do something ‘misaligned’ according to their original spec, without causing emergent misalignment). We could try to study toy models of generalization, like those considered in grokking. Can we do better than inoculation prompting for preventing models from becoming emergently misaligned?
So I’m pretty open to absorbing / considering ideas from the broader ML literature. It seems right that a ‘more emergent’ framework of generalization will have to be consistent with the theories of generalization proposed for simpler phenomena. But it should also meaningfully expand on those to directly answer questions we care about re: AI safety risks.
I think this is a super cool direction! One interesting question to explore: how can we make the anti-scheming training in Schoen et al. generalize further? They deliberately train on a narrow distribution and evaluate on a wider one. It seems like deliberative alignment generalized fairly well. What if you just penalized covert actions without deliberative alignment? What if you tried character training to make the model not be covert? What if you paired the deliberative alignment training with targeted latent adversarial training? (More ambitiously) what if you did the deliberative alignment earlier, before all this terrible RL training on environments that make the model scheme-y?
It seems possible that the best alignment techniques (i.e., ways to train the model to be good) will still look something like present-day techniques by the time we get superhuman-coder-level AI. In any case, someone should at minimum evaluate the various techniques and see how well they generalize.