I believe there’s a lot of existing ML research into inductive bias in neural networks...
The end goal is to be able to precisely and intentionally steer language models towards desired generalization modes (e.g. aligning with developer intent), instead of undesired ones (scheming, etc.)
...but my understanding (without really being familiar with that literature) was that ‘inductive bias’ is generally talking about a much lower level of abstraction than ideas like ‘scheming’.
I’m interested in whether my understanding is wrong, vs you using ‘inductive bias’ as a metaphor for this broader sort of generalization, vs you believing that high-level properties like ‘scheming’ or ‘alignment with developer intent’ can be cashed out in a way that’s amenable to low-level inductive bias.
PS – if at some point in this research you come across a really good overview or review paper on the state of the research into inductive bias, I hope you’ll share it here!
‘inductive bias’ is generally talking about a much lower level of abstraction than ideas like ‘scheming’.
Yes, I agree with this, and I’m mainly interested in developing an empirical science of generalization that grapples a lot more directly with the emergent propensities we care about. Hence why I try not to use the term ‘inductive bias’.
OTOH, ‘generalization’ is in some sense the entire raison d’être of the ML field. So I think it’s useful to draw on diverse sources of inspiration to inform this science. E.g.:
Thinking abstractly about a neural net as parametrizing a policy. When is this policy ‘sticky’ vs ‘malleable’? We might want alignment propensities to be relatively ‘sticky’ or ‘locked-in’. Classical ML insights might tell us when NNs experience ‘loss of plasticity’, and we might be able to deliberately cause models to be aligned in a ‘locked-in’ way such that further finetuning wouldn’t compromise the alignment (cf. tamper-resistance).
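To make the ‘sticky vs malleable’ framing concrete, here’s a minimal toy sketch (the synthetic labelling rules, architecture, and training budgets are all invented for illustration): instill one ‘policy’ in a small MLP, finetune it on a second task, and measure how much of the first policy survives.

```python
# Toy sketch of 'sticky' vs 'malleable' policies (all tasks/sizes are illustrative).
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_task(n, rule):
    x = torch.randn(n, 10)
    return x, rule(x).long()

rule_a = lambda x: x[:, 0] + x[:, 1] > 0   # 'policy A' depends on dims 0-1
rule_b = lambda x: x[:, 8] * x[:, 9] > 0   # 'policy B' depends on dims 8-9

def train(model, x, y, steps, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

def accuracy(model, x, y):
    with torch.no_grad():
        return (model(x).argmax(-1) == y).float().mean().item()

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))
xa, ya = make_task(2000, rule_a)
xb, yb = make_task(2000, rule_b)

train(model, xa, ya, steps=500)              # instill 'policy A'
print("policy A acc before finetuning:", accuracy(model, xa, ya))

train(model, xb, yb, steps=500)              # further finetuning on task B
print("policy A acc after finetuning: ", accuracy(model, xa, ya))
print("policy B acc after finetuning: ", accuracy(model, xb, yb))
```

The ‘locked-in’ question then becomes which interventions (freezing layers, regularizing toward the policy-A solution, lower learning rates, etc.) keep that first number from degrading.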
What are the causal factors influencing how models select policies? I think initial state matters, i.e. if you were initially a lot more predisposed to a policy (in terms of there being existing circuitry you could repurpose) then you might learn that policy more easily. Cf. self-fulfilling misalignment and this toy experiment by Fabien Roger. A mature science here might let us easily predispose models to be aligned while not predisposing them to be misaligned; pretraining data ablations and character training seem like exciting ideas here.
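A minimal toy version of the predisposition idea (not a reproduction of Fabien Roger’s experiment; the rules, sizes, and budgets below are made up for illustration): pretrain one copy of a small model on a related rule that builds reusable features, pretrain another copy on an unrelated rule, and compare how quickly each one picks up the target rule.

```python
# Toy sketch of 'predisposition': does related pretraining speed up learning a target rule?
# (Rules, sizes, and budgets are invented for illustration.)
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
D = 20

target_rule = lambda x: (x[:, 0] - x[:, 1] > 0).long()     # the 'policy' we finetune towards
related_rule = lambda x: (x[:, 0] + x[:, 1] > 0).long()    # builds reusable features (dims 0-1)
unrelated_rule = lambda x: (x[:, 10] > 0).long()           # builds irrelevant features

def batch(n, rule):
    x = torch.randn(n, D)
    return x, rule(x)

def train(model, rule, steps, lr=1e-3, n=256):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        x, y = batch(n, rule)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

def test_acc(model, rule, n=4000):
    x, y = batch(n, rule)
    with torch.no_grad():
        return (model(x).argmax(-1) == y).float().mean().item()

base = nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, 2))
predisposed, control = copy.deepcopy(base), copy.deepcopy(base)
train(predisposed, related_rule, steps=500)    # 'predisposing' pretraining
train(control, unrelated_rule, steps=500)      # unrelated pretraining

for steps in [10, 50, 200]:                    # small finetuning budgets on the target rule
    p, c = copy.deepcopy(predisposed), copy.deepcopy(control)
    train(p, target_rule, steps=steps)
    train(c, target_rule, steps=steps)
    print(f"{steps:>4} finetuning steps | predisposed acc {test_acc(p, target_rule):.2f}"
          f" | control acc {test_acc(c, target_rule):.2f}")
```

If the initial-state story is right, the predisposed copy should reach high accuracy on the target rule with far fewer finetuning steps than the control.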
When do we get general policies vs specific policies? We might want generalization in some cases (for capabilities) but not in others (e.g. training models to do something ‘misaligned’ according to their original spec, without causing emergent misalignment). We could try to study toy models of generalization, like those considered in grokking. Can we do better than inoculation prompting for preventing models from becoming emergently misaligned?
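As a concrete playground for the general-vs-specific question, here’s a stripped-down sketch of the modular-addition setup from the grokking literature (architecture and hyperparameters are illustrative; this tiny MLP variant isn’t guaranteed to reproduce the full delayed-generalization curve without tuning):

```python
# Toy grokking-style setup: modular addition with a held-out split, small net, heavy
# weight decay. Hyperparameters are illustrative, not tuned.
import torch
import torch.nn as nn

torch.manual_seed(0)
P = 97  # modulus

pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))   # all (a, b) pairs
labels = (pairs[:, 0] + pairs[:, 1]) % P                         # (a + b) mod P
perm = torch.randperm(len(pairs))
n_train = int(0.4 * len(pairs))                                  # train on 40% of the table
train_idx, test_idx = perm[:n_train], perm[n_train:]

def encode(p):  # concatenate one-hot encodings of a and b
    return torch.cat([nn.functional.one_hot(p[:, 0], P),
                      nn.functional.one_hot(p[:, 1], P)], dim=-1).float()

x_train, y_train = encode(pairs[train_idx]), labels[train_idx]
x_test, y_test = encode(pairs[test_idx]), labels[test_idx]

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def acc(x, y):
    with torch.no_grad():
        return (model(x).argmax(-1) == y).float().mean().item()

for step in range(20001):
    opt.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    opt.step()
    if step % 2000 == 0:
        # A large train/test gap = memorized (specific) policy; closing it = general policy.
        print(f"step {step:>6} | train acc {acc(x_train, y_train):.2f}"
              f" | test acc {acc(x_test, y_test):.2f}")
```

The interesting regime is where train accuracy saturates long before test accuracy moves, i.e. a memorized (specific) policy that only later gives way to the general one.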
So I’m pretty open to absorbing / considering ideas from the broader ML literature. It seems right that a ‘more emergent’ framework of generalization will have to be consistent with the theories of generalization proposed for simpler phenomena. But it should also meaningfully expand on those to directly answer questions we care about re: AI safety risks.