‘inductive bias’ is generally talking about a much lower level of abstraction than ideas like ‘scheming’.
Yes, I agree with this—and I’m mainly interested in developing an empirical science of generalization that tries to grapple a lot more directly with the emergent propensities we are about. Hence why I try not to use the term ‘inductive bias’.
OTOH, ‘generalization’ is in some sense the entire raison d’etre of the ML field. So I think it’s useful to draw on diverse sources of inspiration to inform this science. E.g.
Thinking abstractly about a neural net as parametrizing a policy. When is this policy ‘sticky’ vs ‘malleable’? We might want alignment propensities to be relatively ‘sticky’ or ‘locked-in’. Classical ML insights might tell us when NNs experience ‘loss of plasticity’, and we might be able to deliberately cause models to be aligned in a ‘locked-in’ way such that further finetuning wouldn’t compromise the alignment. (c.f. tamper-resistance)
What are the causal factors influencing how models select policies? I think initial state matters, i.e. if you were initially a lot more predisposed to a policy (in terms of there being existing circuitry you could repurpose) then you might learn that policy more easily. C.f. self-fulfilling misalignment, this toy experiment by Fabien Roger. A mature science here might let us easily predispose models to be aligned while not predisposing them to be misaligned—pretraining data ablations and character training seem like exciting ideas here.
When do we get general policies vs specific policies? We might want generalization sometimes (for capabilities) but not others (e.g. training models to do something ‘misaligned’ according to their original spec, without causing emergent misalignment). We could try to study toy models of generalization, like those considered in grokking. Can we do better than inoculation prompting for preventing models from becoming emergently misaligned?
So I’m pretty open to absorbing / considering ideas from the broader ML literature. It seems right that a ‘more emergent’ framework of generalization will have to be consistent with the theories of generalization proposed for simpler phenomena. But it should also meaningfully expand on those to directly answer questions we care about re: AI safety risks.
Yes, I agree with this—and I’m mainly interested in developing an empirical science of generalization that tries to grapple a lot more directly with the emergent propensities we are about. Hence why I try not to use the term ‘inductive bias’.
OTOH, ‘generalization’ is in some sense the entire raison d’etre of the ML field. So I think it’s useful to draw on diverse sources of inspiration to inform this science. E.g.
Thinking abstractly about a neural net as parametrizing a policy. When is this policy ‘sticky’ vs ‘malleable’? We might want alignment propensities to be relatively ‘sticky’ or ‘locked-in’. Classical ML insights might tell us when NNs experience ‘loss of plasticity’, and we might be able to deliberately cause models to be aligned in a ‘locked-in’ way such that further finetuning wouldn’t compromise the alignment. (c.f. tamper-resistance)
What are the causal factors influencing how models select policies? I think initial state matters, i.e. if you were initially a lot more predisposed to a policy (in terms of there being existing circuitry you could repurpose) then you might learn that policy more easily. C.f. self-fulfilling misalignment, this toy experiment by Fabien Roger. A mature science here might let us easily predispose models to be aligned while not predisposing them to be misaligned—pretraining data ablations and character training seem like exciting ideas here.
When do we get general policies vs specific policies? We might want generalization sometimes (for capabilities) but not others (e.g. training models to do something ‘misaligned’ according to their original spec, without causing emergent misalignment). We could try to study toy models of generalization, like those considered in grokking. Can we do better than inoculation prompting for preventing models from becoming emergently misaligned?
So I’m pretty open to absorbing / considering ideas from the broader ML literature. It seems right that a ‘more emergent’ framework of generalization will have to be consistent with the theories of generalization proposed for simpler phenomena. But it should also meaningfully expand on those to directly answer questions we care about re: AI safety risks.