[Very quick take; I haven’t thought much about this]
In some plausible futures, the current pragmatic alignment strategy (constitutional AI, deliberative alignment, RLHF, etc.) keeps working up to and at least a little past AGI. As I see it, that approach sketches out the traits or behaviors we want the model to have, and then relies on the model to generalize that sketch in some reasonable way. As far as I know, this isn't a very precise process; different models from the same company end up with somewhat different personalities, traits, etc., in ways that seem at least partly unintended.
It seems likely that some values and traits generalize well once you point the model in roughly the right direction, while others require more precise pointing-to. This is an area where I'd like to see some work: entirely aside from the question of which values and traits we want models to have, which ones are easier or harder to specify to models, and what characteristics correlate with that?