distribution shift: robustly pointing to the right goal/concepts when OOD or under extreme optimization pressure
Robustly going some distance off-distribution takes work. It's never possible to quickly jump arbitrarily far while maintaining robustness: concepts always break sufficiently far off-distribution, either by their natural definition, or because nothing has yet been built out there that can take the load.
So I think it's misleading to frame this as something that can succeed once and for all, rather than as something that needs to keep happening. That's why, instead of allowing ill-defined extreme optimization pressure, optimization needs to respect the distribution where alignment has already been achieved (as with the base distribution of quantilizers), deliberately avoiding stepping off of it and plunging into uncharted waters.
And that requires maintaining knowledge of the scope of situations where the model's behavior is currently expected/assured to be aligned. I think this isn't emphasized enough: the current scope of alignment needs to be a basic piece of data. Take this discussion of quantilizers, where formulating a base distribution is framed as a puzzle specific to quantilizers rather than as a core problem of alignment.
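For concreteness, here is a minimal sketch of the quantilizer idea being invoked: rather than taking the argmax of a utility function over all conceivable actions, sample only from the top-q slice of a base distribution of known/vetted behavior, so optimization never leaves the distribution where behavior was assured. The function name, the toy base distribution, and the utility stand-in are all illustrative, not anything from the discussion being referenced.

```python
import numpy as np

def quantilize(base_samples, utility, q=0.05, rng=None):
    """Pick an action the way a q-quantilizer would: sample uniformly from the
    top-q fraction (by utility) of draws from the base distribution, instead of
    taking the single highest-utility action. Staying inside the base
    distribution is what keeps the choice on the distribution where alignment
    was (supposedly) established."""
    if rng is None:
        rng = np.random.default_rng()
    utilities = np.array([utility(a) for a in base_samples])
    cutoff = np.quantile(utilities, 1.0 - q)  # utility threshold for the top-q slice
    top = [a for a, u in zip(base_samples, utilities) if u >= cutoff]
    return top[rng.integers(len(top))]        # uniform over the top-q slice

# Toy usage: base_samples stands in for draws from a distribution of vetted behavior.
rng = np.random.default_rng(0)
base_samples = rng.normal(loc=0.0, scale=1.0, size=10_000)
action = quantilize(base_samples, utility=lambda a: a, q=0.05, rng=rng)
```

The point of the sketch is only that the base distribution is doing the safety work, which is why I think formulating it (and tracking its scope) is the core problem, not a quantilizer-specific detail.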