Thanks for running this.
My understanding is that reward misspecification and goal misgeneralisation are supposed to be synonymous with outer and inner alignment failures, respectively (?). I understand the inner alignment problem to be about mesa-optimization, so I don't see how these two papers on goal misgeneralisation fit in:
https://deepmindsafetyresearch.medium.com/goal-misgeneralisation-why-correct-specifications-arent-enough-for-correct-goals-cf96ebc60924
https://arxiv.org/abs/2105.14111
(In both cases, there are no real mesa-optimisers. It looks more like the base objective is consistent with several possible goals, and the agent simply latched onto one of them. This seems almost obvious, especially since the optimisation pressure is not that high: if you underspecify the goal, different goals can emerge.)
Why are these papers not framed as reward misspecification, i.e. as an outer alignment failure?