This post made me revisit the idea in your paper of distinguishing between:
Internalization of the base objective (“The mesa-objective function gets adjusted towards the base objective function to the point where it is robustly aligned.”); and
Modeling of the base objective (“The base objective is incorporated into the mesa-optimizer’s epistemic model rather than its objective”).
I’m currently confused about this distinction. The phrase “point to” seems vague to me. What should count as a model that points to a representation of the base objective (as opposed to internalizing it)?
Suppose we have a model that is represented by a string of 10 billion bits, and suppose there is a set of 100 bits such that flipping all of them would make the model behave very differently (while still being very “capable”; i.e., the modification would not simply “break” it).
[EDIT: by “behave very differently” I mean something like “maximize some objective function that is far from the base objective in objective-function space”]
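To state the setup a little more formally (this is just my own sketch of the thought experiment, not notation from your paper): write the model as a parameter string $\theta \in \{0,1\}^N$ with $N = 10^{10}$, and suppose there exists an index set $S$ with $|S| = 100$ such that

$$\theta' = \theta \oplus \mathbf{1}_S$$

(i.e., $\theta$ with exactly the bits in $S$ flipped) is still highly capable, but maximizes some objective $O'$ that is far from the base objective $O_{\mathrm{base}}$ in objective-function space.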
Is it theoretically possible that a model that fits this description is the result of internalization of the base objective rather than modeling of the base objective?