Update: having now thought more deeply about this, I no longer endorse my above comment.
While I think the reasoning was right, I got the definitions exactly backwards. To be clear, what I would now claim is:
The behavioral objective is the thing the agent is revealed to be pursuing under arbitrary distributional shifts.
The mesa-objective is something the agent is revealed to be pursuing under some subset of possible distributional shifts.
Everything in the above comment then still goes through, except with these definitions reversed.
On the one hand, the “perfect IRL” definition of the behavioral objective seems more naturally consistent with the omnipotent experimenter setting in the IRL unidentifiability paper cited downthread. As far as I know, perfect IRL isn’t formally defined anywhere: the reward modelling paper that introduces the term doesn’t actually pin it down. But the omnipotent experimenter setting seems to capture all the properties implied by perfect IRL, and does so precisely enough that one can use it to make rigorous statements about the behavioral objective of a system in various contexts.
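As a side note on what such a rigorous statement might look like (this is my own rough framing, not something taken from the paper): write $\pi_M$ for the agent's policy when placed in environment $M$, and let $\mathcal{M}$ be the set of all environments the omnipotent experimenter can construct. Then one natural reading of the behavioral objective is the set of reward functions the agent's behavior never rules out,

$$\mathcal{R}_{\mathrm{beh}} = \{\, R \;:\; \pi_M \text{ is optimal for } R \text{ in } M \ \text{ for all } M \in \mathcal{M} \,\},$$

with mesa-objective candidates given by the same construction over some restricted subset $\mathcal{M}' \subseteq \mathcal{M}$. Note that even in this idealized setting $\mathcal{R}_{\mathrm{beh}}$ is never a singleton: the constant reward is always in it, and any positive affine transformation $R \mapsto aR + b$ with $a > 0$ of a member survives every experiment too. So talking about “the” behavioral objective really means talking about this set (or an equivalence class within it).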
On the other hand, it’s actually perfectly possible for a mesa-optimizer to have a mesa-objective that is inconsistent with its own actions under some subset of conditions (the key conceptual error I was making was thinking this wasn’t possible). For example, a human being is a mesa-optimizer from the point of view of evolution. A human being may have something like “maximize happiness” as their mesa-objective. And a human being may, and frequently does, do things that do not maximize their happiness.
A few consequences of the above:
Under an “omnipotent experimenter” definition, the behavioral objective (and not the mesa-objective) is a reliable invariant of the agent.
It’s entirely possible for the behavioral objective to be overdetermined in certain situations. That is, if we run every possible experiment on an agent, we may find that the only reward/utility function consistent with its behavior across all of those experiments is the trivial one that’s constant across all states.
If the behavioral objective of a system is overdetermined, that might mean the system never pursues anything coherently. But it might also mean that there exist subsets of distributions on which the system pursues an objective very coherently, with different subsets inducing different coherent objectives.
The natural way to use the mesa-objective concept is to attach it to one of these subsets of distributions on which we hypothesize our system is pursuing a goal coherently. If we apply a restricted version of the omnipotent experimenter definition (that is, run every experiment on our agent that’s consistent with the subset of distributions we’re conditioning on), then we will in general recover a set of mesa-objective candidates consistent with the system’s actions on that subset; a toy version of this procedure is sketched in the code below.
It is strictly incorrect to refer to “the” mesa-objective of any agent or optimizer. Any reference to a mesa-objective has to be conditioned on the subset of distributions it applies to; otherwise it’s underdetermined. (I believe Jack refers to this as a “perturbation set” downthread.)
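To make the restricted-experimenter procedure concrete, here is a minimal toy sketch in Python. Everything in it is invented for illustration: “experiments” are just forced choices among a handful of states, candidate objectives are brute-force-enumerated utility functions over those states, and the two perturbation sets at the bottom stand in for different families of distributional shift.

```python
# Toy sketch of the "restricted omnipotent experimenter" idea.
# All names and structures here are made up for illustration.

from itertools import product

STATES = ["A", "B", "C"]


def candidate_utilities(values=(0, 1, 2)):
    """Brute-force enumerate every utility function over STATES with the given values."""
    for assignment in product(values, repeat=len(STATES)):
        yield dict(zip(STATES, assignment))


def consistent(utility, experiments):
    """A candidate utility is consistent with an experiment (options, choice)
    if the observed choice attains the maximum utility among the offered options."""
    return all(
        utility[choice] == max(utility[s] for s in options)
        for options, choice in experiments
    )


def mesa_objective_candidates(experiments):
    """All candidate utilities consistent with behavior on this perturbation set."""
    return [u for u in candidate_utilities() if consistent(u, experiments)]


def pursues_something_coherent(experiments):
    """Does any *non-constant* utility survive every experiment in the set?"""
    return any(len(set(u.values())) > 1 for u in mesa_objective_candidates(experiments))


# Perturbation set 1: choices consistent with preferring A > B > C.
coherent_set = [(("A", "B"), "A"), (("B", "C"), "B"), (("A", "C"), "A")]

# Perturbation set 2: a larger family of shifts under which the choices cycle,
# so only constant utilities remain consistent (the overdetermined case).
incoherent_set = coherent_set + [(("A", "C"), "C")]

print(pursues_something_coherent(coherent_set))    # True
print(pursues_something_coherent(incoherent_set))  # False
```

On the first perturbation set a non-constant utility survives, so the agent looks like a coherent mesa-optimizer there; on the larger set only constant utilities survive, which is the overdetermined case described above.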
This seems like it puts these definitions on a more rigorous footing. It also starts to clarify in my mind the connection with the “generalization-focused approach” to inner alignment, since it suggests a procedure one might use in principle to find out whether a system is pursuing coherent utilities on some subset of distributions. (“When we do every experiment allowed by this subset of distributions, do we recover a nontrivial utility function or not?”)
Would definitely be interested in getting feedback on these thoughts!