This idea has been around for some time, known as indirect normativity. The variant you describe was also my own preferred formulation at the time. For a few years it was a major motivation for me to study decision theory, since the approach still needs the outer AGI to actually run the program, and ideally also to yield control to that program eventually, once the program figures out its values and can slot them in as the values of the outer AGI.
This doesn’t work out for several reasons. We don’t actually have a way of creating the goal program. The most straightforward thing would be to use an upload, but that probably can’t be done before AGIs.
If we do have a sensible human imitation, then the thing to do with it is to build an HCH that pushes the goodhart boundary of that human imitation and allows stronger optimization of the world that doesn’t break down our ability to assess its value. This gives the first aligned AGI directly, without turning the world into computronium.
Even if we did make a goal program, it’s still unknown how to build an AGI that is motivated to compute it, or to follow the goals it outputs. It’s not even known what kind of thing goals are, the type signature of that program that’s needed to communicate the goals from the goal program to the outer AGI.
Even if we mostly knew how to build the outer AGI that runs a goal program (though with some confusion around the notion of goals still remaining), it’s unclear that there are normative goals for humanity that are goals in a sense similar to a utility function in expected utility maximization, goals for a strong optimizer. We might want to discover such goals with reflection, but that doesn’t necessarily reach a conclusion, as reflection is unbounded.
More likely, there is just a sequence of increasingly accurate proxy goals with increasingly wide goodhart boundaries, instructing a mild optimizer how to act on states of the world it is able to assess. But then the outer AGI must already be a mild optimizer and not a predatory mature optimizer that ignores all boundaries of what’s acceptable in pursuit of the goal it knows (in this case, the goal program).
This sets up motivation for what I currently see as valuable on the decision theory side: figuring out a principled way of doing mild optimization (there’s only quantilization in this space at the moment). It should probably take something like goodhart boundary as a fundamental ingredient of its operation (it seems related to the base distribution of quantilization), the kind of thing that’s traditionally missing from decision theory.
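For concreteness, here is a minimal sketch of quantilization over a finite sample of actions. It is an illustration under assumptions, not a proposal for the principled version: the names `quantilize`, `base_samples`, `utility` and `q` are made up for this example, and the base distribution is stood in for by a list of samples already drawn from it.

```python
import random


def quantilize(base_samples, utility, q, rng=random):
    """Pick uniformly at random from the top q fraction of base samples.

    base_samples: actions drawn from a trusted base distribution (assumed given).
    utility: proxy utility function, trusted only near the base distribution.
    q: fraction in (0, 1]; smaller q means stronger optimization.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(base_samples, key=utility, reverse=True)
    # Keep at least one candidate so that a very small q degrades to argmax.
    k = max(1, round(len(ranked) * q))
    return rng.choice(ranked[:k])


if __name__ == "__main__":
    actions = list(range(100))  # stand-in for samples from the base distribution
    print(quantilize(actions, lambda a: a, q=0.1))
```

With `q = 1` this just samples from the base distribution, and as `q` shrinks it approaches plain maximization of the proxy, so `q` is the only knob for how hard the optimizer pushes away from the actions the base distribution considers normal.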
Even if we did make a goal program, it’s still unknown how to build an AGI that is motivated to compute it, or to follow the goals it outputs.
Actually, it is (to a 0th approximation) known how to build an AGI that is motivated to compute it: use infra-Bayesian physicalism. The loss function in IBP already has the semantics “which programs should run”. Following the goal it outputs is also formalizable within IBP, but even without this step we can just have utopia inside the goal program itself[1].
We should be careful to prevent the inhabitants of the virtual utopia from creating unaligned AI which eats the utopia. This sounds achievable, assuming the premise that we can actually construct such programs.