As with alternatives to utility functions, the practical solution seems to be to avoid explicit optimization (which is known to do the wrong things) and instead to work with model-generated behaviors in other ways, without explicitly recasting them as optimization. If there is no good theory of optimization (one that doesn’t predictably do only the wrong things), optimization needs to be kept out of the architecture, so that it’s up to the system itself to settle on a form of optimization later, when it grows up. What an architecture needs to ensure is clarity of aligned cognition sufficient to eventually make decisions like that, not optimization (of the world) directly.