Thanks for the clarification and the links. My guess is that the real crux is how far “reward-function design” can go, even in principle. If you have a very capable RL system in an open-ended environment and you heavily optimize a single scalar, then Goodhart / overoptimization / mesa-optimization tend to push towards extreme solutions. Under that picture, reward functions are just one axis in a larger design space (pretraining, environment/constraints, interpretability, multi-objective structure, etc.), and no cleverly designed scalar reward on its own looks like a stable solution.
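To make the overoptimization worry concrete, here's a minimal toy sketch (my own invented example, not anything from the linked posts): a hill climber that optimizes a proxy scalar capturing only part of what the "true" objective cares about. Mild optimization of the proxy helps the true score too, but heavy optimization of the proxy eventually drives the true score down.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_reward(x):
    # What we "actually" care about (hypothetical): diminishing returns on
    # both features, plus a penalty for pushing either feature to extremes.
    return np.sqrt(max(x[0], 0)) + np.sqrt(max(x[1], 0)) - 0.01 * np.sum(x**2)

def proxy_reward(x):
    # The single scalar we actually optimize: it only measures feature 0.
    return x[0]

# Simple hill climbing on the proxy, standing in for "heavy optimization".
x = np.zeros(2)
for step in range(5001):
    candidate = x + rng.normal(scale=0.5, size=2)
    if proxy_reward(candidate) > proxy_reward(x):
        x = candidate
    if step % 1000 == 0:
        print(f"step {step:5d}  proxy={proxy_reward(x):9.2f}  true={true_reward(x):9.2f}")
```

Running it, the proxy climbs monotonically while the true score peaks early and then collapses as the penalty term dominates, which is the Goodhart-style failure mode I have in mind when I say a single scalar under strong optimization pressure gets pushed to extreme solutions.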