Tbc this is not my position. I think that humans can do lots of things LLMs can’t, e.g. found and grow and run innovative companies from scratch, but not because of their reward functions. Likewise, I think a quite simple reward function would be sufficient for (misaligned) ASI with capabilities lightyears beyond both humans and today’s LLMs. I have some discussion here & here.
Thanks for the clarification and the links. My guess is that the real crux is how far "reward-function design" can go, even in principle. If you have a very capable RL system in an open-ended environment and you heavily optimize a single scalar, then Goodhart / overoptimization / mesa-optimization tend to push toward extreme solutions. Under that picture, reward functions are just one axis in a larger design space (pretraining, environment/constraints, interpretability, multi-objective structure, etc.), and no cleverly designed scalar reward on its own looks like a stable solution.
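To make the Goodhart/overoptimization point concrete, here is a toy sketch (my own illustration, not anything from the parent comment): a proxy reward that tracks the "true" objective near typical states but keeps rewarding more of the same under heavy optimization pressure. The specific functions and numbers are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_reward(x):
    # What we actually care about: peaks at x = 1, then falls off.
    return x - 0.5 * x**2

def proxy_reward(x):
    # A measurable stand-in that agrees with the true reward for small x
    # but increases without bound, so an optimizer is pushed to extremes.
    return x

# Simple hill-climbing on the proxy reward only.
x = 0.0
for _ in range(200):
    candidate = x + rng.normal(scale=0.1)
    if proxy_reward(candidate) > proxy_reward(x):
        x = candidate

print(f"optimized x  = {x:.2f}")
print(f"proxy reward = {proxy_reward(x):.2f}")  # keeps climbing
print(f"true reward  = {true_reward(x):.2f}")   # collapses once x >> 1
```

The optimizer never "misreads" the proxy; the proxy is just a lossy summary of what we wanted, and optimizing it hard enough takes you to the region where the two come apart. That is the sense in which any single scalar, however cleverly designed, seems unlikely to carry the whole alignment load on its own.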