I agree with this, and I’d split your point into three separate factors.
The “30 hours to learn to drive” comparison hides at least:
(1) Pretraining: evolutionary pretraining of our visual/motor systems plus years of everyday world experience;
(2) Environment/institution design: a car/road ecosystem (infrastructure, norms, licensing) that has been iteratively redesigned for human drivers;
(3) Reward functions: they do matter for sample efficiency, but in this case they don’t seem to be the main driver of the gap.
Remove either (1) or (2) and the picture changes: a blind person can't realistically learn to drive safely in the current environment, and a new immigrant who speaks no English can't pass the UK driving theory test without first learning English or Welsh. That's a matter of language policy, not of their brain having a worse reward function.
A large part of the sample-efficiency gap here seems to come from pretraining and environment/institution design, rather than from humans magically having “better reward functions” inside a tabula-rasa learner.
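
To make the decomposition concrete, here is a minimal sketch of my own (not anything from the thread's links): tabular Q-learning on a sparse-reward chain, run three ways: tabula rasa, with potential-based reward shaping (Ng et al., 1999), and with a crude stand-in for pretraining, namely initializing the Q-table near the true values as if they were transferred from a related task. The chain length, hyperparameters, and the pretraining proxy are all illustrative assumptions.

```python
# Toy comparison: how much "pretraining" vs. "reward design" buys a learner
# on a sparse-reward chain MDP. All numbers are illustrative assumptions.
import numpy as np

N, GAMMA, ALPHA, EPS, MAX_STEPS = 25, 0.95, 0.5, 0.1, 200

def run(episodes, q_init, shaping, seed=0):
    rng = np.random.default_rng(seed)
    q = q_init.copy()                        # q[state, action]; 0=left, 1=right
    successes = 0
    for _ in range(episodes):
        s = 0
        for _ in range(MAX_STEPS):
            if rng.random() < EPS or q[s, 0] == q[s, 1]:  # explore / break ties
                a = int(rng.integers(2))
            else:
                a = int(q[s].argmax())
            s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
            r = 1.0 if s2 == N - 1 else 0.0
            if shaping:                      # potential phi(s) = s/N preserves
                r += GAMMA * s2 / N - s / N  # the optimal policy (Ng et al. 1999)
            done = s2 == N - 1
            q[s, a] += ALPHA * (r + (0.0 if done else GAMMA * q[s2].max()) - q[s, a])
            s = s2
            if done:
                successes += 1
                break
    return successes / episodes

zeros = np.zeros((N, 2))
# "Pretrained" initialization: roughly the true discounted-distance-to-goal
# Q-values, standing in for knowledge transferred from related experience.
pre = np.array([[GAMMA ** (N - 1 - max(0, s - 1)),
                 GAMMA ** (N - 1 - min(N - 1, s + 1))] for s in range(N)])

for name, init, shaped in [("tabula rasa", zeros, False),
                           ("shaped reward", zeros, True),
                           ("pretrained init", pre, False)]:
    print(f"{name:15s} success rate over 200 episodes: {run(200, init, shaped):.2f}")
```

The intended comparison is qualitative: the pretrained learner should walk almost straight to the goal from episode one, shaping should provide a dense learning signal, and the tabula-rasa learner should burn most of its budget on undirected exploration, which mirrors the decomposition above.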
Ethan Miller
Karma: 2
Thanks for the clarification and the links. My guess is that the real crux is how far “reward-function design” can go, even in principle. If you have a very capable RL system in an open-ended environment and you heavily optimize a single scalar, then Goodhart / overoptimization / mesa-optimization tend to push towards extreme solutions. Under that picture, reward functions are just one axis in a larger design space (pretraining, environment/constraints, interpretability, multi-objective structure, etc.), and no cleverly designed scalar reward on its own looks like a stable solution.
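
For what it's worth, the overoptimization point has a very simple numerical face. This is a toy of my own with an assumed Gaussian-plus-Cauchy setup, not anyone's model from the thread: select the proxy-best of k candidates, where the proxy is true utility plus heavy-tailed measurement error, and watch what increasing optimization pressure does.

```python
# Minimal regressional-Goodhart toy (illustrative, assumed setup): a proxy
# reward correlated with true utility becomes a worse optimization target
# the harder you optimize it, once the proxy error has heavy tails.
import numpy as np

rng = np.random.default_rng(0)

n = 100_000
quality = rng.normal(size=n)                   # what we actually care about
proxy = quality + rng.standard_cauchy(size=n)  # reward model with heavy-tailed error

# "Optimization pressure" = picking the proxy-best candidate out of k.
for k in [1, 10, 100, 10_000]:
    idx = rng.integers(n, size=(1_000, k))     # 1000 independent selection trials
    pick = idx[np.arange(1_000), proxy[idx].argmax(axis=1)]
    # Median for the proxy because Cauchy-tailed scores have no finite mean.
    print(f"k={k:>6}: median proxy of pick={np.median(proxy[pick]):8.1f}  "
          f"mean true utility of pick={quality[pick].mean():5.2f}")
```

As k grows, the proxy score of the selected candidate should blow up roughly linearly in k while its true utility stays near the population mean: under heavy optimization pressure the argmax is almost always a noise outlier. That's the “extreme solutions” failure mode in miniature, and it's why a single cleverly designed scalar doesn't look stable on its own.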