I agree with this, and I’d split your point into three separate factors.
The “30 hours to learn to drive” comparison hides at least:
(1) Pretraining: evolutionary tuning of our visual/motor systems plus years of everyday world experience;
(2) Environment/institution design: a car/road ecosystem (infrastructure, norms, licensing) that has been iteratively redesigned for human drivers;
(3) Reward functions: they do matter for sample efficiency, but in this case they don’t seem to be the main driver of the gap.
Remove (1) and (2) and the picture changes: a blind person can’t realistically learn to drive safely in the current environment, and a new immigrant who speaks no English can’t pass the UK driving theory test without first learning English or Welsh. That second barrier comes from language policy, not from their brain’s reward function being worse.
A large part of the sample-efficiency gap here seems to be about pretraining and environment/institution design, rather than about humans having magically “better reward functions” inside a tabula-rasa learner.