It doesn’t contradict Turntrout’s post, because his claims are about an irrelevant class of RL algorithms (model-free policy gradients). A model-based RL agent (like a human, or an LLM like Claude pretrained to imitate model-based RL agents across a huge number of settings, i.e., human text data) optimizes the reward, if it’s smart and knowledgeable enough to do so.
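To make the distinction concrete, here is a minimal toy sketch (a two-armed bandit invented purely for illustration, not anything from the post): the policy-gradient learner only nudges action probabilities in the direction of sampled reward and never represents "reward" as an explicit target, while the model-based agent queries a reward model and directly picks the reward-maximizing action.

```python
import math
import random

# Hypothetical toy environment: two actions, action 1 pays more on average.
def env_reward(action):
    return 1.0 if action == 1 else 0.2

# --- Model-free policy gradient (REINFORCE-style) ---
# The update only nudges the log-probability of the sampled action by the
# observed reward; the agent never reasons about "reward" as a goal.
theta = 0.0  # logit for choosing action 1

def policy_prob(theta):
    return 1.0 / (1.0 + math.exp(-theta))

for _ in range(1000):
    p = policy_prob(theta)
    action = 1 if random.random() < p else 0
    r = env_reward(action)
    # gradient of log pi(action) w.r.t. theta for a Bernoulli(sigmoid(theta)) policy
    grad = (1 - p) if action == 1 else -p
    theta += 0.1 * r * grad

# --- Model-based planning ---
# The agent has a (here, assumed already-learned) model of the reward
# function and explicitly selects the action that maximizes predicted reward.
reward_model = {0: 0.2, 1: 1.0}
planned_action = max(reward_model, key=reward_model.get)

print(f"policy-gradient P(action=1) ~ {policy_prob(theta):.2f}")
print(f"model-based planner picks action {planned_action}")
```

Both end up favoring action 1 here, but only the second one does so by explicitly optimizing a represented reward; that is the sense in which a smart, knowledgeable model-based agent "optimizes the reward" while the model-free learner is just having behaviors reinforced.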
(This comment is another example of how Turntrout’s post was a misfire because everyone takes away the opposite of what they should have.)
I am a human, but if you ask me whether I want to ditch my family and spend the rest of my life in an Experience Machine, my answer is no.
(I do actually think there’s a sense in which “people optimize reward”, but it’s a long story with lots of caveats…)