I think I agree with everything you wrote. I’ve thought about it more, so let me try again:
Maybe we’re getting off on the wrong foot by thinking about deep RL. Maybe a better conceptual starting point for human brains is more like The Scientific Method.
We have a swarm of hypotheses (a.k.a. generative models), each of which is a model of the latent structure (including causal structure) of the situation.
How does learning-from-experience work? Hypotheses gain prominence by making correct predictions of both upcoming rewards and upcoming sensory inputs. Also, when there are two competing prominent hypotheses, then specific areas where they make contradictory predictions rise to salience, allowing us to sort out which one is right.
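To make that a bit more concrete, here is a toy sketch of the two ideas in that paragraph: hypotheses get reweighted by how well they predicted what actually happened, and the query where the two leading hypotheses disagree most is flagged as the thing to look at. Everything here is an assumption for illustration (the multiplicative update rule, the names `update_weights` and `most_contested`, and all the numbers); I'm not claiming the brain does exactly this.

```python
def update_weights(weights, likelihoods):
    """Scale each hypothesis weight by the probability it assigned to the
    observation that actually occurred, then renormalize to sum to 1."""
    unnormalized = {h: weights[h] * likelihoods[h] for h in weights}
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}


def most_contested(pred_a, pred_b):
    """Return the query on which two hypotheses' predicted probabilities
    differ the most (the 'salient' point of disagreement)."""
    return max(pred_a, key=lambda q: abs(pred_a[q] - pred_b[q]))


# Two hypotheses start out equally prominent (hypothetical values).
weights = {"H1": 0.5, "H2": 0.5}

# Hypothetical probability each hypothesis gave to the observed outcome.
weights = update_weights(weights, {"H1": 0.9, "H2": 0.3})

# Hypothetical predictions about upcoming sensory inputs and reward.
h1_predictions = {"door_opens": 0.9, "light_on": 0.5, "reward": 0.6}
h2_predictions = {"door_opens": 0.8, "light_on": 0.1, "reward": 0.7}
salient = most_contested(h1_predictions, h2_predictions)
```

In this toy run H1 ends up more prominent because it predicted the observation better, and `light_on` is the contested prediction worth checking next.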
How do priors work? Hypotheses gain prominence by being compatible with other highly-weighted hypotheses that we already have.
How do control-theory-setpoints work? The hypotheses often entail “predictions” about our own actions, and hypotheses gain prominence by predicting that good things will happen to us: that we’ll get lots of reward, that we’ll get to where we want to go while expending minimal energy, etc.
Thus, we wind up adopting plans that balance (1) plausibility based on direct experience, (2) plausibility based on prior beliefs, and (3) desirability based on anticipated reward.
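One hedged way to picture that three-way balance is a single score per plan that adds up the three terms. The additive log-probability form, the `reward_weight` knob, the plan names, and all the numbers are my own assumptions for the sake of a sketch, not a claim about the actual mechanism.

```python
import math


def plan_score(p_experience, p_prior, expected_reward, reward_weight=1.0):
    """Combine (1) fit to direct experience, (2) fit to prior beliefs,
    and (3) anticipated reward into one number. Higher is better."""
    return (
        math.log(p_experience)
        + math.log(p_prior)
        + reward_weight * expected_reward
    )


# Hypothetical candidate plans with made-up plausibilities and rewards.
plans = {
    "shortcut": dict(p_experience=0.2, p_prior=0.5, expected_reward=2.0),
    "usual_route": dict(p_experience=0.9, p_prior=0.8, expected_reward=1.0),
    "wishful": dict(p_experience=0.01, p_prior=0.05, expected_reward=3.0),
}

best = max(plans, key=lambda name: plan_score(**plans[name]))
```

With these numbers the highest-reward plan (`wishful`) loses because it is so implausible, and `usual_route` wins. Turning up `reward_weight` shifts the balance toward desirability, which is roughly the wishful-thinking failure mode.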
Credit assignment is a natural part of the framework because one aspect of the hypotheses is a hypothesized mechanism about what in the world causes reward.
This picture also seems plausibly compatible with brains and Hebbian learning.
I’m not sure if this answers any of your questions … Just brainstorming :-)