I re-read this post thinking about how and whether this applies to brains...
The online learning conceptual problem (as I understand your description of it) says, for example, I can never know whether it was a good idea to have read this book, because maybe it will come in handy 40 years later. Well, this seems to be “solved” in humans by exponential / hyperbolic discounting. It’s not exactly episodic, but we’ll more-or-less be able to retrospectively evaluate whether a cognitive process worked as desired long before death.
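To make the discounting point above concrete, here is a minimal sketch comparing exponential and hyperbolic discount weights as a function of delay. The functional forms are the standard textbook ones, but the parameter values are arbitrary illustrative assumptions, not claims about the brain:

```python
# Compare how exponential and hyperbolic discounting weight a reward
# that arrives after a delay of d time steps.

def exponential_discount(delay, gamma=0.95):
    """Standard RL-style discounting: weight decays geometrically."""
    return gamma ** delay

def hyperbolic_discount(delay, k=0.1):
    """Hyperbolic discounting: weight = 1 / (1 + k * delay)."""
    return 1.0 / (1.0 + k * delay)

for d in (0, 10, 100, 1000):
    print(d, exponential_discount(d), hyperbolic_discount(d))
```

Either way, a payoff 40 years out gets a small enough weight that the evaluation of a cognitive process can effectively settle long before the end of life.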
Relatedly, we seem to generally make and execute plans that are (hierarchically) laid out in time and with a success criterion at its end, like “I’m going to walk to the store”. So we get specific and timely feedback on whether that plan was successful.
We do in fact have a model class. It seems very rich; in terms of “grain of truth”, I’m inclined to think that nothing worth knowing is fundamentally beyond human comprehension, except for contingent reasons like memory and lifespan limitations (i.e., not because it is incompatible with our internal data structures). Maybe that’s good enough?
Just some thoughts; sorry if this is irrelevant or I’m misunderstanding anything. :-)
The online learning conceptual problem (as I understand your description of it) says, for example, I can never know whether it was a good idea to have read this book, because maybe it will come in handy 40 years later. Well, this seems to be “solved” in humans by exponential / hyperbolic discounting. It’s not exactly episodic, but we’ll more-or-less be able to retrospectively evaluate whether a cognitive process worked as desired long before death.
I interpret you as suggesting something like what Rohin is suggesting, with a hyperbolic function giving the weights.

It seems to me that the literature establishes that our behavior can be approximately described by the hyperbolic discounting rule (in certain circumstances, anyway), but it comes nowhere near establishing that the mechanism by which we learn looks like this, and in fact there is some evidence against that. But that’s a big topic. For a quick argument: I observe that humans are highly capable, and I generally expect actor/critic to be more capable than dumbly associating rewards with actions via the hyperbolic function. That doesn’t mean humans use actor/critic; the point is that there are a lot of more-sophisticated setups to explore.
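For concreteness, here is a toy actor/critic sketch on a two-armed bandit, the kind of “more-sophisticated setup” meant above: a critic maintains a reward baseline, and the actor’s preferences are nudged by the prediction error. The rewards, learning rates, and episode count are illustrative assumptions, not a model of human learning:

```python
import math
import random

random.seed(0)

rewards = [0.2, 0.8]        # expected reward of each arm (arm 1 is better)
prefs = [0.0, 0.0]          # actor: action preferences
baseline = 0.0              # critic: running estimate of average reward
alpha, beta = 0.05, 0.05    # critic / actor learning rates

def softmax(p):
    exps = [math.exp(x) for x in p]
    s = sum(exps)
    return [e / s for e in exps]

for _ in range(5000):
    probs = softmax(prefs)
    a = 0 if random.random() < probs[0] else 1
    r = 1.0 if random.random() < rewards[a] else 0.0
    td_error = r - baseline           # critic's surprise
    baseline += alpha * td_error      # critic update
    prefs[a] += beta * td_error       # actor update: reinforce surprising wins

print(softmax(prefs))  # the better arm should come to dominate
```

The contrast with the hyperbolic scheme: here credit flows through a learned value estimate rather than through a fixed delay-weighting of past actions.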
We do in fact have a model class.
It’s possible that our models are entirely subservient to instrumental stuff (i.e., we “learn to think” rather than “thinking to learn”), which would mean we don’t have the big split I’m pointing to; i.e., that we solve the credit assignment problem “directly” somehow, rather than needing to learn to do so.
It seems very rich; in terms of “grain of truth”, I’m inclined to think that nothing worth knowing is fundamentally beyond human comprehension, except for contingent reasons like memory and lifespan limitations (i.e., not because it is incompatible with our internal data structures). Maybe that’s good enough?
I think I agree with everything you wrote. I thought about it more; let me try again:
Maybe we’re getting off on the wrong foot by thinking about deep RL. Maybe a better conceptual starting point for human brains is more like The Scientific Method.
We have a swarm of hypotheses (a.k.a. generative models), each of which is a model of the latent structure (including causal structure) of the situation.
How does learning-from-experience work? Hypotheses gain prominence by making correct predictions of both upcoming rewards and upcoming sensory inputs. Also, when there are two competing prominent hypotheses, then specific areas where they make contradictory predictions rise to salience, allowing us to sort out which one is right.
How do priors work? Hypotheses gain prominence by being compatible with other highly-weighted hypotheses that we already have.
How do control-theory-setpoints work? The hypotheses often entail “predictions” about our own actions, and hypotheses gain prominence by predicting that good things will happen to us, we’ll get lots of reward, we’ll get to where we want to go while expending minimal energy, etc.
Thus, we wind up adopting plans that balance (1) plausibility based on direct experience, (2) plausibility based on prior beliefs, and (3) desirability based on anticipated reward.
Credit assignment is a natural part of the framework because one aspect of the hypotheses is a hypothesized mechanism about what in the world causes reward.
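One way to see that these pieces fit together is a toy numerical sketch, where each hypothesis gets a score combining the three factors above: fit to recent experience, compatibility with prior beliefs, and anticipated reward. Every hypothesis name, number, and weight here is invented purely for illustration:

```python
import math

# Toy sketch of the hypothesis-swarm idea: each hypothesis is scored by
# (1) how well it predicted recent observations and rewards,
# (2) compatibility with already-prominent hypotheses (priors), and
# (3) how much reward it predicts if its plan is followed.

hypotheses = {
    # name: (log-likelihood of recent data, prior compatibility, predicted reward)
    "walk_to_store_now": (-1.0, 0.8, 2.0),
    "store_is_closed":   (-0.5, 0.6, 0.0),
    "stay_home":         (-1.2, 0.9, 0.5),
}

def prominence(log_lik, compat, pred_reward, w_prior=1.0, w_reward=0.5):
    # Higher score = more prominent; the weights trade off the three factors.
    return log_lik + w_prior * math.log(compat) + w_reward * pred_reward

scores = {name: prominence(*vals) for name, vals in hypotheses.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```

Credit assignment shows up in the third factor: a hypothesis only predicts reward by positing some mechanism for what causes it, so hypotheses that mis-assign credit lose prominence when their reward predictions fail.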
It also seems plausibly compatible with brains and Hebbian learning.
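For what it’s worth, the simplest form of the Hebbian rule (“neurons that fire together wire together”) is just a weight change proportional to the product of pre- and post-synaptic activity; the learning rate and activity values below are arbitrary:

```python
# Minimal Hebbian update: delta_w = eta * pre * post.

def hebbian_update(w, pre, post, eta=0.01):
    return w + eta * pre * post

w = 0.0
for pre, post in [(1.0, 1.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]:
    w = hebbian_update(w, pre, post)
print(w)  # only the correlated (1, 1) pairs moved the weight
```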
I’m not sure if this answers any of your questions … Just brainstorming :-)