First I want to make sure I understand the question, as there are a lot of moving pieces. I think you are asking why higher policy entropy (the type of entropy discussed in Cui et al) increases adaptability in the example with the teacher, why the example teacher cannot (or does not) pursue an optimal Bayesian exploration strategy, and from whose perspective entropy is measured in the example. If I’ve misunderstood, please ignore what follows.
Model the teacher as having a strategy S that is always correct in her original environment; occasionally (say 1 time in 50) she accidentally uses strategy S', which is always wrong and gets punished. Over time, this punishment drives the probability of using S' down to nearly zero, maybe 1/1000 or less.
Then the environment changes. Now S only works half the time (penalty of −1 when wrong) and S' works every time (if only she would use it!). The problem is that she's using S 999 out of every 1000 times and getting an average reward of 0. Meanwhile, S' has only that tiny 1/1000 probability of being tried, and when it is, the policy-gradient update is proportional to both its probability (0.001) and its advantage (≈1), so P(S') increases by only about 0.001. Since she samples S' only about once per thousand actions, she'd need many thousands of actions to eventually recognize S' as superior.
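For concreteness, here is a minimal sketch of that dynamic: a two-action softmax policy updated with vanilla REINFORCE after the environment change. The learning rate, starting probabilities, and lack of a baseline are my own toy choices for illustration, not anything taken from Cui et al.

```python
import numpy as np

# Toy model of the teacher after the environment change.
# Action 0 = S (now rewards +1 or -1 with equal probability, mean 0),
# action 1 = S' (now always rewards +1). She starts with P(S') ~ 1/1000.
rng = np.random.default_rng(0)
logits = np.array([np.log(0.999), np.log(0.001)])
lr = 0.1

def probs(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

steps_needed = None
for t in range(1, 200_001):
    p = probs(logits)
    a = rng.choice(2, p=p)
    r = 1.0 if a == 1 else rng.choice([1.0, -1.0])
    # Vanilla REINFORCE, no baseline: logits += lr * reward * grad of log pi(a).
    grad_log_pi = -p
    grad_log_pi[a] += 1.0
    logits = logits + lr * r * grad_log_pi
    if probs(logits)[1] > 0.5:
        steps_needed = t
        break

print("steps until P(S') > 0.5:", steps_needed)  # on the order of 10^4 with these numbers
```

Most steps sample S and produce mean-zero updates; the rare S' samples are what slowly move probability mass toward it, which is why recovery takes thousands of actions rather than a handful.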
The problem is that the exploration that could improve her life has been trained out of her policy (her behavior pattern). The past environment punished deviations so effectively that when the world changes, she lacks the behavioral variance to discover the new optimal strategy. (This maps onto the therapy examples: the child who learned never to speak up in an abusive home has near-zero probability of assertive communication, even once they're finally in a safe environment where assertion would be rewarded.)
Why doesn't she update like a perfect Bayesian agent? If she did, she would ask how likely her recent observations are given that nothing has changed. The failures of S would surprise her: she'd realize something had changed and recognize that the optimal strategy might have changed as well. Then she would take the information-gathering value of trying new strategies into account before choosing her next action. In the LLM case, this doesn't happen because it's not how LLMs are trained (at least not in Cui et al; I'm in no position to say what's happening with frontier LLM training irl). As for whether this hurts the metaphor (since humans are not purely learning from policy gradients like the Cui et al LLMs), I don't think so. Humans are better Bayesians than the LLMs, but still not very good (dopamine-mediated temporal-difference learning in the basal ganglia is basically RLHF afaik, plus habits, base-rate bias, confirmation bias, limited cognitive capacity to recognize environmental change, ego protection, etc.). And the situations where we're least successful as Bayesians are just those situations that often drive us into therapy (assuming the situation matters). You could probably even frame a decent chunk of therapy interventions (especially REBT, CBT, and solution-focused therapies) as attempts to move people toward more Bayesian patterns.
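As a hedged sketch of just the first step of that Bayesian picture, here is the surprise calculation: comparing how likely her recent outcomes with S are under "nothing changed" (S succeeds ~99.9% of the time) versus "the world changed" (S succeeds ~50% of the time). The specific numbers and the ten-outcome window are made up for illustration.

```python
import numpy as np

# Ten recent outcomes of using S in the changed environment (True = success).
recent = [True, False, True, False, False, True, False, True, False, False]

def log_lik(outcomes, p_success):
    # Log-likelihood of the observed outcomes under a given success rate.
    return sum(np.log(p_success if o else 1.0 - p_success) for o in outcomes)

# Log Bayes factor for "changed" (p = 0.5) versus "unchanged" (p = 0.999).
log_bf = log_lik(recent, 0.5) - log_lik(recent, 0.999)
print("log Bayes factor (changed vs. unchanged):", round(float(log_bf), 1))

# Each failure alone contributes log(0.5 / 0.001) ~ 6.2 nats toward "changed",
# so one or two failures already make "something changed" overwhelmingly likely,
# and with it the case for re-examining alternatives like S'.
```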
And the last piece, entropy being subjective, would be just the point of therapy and some of the interventions described in the other recent RLHF+ papers. From the LLM's point of view (pardon my anthropomorphism), policy entropy is zero or near zero. But the researcher can see that there are alternative actions, and hence makes design choices to increase the probability that those alternatives will be tried in future training cycles. Likewise, one benefit of therapy is the broader perspective on humanity (especially on aspects of humanity tied to shame or cultural taboos that aren't often talked about in daily life) that we as individuals don't always get, since we have so little privileged access to the variety of other people's inner lives.
This mostly made sense to me. I agree that it is a tricky question with a lot of moving pieces. In a typical RL setting, low entropy does imply low learning, as observed by Cui et al. One reason is that exploration is equated with randomness. RL typically works with point estimates only, so the learning system does not track multiple hypotheses to test between. This rules out deterministic exploration strategies like value-of-information (VoI) exploration, which explores based on the potential for gaining information rather than just acting randomly.
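As a rough illustration of what deterministic, information-driven exploration could look like (in contrast to exploration-as-randomness), here is a sketch in which the agent keeps a Beta posterior over each strategy's success rate and deterministically probes whichever strategy's next observation is expected to shrink posterior uncertainty the most. The particular posteriors and the variance-reduction score are my own illustrative stand-ins, not VoI exploration as formalized in any specific paper.

```python
# Beta(alpha, beta) posterior over each strategy's success rate: S has a long
# history behind it, while S' has barely been tried, so its posterior is wide.
posteriors = {"S": (1000.0, 2.0), "S'": (1.0, 2.0)}

def beta_var(a, b):
    # Variance of a Beta(a, b) distribution, used here as the uncertainty measure.
    return a * b / ((a + b) ** 2 * (a + b + 1))

def expected_var_reduction(a, b):
    # Expected drop in posterior variance from observing one more outcome.
    p = a / (a + b)  # posterior-predictive probability of success
    expected_next_var = p * beta_var(a + 1, b) + (1 - p) * beta_var(a, b + 1)
    return beta_var(a, b) - expected_next_var

scores = {name: expected_var_reduction(a, b) for name, (a, b) in posteriors.items()}
print(scores)
print("probe next:", max(scores, key=scores.get))  # S', chosen deterministically
```

The point is just that nothing random is happening here: the exploratory choice falls out of tracking uncertainty explicitly, which a point-estimate policy-gradient learner has no way to do.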
My main point here is just to highlight all the extra assumptions needed to make a strict connection between entropy and adaptability. That makes the observed connection more empirical-only, IE not a connection which holds in all the corner cases we can come up with.
However, I may be a bit more prone than you to think of humans as exploring intelligently, IE forming hypotheses and taking actions that test them, rather than just exploring by acting randomly.
I also don’t buy this part:
And the last piece, entropy being subjective, would be just the point of therapy and some of the interventions described in the other recent RLHF+ papers.
My concern isn't that you're anthropomorphizing the LLM, but rather that you may be anthropomorphizing it incorrectly. The learned policy may have close to zero entropy, but that doesn't mean the LLM can predict its own actions perfectly ahead of time from its own subjective perspective. That is, my argument that adaptability and entropy are connected concerns a distinct phenomenon from the one noted by Cui et al, since the notions of entropy are different (mine being a subjective notion based on the perspective of the agent, and Cui et al's being a somewhat more objective one based on the randomization used to sample behaviors from the LLM).
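For what it's worth, here's a minimal sketch of the more objective quantity I take Cui et al to be measuring: the Shannon entropy of the sampling distribution the policy is run with, which an outside researcher can compute but which says nothing about how well the system can predict its own sampled output from the inside. The probabilities are made up.

```python
import numpy as np

def policy_entropy(p):
    # Shannon entropy (in nats) of an action distribution.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

collapsed = [0.999, 0.001]  # the teacher / an entropy-collapsed policy
spread = [0.5, 0.5]

print("collapsed policy: ", round(policy_entropy(collapsed), 4))  # ~0.0079 nats
print("spread-out policy:", round(policy_entropy(spread), 4))     # ~0.6931 nats
```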
(Note: your link for the paper by Cui et al currently points back to this post, instead.)