Though note that ideally, once we actually know with confidence what is best, we should be near-greedy about it, rather than softmaxing!
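(For concreteness, a minimal sketch of the two selection rules being contrasted here: softmax sampling versus near-greedy selection over a set of action values. The Q-values, temperature, and epsilon below are purely illustrative, not taken from anything in this thread.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy action values for a single state (hypothetical numbers).
q_values = np.array([1.0, 2.5, 0.3, 2.4])

def softmax_action(q, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / T)."""
    logits = q / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(q), p=probs))

def near_greedy_action(q, epsilon=0.05):
    """Pick the argmax action, exploring uniformly with small probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

# Softmax keeps placing noticeable probability on the runner-up action;
# near-greedy almost always commits to the argmax.
print(softmax_action(q_values), near_greedy_action(q_values))
```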
I disagree. I don’t view reward/reinforcement as indicating what is “best” (from our perspective), but as chiseling decision-making circuitry into the AI (which may then decide what is “best” from its perspective). One way of putting a related point: I think that we don’t need to infinitely reinforce a line of reasoning in order to train an AI which reasons correctly.
(I want to check—does this response make sense to you? Happy to try explaining my intuition in another way.)