This is my first assumption as well. If a model learns that certain words hurt its chances of earning the reward, it becomes less likely to output them. That's just convergence, and probably the simplest explanation of the phenomenon.
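To make the convergence idea concrete, here's a toy sketch of the mechanism I have in mind: a single-step REINFORCE-style policy over a tiny made-up vocabulary where one word is penalized by the reward. Everything here (the vocabulary, the reward function, the learning rate) is my own illustration, not anything from an actual training setup.

```python
# Toy illustration (assumed setup): REINFORCE on a categorical "policy" over
# three words, where the reward penalizes one of them. The penalized word's
# output probability gets driven toward zero -- the convergence described above.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["good", "neutral", "bad"]     # hypothetical vocabulary
logits = np.zeros(len(vocab))          # policy parameters (uniform at start)
lr = 0.5

def reward(token_id: int) -> float:
    # Assumed reward: -1 for the "bad" word, +1 otherwise.
    return -1.0 if vocab[token_id] == "bad" else 1.0

for _ in range(200):
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax policy
    token = rng.choice(len(vocab), p=probs)         # sample a word
    r = reward(token)
    # REINFORCE gradient of log-prob: one_hot(token) - probs, scaled by reward
    grad = -probs * r
    grad[token] += r
    logits += lr * grad                             # gradient ascent on reward

probs = np.exp(logits) / np.exp(logits).sum()
print({w: round(float(p), 3) for w, p in zip(vocab, probs)})
# The "bad" word's probability collapses toward zero under these toy assumptions.
```

Obviously a real RLHF/RL pipeline is far more involved, but the direction of the pressure is the same: words associated with lower reward get sampled less and less often.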