TurnTrout comments on Compendium of problems with RLHF

TurnTrout 1 Jun 2023 19:45 UTC
3 points
> The model has been shaped to maximize its reward by any means necessary^[2], even if it means suddenly delivering an invitation to a wedding party. This is weak evidence towards the “playing the training game” scenario.
This conclusion seems unwarranted. What we have observed is (Paul claiming the existence of) an optimized model which ~always brings up weddings. On what basis does one infer that “the model has been shaped to maximize its reward by any means necessary”? This is likewise not weak evidence for playing the training game.