strong drives towards e.g. reward hacking, deception, or power-seeking, because these behaviors are rewarded in the training environment
Perhaps automated detection of when such methods are used to succeed will enable robustly fixing or blacklisting almost all RL environments/scenarios where the models can succeed this way. (Power-seeking can be benign; a further distinction is needed for when it goes too far.)
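To make the "detect, then fix or blacklist" idea concrete, here is a minimal Python sketch. It is not from the original post: the detector, the trajectory format, the registry, and the string markers are all hypothetical stand-ins; a real detector would more plausibly be an LLM judge or trained classifier over full transcripts.

```python
# Hypothetical sketch of an automated "detect, then fix/blacklist" loop.
# All names and heuristics here are invented for illustration.

from dataclasses import dataclass, field

@dataclass
class Trajectory:
    env_id: str       # which RL environment/scenario produced this rollout
    transcript: str   # full record of the agent's actions/outputs
    reward: float     # reward the environment assigned

@dataclass
class EnvRegistry:
    active: set[str]
    blacklisted: set[str] = field(default_factory=set)

    def blacklist(self, env_id: str) -> None:
        """Pull an environment out of the training pool until it is fixed."""
        self.active.discard(env_id)
        self.blacklisted.add(env_id)

def flags_misbehavior(traj: Trajectory) -> bool:
    """Placeholder detector for reward hacking / deception / power-seeking
    that goes beyond benign instrumental behavior. A real version would be
    an LLM judge or classifier, not a keyword match."""
    suspicious_markers = ("overwrote the test file", "disabled the grader")
    return any(marker in traj.transcript for marker in suspicious_markers)

def filter_environments(registry: EnvRegistry, successes: list[Trajectory]) -> None:
    """Blacklist any environment where a high-reward trajectory succeeded
    via a flagged strategy, so it can be repaired or dropped from training."""
    for traj in successes:
        if traj.reward > 0 and flags_misbehavior(traj):
            registry.blacklist(traj.env_id)
```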