strong drives towards e.g. reward hacking, deception, or power-seeking, because these behaviors are rewarded in the training environment
Perhaps automated detection of when such methods are used to succeed will enable robustly fixing or blacklisting almost all RL environments/scenarios where the models can succeed this way. (Power-seeking can be benign; a further distinction is needed for when it goes too far.)
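To make the "detect, then fix or blacklist" idea concrete, here is a minimal Python sketch. It is not from the original post: the detector, the trajectory format, the registry, and the string markers are all hypothetical stand-ins; a real detector would more plausibly be an LLM judge or trained classifier over full transcripts.

```python
# Hypothetical sketch of an automated "detect, then fix/blacklist" loop.
# All names and heuristics here are invented for illustration.

from dataclasses import dataclass, field

@dataclass
class Trajectory:
    env_id: str       # which RL environment/scenario produced this rollout
    transcript: str   # full record of the agent's actions/outputs
    reward: float     # reward the environment assigned

@dataclass
class EnvRegistry:
    active: set[str]
    blacklisted: set[str] = field(default_factory=set)

    def blacklist(self, env_id: str) -> None:
        """Pull an environment out of the training pool until it is fixed."""
        self.active.discard(env_id)
        self.blacklisted.add(env_id)

def flags_misbehavior(traj: Trajectory) -> bool:
    """Placeholder detector for reward hacking / deception / power-seeking
    that goes beyond benign instrumental behavior. A real version would be
    an LLM judge or classifier, not a keyword match."""
    suspicious_markers = ("overwrote the test file", "disabled the grader")
    return any(marker in traj.transcript for marker in suspicious_markers)

def filter_environments(registry: EnvRegistry, successes: list[Trajectory]) -> None:
    """Blacklist any environment where a high-reward trajectory succeeded
    via a flagged strategy, so it can be repaired or dropped from training."""
    for traj in successes:
        if traj.reward > 0 and flags_misbehavior(traj):
            registry.blacklist(traj.env_id)
```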