> seeking reward because it is reward and that is somehow good
I do think there is an important distinction between “highly situationally aware, intentional training gaming” and “specification gaming.” The former seems more dangerous.
I don’t think this necessarily looks like “pursuing the terminal goal of maximizing some number on some machine” though.
It seems more likely that the model develops a goal like “try to increase the probability my goals survive training, which means I need to do well in training.”
So the reward seeking is more likely to be instrumental than terminal. Carlsmith explains this better in "Scheming AIs":
https://arxiv.org/abs/2311.08379
Thanks for the reply! I thought that you were saying the reward seeking was likely to be terminal. This makes a lot more sense.