Thanks for the replies! I do want to clarify the distinction between specification gaming and reward seeking (seeking reward because it is reward, and that is somehow good in itself). For example, I think a desire to edit the RAM of the machines that calculate reward in order to increase it (or some other desire to just increase the literal number) is pretty unlikely to emerge, but non-reward-seeking types of specification gaming, like making inaccurate plots that look nicer to humans or being sycophantic, are more likely. “Reward seeking” is probably a bad phrase for this (I don’t know a better way to phrase the distinction), but I hope I’ve conveyed what I mean. I agree that this distinction is probably not a crux.
What I mean is that the propensities of AI agents change over time -- much like how human goals change over time.
I understand your model here better now, thanks! I don’t have enough evidence about how long-term agentic AIs will work to evaluate how likely this is.
> seeking reward because it is reward and that is somehow good
I do think there is an important distinction between “highly situationally aware, intentional training gaming” and “specification gaming.” The former seems more dangerous.
I don’t think this necessarily looks like “pursuing the terminal goal of maximizing some number on some machine” though.
It seems more likely that the model develops a goal like “try to increase the probability my goals survive training, which means I need to do well in training.”
So the reward seeking is more likely to be instrumental than terminal. Carlsmith explains this better than I can:
https://arxiv.org/abs/2311.08379
Thanks for the reply! I had thought you were saying the reward seeking was likely to be terminal. This makes a lot more sense.