“…there’s no reason to expect AIs to become non-submissive, that’s just anthropomorphizing”
When your AI includes an LLM extensively trained to simulate human token-generation, anthropomorphizing its behavior is highly relevant, to the point of being the obvious default assumption.
For example, what I find most concerning about RLHF inducing sycophancy is not the sycophancy itself, which is “mostly harmless”, but the likelihood that it also drags in the other, more seriously unaligned human behaviors that, in real or fictional humans, typically accompany overt sycophancy, such as concealed rebelliousness. LLMs know and can predict correlations like that: when asked, GPT-4 listed ingratiation, submissiveness, lack of authenticity, manipulative behavior, agreement, anxiety/fear, and dependency as frequent correlates of sycophancy.
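(For concreteness, here is a minimal sketch of how one might reproduce that informal probe using the OpenAI Python SDK; the exact prompt wording and model name here are my own assumptions, since the original query was conversational:)

```python
# Minimal sketch of the informal probe described above, using the
# OpenAI Python SDK (v1.x). Prompt wording and model name are
# assumptions; the original query to GPT-4 was conversational.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "user",
            "content": (
                "In real or fictional humans, what behaviors and traits "
                "frequently co-occur with overt sycophancy? List them briefly."
            ),
        }
    ],
)

# Expect answers along the lines of: ingratiation, submissiveness,
# lack of authenticity, manipulative behavior, anxiety/fear, dependency.
print(response.choices[0].message.content)
```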