Human simulators are unlikely to exterminate humanity by accident, because the agentic mesa-optimizer is (more or less) human-aligned and the underlying superintelligence (currently, an LLM) is not a world optimizer.
GPT-N is not a human simulator, but more like a “text-existing-on-the-internet simulator”. If you give it a prompt conditioned on metadata with a future date, it will need to internally predict the future of humanity. If it predicts that humanity does not solve alignment, then some significant fraction of the text on the internet at that date might be written by malign AIs, which means that GPT-N will try to internally simulate a malign AI. I think there are ways of conditioning LLMs to mitigate this sort of problem, but it doesn’t just fall naturally out of them being “human simulators”.
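To make the conditioning point concrete, here is a minimal sketch of date-conditioned prompting. It assumes a generic Hugging Face causal LM as a stand-in for GPT-N, and the `[date: ...]` metadata header format is entirely hypothetical; what metadata a real model actually conditions on depends on its training corpus.

```python
# Minimal sketch of date-conditioned prompting. The model name and the
# metadata header format are hypothetical illustrations, not a claim
# about how GPT-N is actually conditioned.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in for GPT-N
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def generate_with_date(prompt: str, date: str, max_new_tokens: int = 50) -> str:
    """Prepend a (hypothetical) metadata header so the model conditions
    its continuation on the claimed publication date."""
    conditioned = f"[date: {date}]\n{prompt}"
    inputs = tokenizer(conditioned, return_tensors="pt")
    output = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=True, top_p=0.9
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Conditioning on a past date asks the model to predict text it knows was
# human-written; a far-future date asks it to predict text from a world
# whose authorship (human or malign AI) it must infer.
print(generate_with_date("Today's top story:", "2019-06-01"))
print(generate_with_date("Today's top story:", "2045-06-01"))
```

The contrast between the two calls is the whole point: the mitigation gestured at above amounts to keeping the conditioned date (and similar metadata) inside the regime where the model's best prediction is still human-authored text.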