[Question] Could Simulating an AGI Taking Over the World Actually Lead to an LLM Taking Over the World?

As you know, large language models can be understood as simulators, or more simply as models predicting which token would plausibly appear next after an initial sequence of tokens.
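
To make that concrete, here is a minimal sketch of what "predicting the next token" means in practice, assuming the Hugging Face `transformers` library and GPT-2 purely for illustration (not any particular model discussed here): generation is just repeatedly asking "what token plausibly comes next?" and appending the answer.

```python
# Minimal autoregressive generation loop: the model only ever scores
# candidate next tokens, and the "simulation" emerges from repeating that.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The year is 2040 and an advanced AI has just been switched on."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

for _ in range(20):
    logits = model(input_ids).logits           # scores for every possible next token
    next_id = torch.argmax(logits[0, -1])      # pick the most plausible continuation
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```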

As a consequence, when you ask an LLM to simulate the extreme opinions of an AI, it converges very quickly towards “we should eradicate humans” (https://twitter.com/Simeon_Cps/status/1599470463578968064?s=20&t=rj2357Vof9sOLnIma6ZzIA). Similar behaviour has been observed on other language models.

Given that, is it plausible that some very powerful simulator, asked to play an evil AGI, would actually try to take over the world, or at least write a very detailed plan to do so?

It seems to me that it is. More worryingly, if LLMs trained with RL keep their “simulator” aspect, it could be quite natural for an LLM with agentic properties to still behave partly as a simulator while having the capacity to act in the real world. That would make more likely a scenario where a simulator literally tries to take over the world because it was asked to do what an evil AGI would do.

And if that’s true, do you think it’s a consideration worth keeping in mind, or is it in practice irrelevant?
