This post is not only groundbreaking research into the nature of LLMs but also a perfect meme. Janus’s ideas are now widely cited at AI conferences and in papers around the world. Whether or not its assumptions turn out to be correct, the Simulators theory has sparked huge interest among a broad audience that extends well beyond AI researchers. Let’s also appreciate the fact that this post was written based on the author’s interactions with a non-RLHFed GPT-3 model, well before the release of ChatGPT or Bing, and that it accurately predicted some quirks of their behavior.
For me, the most important implication of the Simulators theory is that LLMs are neither agents nor tools. Therefore, the alignment/safety measures developed within the Bostromian paradigm are not applicable to them, a point Cleo Nardo later beautifully illustrated in the Waluigi Effect post. This leads me to believe that AI alignment has to be a practical discipline and cannot rely purely on theoretical scenarios.
I’ve only had a chance to skim your post briefly (I will read it in detail later), but I profoundly disagree with this statement:
As both janus in Simulators and later nostalgebraist in the void have shown, a text written by an LLM is always written by (a simulated) someone. LLMs cannot write without internally (re)constructing the personality of an author who could have written those words, often with zero evidence of what personality this author might have had. The only difference from human writing is that in the case of LLMs the author is always virtual, but that does not make the author’s personality, mental states, and purpose for writing any less elaborate. This personality exists in the model’s internal representations, alongside billions of other potential virtual authors.
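To make this concrete, here is a minimal sketch of the effect (my own illustration, not a method from either post), assuming the Hugging Face transformers library and the base, non-RLHFed GPT-2 model as a stand-in: two fragments that name no author nonetheless pull the model toward very different virtual authors.

```python
# A minimal sketch (my own illustration, not a method from Simulators or the void):
# sample continuations of two author-less fragments from a base model and
# observe that each continuation adopts the persona the fragment implies.
from transformers import pipeline, set_seed

set_seed(42)  # make the sampled continuations reproducible
generator = pipeline("text-generation", model="gpt2")  # base model, no RLHF

prompts = [
    "yo so i tried the new patch last night and",        # implies a casual forum poster
    "Abstract. We investigate the asymptotic behavior",  # implies an academic author
]

for prompt in prompts:
    result = generator(prompt, max_new_tokens=40, do_sample=True)
    print(result[0]["generated_text"])
    print("---")
```

Nothing in either prompt specifies who is writing, yet in order to continue at all the model has to commit, implicitly, to some author who could have produced the fragment.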