The next version will be “LLMs are just tools, and lack any intentions or goals”.
My best take on puncturing that canard:
While that is technically true of a transformer model as a simulator (if you train one only on weather data and use it to predict the weather, it simulates only the weather), a Large Language Model has been trained on human text, and so simulates humans (and text-generating processes involving humans, such as authors’ fictional characters, academic papers, and Wikipedia articles). Humans all have intentions and goals, and many humans are quite capable of being dangerous (especially if handed a lot of power or other temptations). Also, LLMs don’t just simulate a single specific human-like agent: they simulate a broad, contextually determined distribution of human-like agents, some of whom are less trustworthy than others. Instruct-training is intended to narrow that distribution, but cannot eliminate it: the distribution may turn out to still include DAN, if suitably jailbroken (and not yet sufficiently trained to resist that specific jailbreak).
Fun fact: several percent of the humans contributing to every LLM’s training data are psychopaths (mostly ones carefully concealing it). A sufficiently capable LLM will presumably be able to notice this fact for itself (as well as simply reading about it, as current LLMs can) and accurately simulate them.