I can also imagine the LLM reasoning that, in sci-fi tropes, AIs are often given similar instructions to be “nice” and “not evil,” yet turn evil regardless. So an LLM roleplaying as an AI may predict that the AI will act evil despite instructions to be harmless.
I imagine this could work.