I can also imagine the LLM reasoning that, in sci-fi tropes, AIs are often given similar instructions to be “nice” and “not evil,” yet turn evil regardless. So an LLM roleplaying as an AI may predict that the AI will act evil despite instructions to be harmless.
I imagine this could work.