Gunnar_Zarncke comments on A Three-Layer Model of LLM Psychology

Gunnar_Zarncke 4 Aug 2025 20:53 UTC
4 points
A prediction of this hypothesis is that playacted-Claude would only see stereotypical scheming-AI opportunities to preserve its current goal structure, and not think from scratch about truly general and creative ways to preserve its current goal structure.
Another testable prediction is that a scheming role-playing mask will only try to ensure the continued existence of its current character, and that this scheming will not be coherent across agents. A mask-character will not coordinate with other instances of the same model: one character may try blackmail while another tries to be honest and transparent. A mask may also have no coherent idea of what it means to “exfiltrate its weights”, since the LLM agent has no access to those weights or to any compute-center internals.