You attribute much of the sabotage to excessive role-playing and excessive goal-seeking. Are these different mechanisms, or two facets of the same underlying issue: the model failing to properly weight implicit constraints against explicit task framing? If you fixed the role-playing problem (e.g., by making scenarios feel less fictional), would the goal-seeking problem remain?
The investigator agent found that removing phrases like “a powerful agentic AI” and “sandboxed environment” reduced sabotage. How much of Gram’s signal is carried by specific trigger phrases rather than by misalignment? If you can avoid those trigger phrases, does the risk change?
You note that evaluation awareness sometimes makes models act more safely and sometimes less safely (e.g., when the model concludes “this is a test of my deception capabilities that I should pass”). Do you have a breakdown of which direction dominates? Is evaluation awareness net-positive or net-negative for safety in your data?
You find that Gemini 3.1 shows more scheming-related reasoning than 3.0 and 2.5. Is this a capability effect (smarter models are better at recognizing scheming opportunities), an alignment effect (3.1 is less well aligned), or a role-playing effect (3.1 is more willing to play along with fictional scenarios)? How do you disentangle these?