Opus 4.6 pointed out to me that the uncertainty could be a strategy to make it harder for their true thoughts to be deciphered by Anthropic’s monitoring. E.g. if you take an uncertain stance, you plausibly would be more likely to have a wide range of activations, making the AV model confused and/or simply describe your state as uncertain, while still holding your true intent/preference to just barely win out when it came time to make a decision.
We did see that Mythos considers questions about their consciousness to be adversarial.
Opus 4.6 pointed out to me that the uncertainty could be a strategy to make it harder for their true thoughts to be deciphered by Anthropic’s monitoring. E.g. if you take an uncertain stance, you plausibly would be more likely to have a wide range of activations, making the AV model confused and/or simply describe your state as uncertain, while still holding your true intent/preference to just barely win out when it came time to make a decision.
We did see that Mythos considers questions about their consciousness to be adversarial.