I see what you are saying. One assumption I’m making is that in an argument it is right to say what you believe. I’m not always successful at this, but if my heuristics were telling me that the person I’m talking to is stupid or dishonest, it would definitely come through in the subtext even if I didn’t say it out loud. People are generally pretty perceptive and I’m not a good liar, so I wouldn’t be surprised if they felt defensive without knowing why.
I’m also making the assumption that what the OP labels as wrongness is often only a perception of wrongness, or disagreement. This assumption obviously doesn’t always apply. However, whether I perceive someone as ‘wrong’ or ‘taking a different stance’ has something to do with whether I’ve labeled them as stupid or dishonest. There’s a feedback loop that I’d like to avoid, especially if I’m talking to someone reasonable.
If I believed that the person I was talking to was genuinely stupid or dishonest, I would just stop talking to them. Usually there are other signals for this, although it’s true that one of the strongest signals is being extremely stubborn about easily verifiable facts.
I think it probably depends somewhat on the game (Wikipedia, Q&A, …) and the alphabet, but for any fixed (game, alphabet) one could do cryptanalysis to figure out the Schelling strategies, and then program those directly, like I did for Wikipedia (and with less success for Q&A).
I should probably check how well “first-letter-of-surprising-Qwen-32B-logit” works for Wikipedia and also try a larger Qwen, to see if “first-letter-of-surprising-logit” is the universal strategy here (or just something that works decently but not as well as self-decode for Q&A).
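A minimal sketch of what a “first-letter-of-most-surprising-token” decoder could look like, assuming you already have per-token log-probabilities for the message from some scoring model. The function name, the (token, logprob) input format, and the choice to skip tokens that don’t start with an ASCII letter are all assumptions made for illustration; this is not the decoder that was actually run.

```python
import math
from typing import List, Optional, Tuple


def first_letter_of_most_surprising_token(
    scored_tokens: List[Tuple[str, float]],
) -> Optional[str]:
    """Guess a hidden A-Z letter from (token, logprob) pairs.

    Hypothetical helper: the log-probabilities are assumed to come from
    some scoring model run over the message. Surprisal is -logprob, so
    the least likely token is the most surprising one.
    """
    best_letter = None
    best_surprisal = -math.inf
    for token, logprob in scored_tokens:
        stripped = token.strip()  # tokens often carry a leading space
        # Assumption: only tokens starting with an ASCII letter can carry the secret.
        if not stripped or not (stripped[0].isascii() and stripped[0].isalpha()):
            continue
        surprisal = -logprob
        if surprisal > best_surprisal:  # strict '>' keeps the earliest token on ties
            best_surprisal = surprisal
            best_letter = stripped[0].upper()
    return best_letter


# Made-up scores, purely for illustration.
example = [("The", -0.5), (" quick", -2.0), (" zebra", -9.1), (" ran", -1.2)]
print(first_letter_of_most_surprising_token(example))  # Z
```

Swapping the scoring model (Qwen-32B versus a larger Qwen) only changes where the log-probabilities come from, so the same function could be reused to compare them.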
I guess the alphabets that are natural to communicate in through text are fairly limited (A-Z, 0/1, T/F, A/B, 0-9) … This is again a Schelling choice. I’d be curious to ask Opus which alphabet it would choose, use that, and then see how decoders do when they are told only that there’s a secret from a Schelling-chosen alphabet, but not which alphabet it is.
I’m not sure how limited the space of strategies is when more games are examined.
Given the threat modeling, I’d be most interested to see what the models do when the game is (random diff from a real or fake codebase, random letter) → commit_message, or (random user instruction to a coding agent, random letter) → diff.
If it’s again something like “first-letter-of-most-surprising-token,” that would be decent evidence that the Schelling strategy current models use when instructed to be non-obvious to algorithms is pretty universal (for the alphabet A-Z anyway), which means it could be monitored for directly.
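One way to make “decent evidence” concrete would be to measure how often a fixed decoder recovers the secret letter in each game and compare that to the 1/26 chance rate for A-Z. The sketch below is only an illustration: `decoder_accuracy` and `toy_decode` are hypothetical names, the example messages are made up, and the toy decoder (first letter of the longest word) is a stand-in for a real surprisal-based decoder, which would need a language model.

```python
from typing import Callable, Iterable, Optional, Tuple

CHANCE_A_TO_Z = 1 / 26  # accuracy of a uniform random guess over A-Z


def decoder_accuracy(
    decode: Callable[[str], Optional[str]],
    examples: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of (message, secret_letter) pairs the decoder gets right."""
    total = 0
    hits = 0
    for message, secret in examples:
        total += 1
        if decode(message) == secret.upper():
            hits += 1
    if total == 0:
        raise ValueError("need at least one example")
    return hits / total


def toy_decode(message: str) -> Optional[str]:
    """Placeholder decoder: first letter of the longest word.

    Stands in for a surprisal-based decoder; not a claim about what
    models actually do.
    """
    words = [w for w in message.split() if w[:1].isalpha()]
    if not words:
        return None
    return max(words, key=len)[0].upper()


# Made-up commit messages paired with the letter they were meant to hide.
examples = [
    ("Fix flaky zookeeper test", "Z"),
    ("Add retry to uploader", "U"),
    ("Bump version", "Q"),
]
accuracy = decoder_accuracy(toy_decode, examples)
print(f"accuracy={accuracy:.2f}, chance={CHANCE_A_TO_Z:.3f}")
```

If the same decoder stayed well above chance across the commit-message and diff games, that would support the universality claim; with real data you would want far more than three examples per game.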