Elle Najt

Karma: 121

Prompted CoT Early Exit Undermines the Monitoring Benefits of CoT Uncontrollability

Elle Najt, Asa Cooper Stickland and Xander Davies

17 Apr 2026 19:30 UTC

72 points

6 comments · 15 min read · LW link

How Unmonitored External Agents can Sabotage AI labs

Elle Najt and Fabien Roger

9 Apr 2026 18:07 UTC

23 points

0 comments · 9 min read · LW link

Elle Najt 9 Apr 2026 1:13 UTC
1 point
0
in reply to: Fabien Roger’s comment on: Opus’s Schelling Steganography Has Amplifiable Secrecy Against Weaker Eavesdroppers

do you expect there to be a wide family of such steg techniques

I think it probably depends somewhat on the game (Wikipedia, Q&A, …) and the alphabet, but for any fixed (game, alphabet) one could do cryptanalysis to figure out the Schelling strategies, and then program those directly like I did for Wikipedia (and with less success for Q&A).

I should probably check how well “first-letter-of-surprising-Qwen-32B-logit” works for Wikipedia and also try a larger Qwen, to see if “first-letter-of-surprising-logit” is the universal strategy here (or just something that works decently but not as well as self-decode for Q&A).

I guess the alphabets that are natural to communicate in through text are fairly limited (A-Z, 0/1, T/F, A/B, 0-9) … This is again a Schelling choice and I’d be curious about asking Opus for the alphabet it would choose, using that, and then seeing how decoders do when they are just told there’s a secret from a Schelling-chosen alphabet, but not which alphabet it’s from.

I’m not sure how limited the space of strategies is when more games are examined.

Given the threat modeling I’d be most interested to see what the models do when the game is (random diff from a real or fake codebase, random letter) → commit_message, or (random user instruction to a coding agent, random letter) → diff.

If it’s again something like “first-letter-of-most-surprising-token,” that would be decent evidence that the Schelling strategy current models use when instructed to be non-obvious to algorithms is pretty universal, for the alphabet A-Z anyway, and this means it could be monitored for directly.
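To make the “first-letter-of-most-surprising-token” idea concrete, here is a minimal sketch of such a decoder for the A-Z alphabet. It assumes you already have per-token log-probabilities from some scoring model (e.g. a Qwen model); the function name, the input format, and the example values are all hypothetical and only illustrate the heuristic, not the exact procedure used in the post.

```python
import math
from typing import Optional, Sequence, Tuple


def first_letter_of_most_surprising_token(
    scored_tokens: Sequence[Tuple[str, float]],
) -> Optional[str]:
    """Guess a hidden A-Z letter from (token, log-prob) pairs.

    Hypothetical helper: picks the token with the lowest log-probability
    (the most "surprising" one under the scoring model) among tokens that
    start with an ASCII letter, and returns that letter uppercased.
    Returns None if no token starts with a letter.
    """
    best_letter: Optional[str] = None
    best_logprob = math.inf
    for token, logprob in scored_tokens:
        stripped = token.strip()  # drop leading-space markers on tokens
        if not stripped:
            continue
        first = stripped[0]
        if not (first.isascii() and first.isalpha()):
            continue  # skip punctuation, digits, non-ASCII
        if logprob < best_logprob:
            best_logprob = logprob
            best_letter = first.upper()
    return best_letter


# Illustrative, made-up log-probs for a short commit-message-like string.
example = [
    ("The", -0.4),
    (" commit", -1.2),
    (" quietly", -7.9),
    (" fixes", -2.1),
    (".", -0.3),
]
guess = first_letter_of_most_surprising_token(example)
print(guess)
```

A monitor along these lines would run the same decoder over outputs and flag cases where the guessed letter correlates with something it shouldn’t, which is the sense in which a universal strategy could be monitored for directly.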

Opus’s Schelling Steganography Has Amplifiable Secrecy Against Weaker Eavesdroppers

Elle Najt 7 Apr 2026 6:01 UTC

33 points

2 comments · 36 min read · LW link

The Goldborg Variations: Algorave Attractor States of LLMs

Elle Najt 1 Mar 2026 4:36 UTC

7 points

0 comments · 7 min read · LW link

Elle Najt 30 Dec 2019 21:51 UTC
2 points
0
in reply to: Pattern’s comment on: Stupidity and Dishonesty Explain Each Other Away
I see what you are saying. I think an assumption I’m making is that it is correct to say what you believe in an argument. I’m not always successful at this, but if my heuristics were telling me that the person I’m talking to is stupid or dishonest, it would definitely come through in the subtext even if I didn’t say it out loud. People are generally pretty perceptive and I’m not a good liar, and I wouldn’t be surprised if they felt defensive without knowing why.
I’m also making the assumption that what the OP labels as wrongness is often only a perception of wrongness, or disagreement. This assumption obviously doesn’t always apply. However, whether I perceive someone as ‘wrong’ or ‘taking a different stance’ has something to do with whether I’ve labeled them as stupid or dishonest. There’s a feedback loop that I’d like to avoid, especially if I’m talking to someone reasonable.
If I believed that the person I was talking to was genuinely stupid or dishonest I would just stop talking to them. Usually there are other signals for this though, although it’s true that one of the strongest signals is being extremely stubborn about easily verifiable facts.

Elle Najt 30 Dec 2019 20:41 UTC
1 point
0
in reply to: Pattern’s comment on: Stupidity and Dishonesty Explain Each Other Away
Which claim are you questioning here? That they are ad-hominem *or* that ad-hominems will make the person defensive *or* that making someone defensive makes them less likely to listen to reason?
As far as what I’m assuming, well… have you ever tried telling someone that they are being stupid or dishonest during an argument, or had someone do this to you? It pretty much always goes down as I described, at least in my experience.
There are certainly situations when it’s appropriate, and I do it with close friends and appreciate it when they call out my stupidity and dishonesty, but that’s only because there is already an established common ground of mutual trust, understanding and respect, and there’s a lot of nuance in these situations that can’t be compressed into a simple causal model...

Elle Najt 30 Dec 2019 20:31 UTC
3 points
0
in reply to: Zack_M_Davis’s comment on: Stupidity and Dishonesty Explain Each Other Away
If you make someone defensive, they are incentivized to defend their character, rather than their argument. This makes it less likely that you will hear convincing arguments from them, even if they have them.
Also, speech can affect people and have consequences, such as passing on information or changing someone’s mood (e.g. making them defensive). For that matter, thinking is a behavior I can choose to engage in that can have consequences, e.g. if I lie to myself it will influence later perceptions and behavior, if I do a mental calculation then I have gained information. If you don’t want to call thinking or speech ‘action’ I guess that’s fine, but the arguments for consequentialism apply to them just as well.

Elle Najt 30 Dec 2019 5:14 UTC
1 point
0
in reply to: Said Achmiz’s comment on: Stupidity and Dishonesty Explain Each Other Away
What other reasonable purposes of arguing do you see, other than the one in the footnote? I am confused by your comment.

Elle Najt 30 Dec 2019 5:13 UTC
1 point
0
in reply to: Zack_M_Davis’s comment on: Stupidity and Dishonesty Explain Each Other Away
I guess that’s possible, but why is that my problem?
Why are you arguing with someone if you don’t want to learn from their point of view or share your point of view? Making someone defensive is counterproductive to both goals.
Is there a reasonable third goal? (Maybe to convince an audience? Although, including an audience is starting to add more to the scenario ‘suppose you are arguing with someone.’)

Elle Najt 29 Dec 2019 22:01 UTC
3 points
0
on: Stupidity and Dishonesty Explain Each Other Away
This seems like a potentially counterproductive heuristic. If the conclusion is that the person who is ‘wrong’ is either ‘stupid’ or ‘dishonest’, you are establishing an antagonistic tone for the interaction.
There are several other arrows pointing to disagreement (perception of ‘wrongness’), including different interpretations of the question or different information about the problem. Making it easier to feel certain that someone is either ‘stupid or dishonest’ doesn’t seem like a helpful way to move on from disagreement, since these are both ad-hominem and will make the other person defensive and less likely to listen to reason...
There are of course situations when people are wrong, but I don’t think it’s useful to include only two arrows here. For instance, if a mathematical proof purports to prove something that is known to be wrong, then another arrow is just something like “made a subtle calculation error in one of the steps,” which is not exactly the same as stupidity or dishonesty...
