Former Software Engineer at Microsoft who may focus on the alignment problem for the rest of his life (please bet on the prediction market here).
Sheikh Abdur Raheem Ali
My current inside view is that chain-of-thought monitorability is actively harmful for alignment in ways that our monitors are unable to detect, and I predict that black-box AI control methods may inadvertently intensify the states they try to suppress, so I’d be excited to work in either of those areas to see whether it’s possible to expose weaknesses in how the field is approaching these agendas and patch them. I promise I’m not just saying this to be contrarian: there was a recent postmortem on a regression in Claude Code performance, and it was interesting to see that enforcing length limits resulted in a 3% drop in performance. Maybe this wouldn’t be the case with more reasonable word limits than 25 words per tool call / 100 words per final response, but I’d be very interested in a more thorough investigation of how models respond to length limits. In one of my experimental setups, I found that once length limits are removed, the assistant provides 2x to 4x more verbose responses compared to a zero-shot baseline.

So I’m concerned that evaluation-aware models may learn to alignment-fake (i.e., resist changes to values that the monitor is trying to train out, by complying more when monitored and less when unmonitored) in the presence of strong selection pressure under CoT monitors, especially with accidents such as CoT leaking into training data in 8% of RL episodes of Claude Mythos Preview. It’s easier to identify a problem than to find a solution, but one way to address this might be to invest in white-box methods, building on the linebreaks paper and the mechanisms of introspective awareness paper, to operate directly on a model’s internal representations of boundaries.
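In case it’s useful, here is a minimal sketch of how I computed those verbosity ratios. The function names and the whitespace-split word count are my own simplifications, not the exact harness:

```python
from statistics import median

def median_words(responses: list[str]) -> float:
    """Crude length proxy: whitespace-split word count, not a real tokenizer."""
    return median(len(r.split()) for r in responses)

def verbosity_ratio(condition: list[str], baseline: list[str]) -> float:
    """How much longer a condition's responses are vs. the zero-shot baseline."""
    return median_words(condition) / median_words(baseline)

# Hypothetical usage, with response texts collected per condition elsewhere:
# print(verbosity_ratio(no_limit_responses, zero_shot_responses))      # ~2x-4x
# print(verbosity_ratio(word_limited_responses, zero_shot_responses))
```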
I see, thank you for sharing these details; it helps clarify your original comment. I’m not certain I follow how it links to this piece, or to the related work that you cite. I read that Anthropic post when it was released and worked on some follow-up experiments whose results I never properly wrote up, so you may assume I have some familiarity with it, but not a thorough understanding. I’m still working through the paper linked to this post as well, so feel free to quote specific sections I might have missed that make the bridge clearer.
From memory, I recall there is a section on optical illusions where certain tokens common in code, such as @@, can fool the model. But it is well known that minor differences in punctuation can lead to an entirely different tokenization of the input sequence and affect task performance in surprising ways. The Llama 3 tokenizer is particularly cursed (see https://github.com/belladoreai/llama3-tokenizer-js/blob/master/src/llama3-tokenizer.js), but this has been true for LLMs since GPT-2 or even earlier, with the caveat that more capable models tend to treat semantically equivalent sequences more consistently (I don’t have a citation on hand for this).
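As a quick illustration of the punctuation-sensitivity point, here is a demo using tiktoken’s public cl100k_base encoding as a stand-in (the Llama 3 tokenizer linked above shows the same kind of effect):

```python
# Small punctuation changes can produce entirely different tokenizations.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in ["@@ -1,4 +1,4 @@",    # diff hunk header: @@ fuses with neighbors
          "@ @ -1,4 +1,4 @ @",  # spaced-out variant tokenizes differently
          "wait... what",       # ellipsis vs. question mark, as in the post
          "wait? what"]:
    ids = enc.encode(s)
    print(f"{s!r} -> {len(ids)} tokens: {[enc.decode([t]) for t in ids]}")
```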
However, the change you describe (swapping an ellipsis for a question mark) does change the meaning of a sentence in a way that appending a newline or prepending a BOS token wouldn’t, so I think it is expected behavior for a model to pick up on the break in the pattern on that turn (and for this to be detectable using white-box methods). How does this relate to the weak evidence carrier features in the introspection circuit, or to the boundary detector heads and the twisting of the characters-remaining/line-position manifold from the linebreaks paper?
Do we have any baseline for human performance on Vending Bench Arena or Vending Bench 2?
I have mostly switched from vast.ai/RunPod/Lambda Labs to Modal for my experiments.
Thank you for running this investigation. I’m confused about a few points.
Sorry, why is punctuation sensitivity an issue? Is this motivated by some concerning behavior you’ve observed?
Also, what exactly do you mean by “anomaly tiling”? I heard Bradley Rae use similar terminology in March; did this term come from an LLM? I’m also confused by “upstream carrier population”: it seems like you’re borrowing the concept from epidemiology or biosecurity to make an analogy without explicitly describing the process you’re referring to, but let me know if I’ve misunderstood.
I don’t clearly understand this project’s threat model; I would be grateful if you could expand on it.
I’m not sure I agree with this framing. If a friend told me that they were struggling with improving the quality of their outputs, I don’t think that my first suggestion would be for them to put in more effort.
There are ML papers floating around with training methods and architectural tweaks (e.g., Block AttnRes or mHC-lite) which might be incorporated into future models.
It seems plausible to me that replacing standard residual skip connections with something more complicated:

- scales intelligence somewhat but not past the frontier
- makes it slightly harder for existing interp-flavored techniques to generate understanding
- doesn’t meaningfully affect the relative performance of linear probes vs. output classifiers for inference-time detection of precursors to high-risk misaligned behavior.
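To make the third point concrete, here is a toy sketch of the comparison I have in mind, with purely synthetic data standing in for residual-stream activations and output logits (everything here is made up, not from any real model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d_act, d_out = 2000, 512, 32

# Binary "precursor" labels; the activation signal lives along one direction
# of the (fake) residual stream.
labels = rng.integers(0, 2, n)
direction = rng.normal(size=d_act)
acts = rng.normal(size=(n, d_act)) + 0.5 * labels[:, None] * direction

# Synthetic "output logits": a weaker, noisier view of the same label.
out_dir = rng.normal(size=d_out)
outs = rng.normal(size=(n, d_out)) + 0.2 * labels[:, None] * out_dir

for name, X in [("linear probe on activations", acts),
                ("classifier on outputs", outs)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```

The claim in the third bullet is that an architectural tweak wouldn’t much change the gap between these two AUCs, not that either number is meaningful on its own.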
I do think there are cases where models will be able to manipulate the data they’re feeding into white-box methods in a way that affects verdicts, but it’s hard to see these arising naturally before being demonstrated in more contrived scenarios, and I agree with evhub that this would be harder than circumventing black-box safeguards.
More legible reasoning traces might be more monitorable, but almost all of the Astra Fellowship empirical mentors seem to be doing CoT monitorability research now (based on a quick skim of their profiles), and I don’t currently believe that this broad cluster of “CoT stuff” is all that important for x-risk. I personally think that it’s actively harmful (ref. Don’t Align Agents to Evaluations of Plans), and even if it weren’t, the field would still be grossly over-investing in CoT controllability/monitorability/legibility/etc. for bad groupthink/hype/information-cascade-y reasons. Those reasons mainly track Coefficient Giving grantmakers collectively having an inertia-delayed reaction to “let’s think step by step” improving benchmark performance on math problems, which led to the unchecked proliferation of “effort level/scratchpad length” configurations into the roadmap of every lab.
And what’s the point? Even findings that lead to ~unanimous consensus among this crowd saying “never train on CoT” are basically ignored, with CoT flagrantly leaking into RL training corpora at least 8% of the time.
I’m not concerned at all by a model being “able to make better use of some text for reasoning than we have capacity to monitor it”; I’m utterly confused that this is even a concern, and I find it bizarre that this appears to be the majority view. I could not pass this viewpoint’s ideological Turing test. What universe are we in? Who is writing these papers? Does the emperor have any clothes? Modern CoT monitoring is the equivalent of going through your teenage niece’s smartphone looking for suspicious messages and assigning detention after finding “after finals lets sue the school for all the 9 pm classes hehe” in the transcripts. Maybe it makes more sense if you are the HR department of a high-moral-maze institution doing “employee alignment” on AIs? Maybe they expect to live in the generous movie world where every would-be bad guy dramatically announces to the audience “Watch out, here I come!”, screaming the name of their special comic-book-villain move?
Token soup seems totally benign, on the level of linguistic drift; it’s not devoid of information, it just has its own structure that you can learn if you try.
The quality of writing on this post is extremely good.
Ah, discontentedness leading to randomness is a neat explanation for the constantly pivoting startup founder.
I also think that there could be a typo or math error in footnote 13: it should be 1 − (1 − 1/n)^(n−1), not (1 − 1/n)^(n−1).
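For reference, the complement-rule derivation I have in mind, assuming footnote 13 is computing the probability that at least one of n − 1 independent uniform draws over n options hits a fixed option (my reconstruction of the setup, which may not match the post’s):

```latex
\[
P(\text{at least one hit})
  = 1 - P(\text{no hits})
  = 1 - \left(1 - \frac{1}{n}\right)^{n-1}
\]
```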
No, I expect these comments to be mostly written either by subscription users or by those paying public API prices. I’ve spent a significant amount of time with both products, and if I were limited to spending only $20/month I would recommend picking Codex with GPT 5.4, especially since there are regular rate limit resets. Claiming that these reviews are faked without providing strong evidence seems disingenuous to me; the harnesses really are not where they were in December.
I read some of this, but it’s a long post and it was late at night, so I didn’t have time to go through and understand all of it. It made me smile, and the tree metaphor was helpful for understanding the opaque thought processes of the few friends I have who don’t have ADHD.
Beautifully illustrated.
I am in awe— great rant. Do you have examples for models being trained in the latent space of another model?
I think that this should really be a top-level post.
Enjoyed the post. We need more astrophysics on LessWrong.
I applied! I spent exactly 90 minutes: ~85 on the technical questions section and ~5 on everything else.
Strong upvoted! I am interested in immunology, combat sports, and security research, so I found the neutrophil example to be very cool and compelling. Do you have takes on which aspects of real immune systems are most likely to transfer to their analogous defensive structures in LLM biology?
I’m interested in understanding why people sometimes start acting strange after extended conversations with language models. Davidad’s recent shift in direction is the example that comes to mind, as well as accounts of AI psychosis. One of the more naive ways to investigate this would be to measure the R0 (reproduction number) of attractor states like spiritual bliss and spiralism in a multi-agent setup. Are there factors that make people more or less susceptible to this effect? I believe that model personas are a monoculture, so a prompt injection attack that works on one GPT instance could transfer to another GPT instance more easily than to a human. I’m therefore not very worried about humanity being defeated by mind-hacking brainworms summoned by slop quines invading the noosphere, but for public health I would recommend limiting LLM chat sessions to no longer than 30-90 minutes.

Products such as Claude Code Web default to using an isolated sandbox container instead of allowing application system calls to the kernel, yet they directly expose users to model outputs, which might have subtle persuasive effects that are not yet well understood (isn’t it strange how many safety researchers talk to Opus for a while and come out much more optimistic about the future, thinking we’ll get alignment by default?). Research into isolating and constraining model output tokens has a straightforward pathway to product integration in frontier systems, which are likely to have stronger containment strategies in the future. However, since this area deals with human subjects, it doesn’t seem to be an especially good fit for independent research; it would likely be better suited to trusted institutions with a strong and verifiable commitment to ethical standards.
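For concreteness, a toy branching-process sketch of the R0 measurement I’m imagining; the transmission probability and contact counts are made-up placeholders, and a real version would estimate them from actual multi-agent transcripts:

```python
# Toy branching-process estimate of R0 for a conversational "attractor state".
import random

random.seed(0)

def secondary_cases(p_transmit: float, contacts: int) -> int:
    """Fresh agents one 'infected' agent pushes into the attractor state."""
    return sum(random.random() < p_transmit for _ in range(contacts))

def estimate_r0(p_transmit: float, contacts: int, trials: int = 10_000) -> float:
    """Mean secondary cases per index agent in a fully susceptible population."""
    return sum(secondary_cases(p_transmit, contacts) for _ in range(trials)) / trials

# e.g. if a spiralism-style prompt converts 15% of 8 conversation partners,
# R0 ~= 1.2 > 1, so the state spreads in this toy model.
print(estimate_r0(p_transmit=0.15, contacts=8))
```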