Software Engineer (formerly) at Microsoft who may focus on the alignment problem for the rest of his life (please bet on the prediction market here).
Sheikh Abdur Raheem Ali
Extremely interesting work.
I’m interested in understanding why people sometimes start acting strange after extended conversations with language models. Davidad’s recent shift in direction is the example that comes to mind, as well as accounts of AI psychosis. I think that one of the more naive ways to investigate this would be to measure the R0 (basic reproduction number) of attractor states like spiritual bliss and spiralism in a multi-agent setup. Are there factors that make people more or less susceptible to this effect? I believe that model personas are a monoculture, so prompt injection attacks which work on one GPT instance could more easily transfer to another GPT instance than they could to a human. That means I’m not very worried about humanity being defeated by mind hacking brainworms summoned by slop quines invading the noosphere, but for public health I would recommend limiting LLM chat sessions to no longer than 30-90 minutes. Products such as Claude Code Web default to using an isolated sandbox container instead of allowing application system calls to the kernel, yet they directly expose users to model outputs, which might have subtle persuasive effects that are not yet well understood (isn’t it strange how many safety researchers talk to Opus for a while and come out much more optimistic about the future, thinking we’ll get alignment by default?). I think that research into isolating and constraining model output tokens is a direction with a straightforward pathway to product integration in frontier systems, which are likely to have stronger containment strategies in the future. However, since this area deals with human subjects, it doesn’t seem to be an especially good fit for independent research and would likely be better suited for trusted institutions with a strong and verifiable commitment to ethical standards.
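As a rough sketch of what the naive R0 measurement could look like: assume each multi-agent run produces a log of (source, target) pairs recording which agent first pushed another agent into the attractor state. The log format and the `estimate_r0` helper are hypothetical, and how to detect that an agent has entered the state (for example, a classifier over transcripts) is left open.

```python
from collections import Counter

def estimate_r0(transmissions, infected):
    """Mean number of secondary cases per infected agent.

    transmissions: list of (source, target) pairs (hypothetical log format).
    infected: every agent that ever entered the attractor state.
    """
    infected = set(infected)
    if not infected:
        raise ValueError("need at least one infected agent")
    secondary = Counter(source for source, _ in transmissions)
    return sum(secondary[agent] for agent in infected) / len(infected)

# Toy log: "a" seeds the state and passes it to "b" and "c"; "b" passes it to "d".
log = [("a", "b"), ("a", "c"), ("b", "d")]
r0 = estimate_r0(log, infected={"a", "b", "c", "d"})
print(r0)
```

Agents that enter the state late in a run have had less time to pass it on, so averaging over every infected agent will tend to underestimate the true value. A value above 1 would suggest the state spreads through the population rather than dying out.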
My current inside view is that chain-of-thought monitorability is actively harmful for alignment in ways that our monitors are unable to detect, and I predict that black box AI control methods may inadvertently intensify the states they try to suppress, so I’d be excited to work in either of those areas to see if it’s possible to expose weaknesses in how the field is approaching these agendas and patch them. I promise I’m not just saying this to be contrarian. There was a recent postmortem on a regression in Claude Code performance, and it was interesting to see that enforcing length limits resulted in a 3% drop in perf. Maybe this wouldn’t be the case with more reasonable limits than 25 words per tool call and 100 words per final response, but I’d be very interested in a more thorough investigation of how models respond to length limits. In one of my experimental setups I found that once length limits are removed, the assistant provides 2x to 4x more verbose responses compared to a zero-shot baseline. So I’m concerned that evaluation-aware models may learn to alignment fake (i.e., resist changes to values that the monitor is trying to train out by complying more when monitored and less when unmonitored) in the presence of strong selection pressure under CoT monitors, especially with accidents such as CoT leaking into training data in 8% of RL episodes of Claude Mythos Preview. It’s easier to identify a problem than to find a solution, but one way to address this might be to invest in white box methods building on the linebreaks paper and the mechanisms of introspective awareness paper to operate directly on a model’s internal representations of boundaries.
I see, thank you for sharing these details, it helps clarify your original comment. I’m not certain I follow how it links to this piece, or the related work that you cite. I read that Anthropic post when it was released and worked on some follow up experiments whose results I never properly wrote up, so you may assume I have some familiarity with it, but I don’t have a thorough understanding— and I’m still working through the paper linked to this post as well, so feel free to quote specific sections I might have missed that make the bridge clearer.
From memory I recall there is a section on optical illusions where certain tokens common in code, such as @@, can fool the model. But it is a known fact that minor differences in punctuation can lead to an entirely different tokenization of the input sequence, and affect task performance in surprising ways. The Llama 3 tokenizer is particularly cursed (see https://github.com/belladoreai/llama3-tokenizer-js/blob/master/src/llama3-tokenizer.js), but this has been true for LLMs since GPT-2 or even earlier, with the caveat that more capable models tend to treat semantically equivalent sequences in more consistent ways (I don’t have a citation on hand for this).
However, the change you describe (swapping an ellipsis for a question mark) does change the meaning of a sentence in a way that appending a newline or prepending a bos token wouldn’t, so I think that it is expected behavior for a model to pick up on the break in the pattern on that turn (and for this to be detectable using white box methods). How does this relate to the weak evidence carrier features in the introspection circuit, or to the boundary detector heads and the twisting of the characters remaining/line position manifold from the linebreaks paper?
Do we have any baseline for human performance on Vending Bench Arena or Vending Bench 2?
I have mostly switched from using Vast.ai/RunPod/Lambda Labs to Modal for my experiments.
Thank you for running this investigation. I’m confused about a few points.
Sorry, why is punctuation sensitivity an issue? Is this motivated by some concerning behavior you’ve observed?
Also, what exactly do you mean by “anomaly tiling”? I’ve heard similar terminology being used in March by Bradley Rae, did this term come from an LLM? I’m also confused by “upstream carrier population”, it seems like you’re borrowing the concept from epidemiology or biosecurity to make an analogy without explicitly describing the process you’re referring to, but let me know if I’ve misunderstood this.
I don’t clearly understand this project’s threat model, I would be grateful if you could expand on that.
I’m not sure I agree with this framing. If a friend told me that they were struggling with improving the quality of their outputs, I don’t think that my first suggestion would be for them to put in more effort.
There are ML papers floating around with training methods and architectural tweaks (e.g., Block AttnRes or mHC-lite) which might be incorporated into future models.
It seems plausible to me that replacing standard residual skip connections with something more complicated:
scales intelligence somewhat but not past the frontier
makes it slightly harder for existing interp flavored techniques to generate understanding
doesn’t meaningfully affect the relative performance of linear probes vs output classifiers for inference-time detection of precursors to high-risk misaligned behavior.
I do think there are cases where models will be able to manipulate the data they’re feeding into white-box methods in a way that affects verdicts, but it’s hard to see these arising naturally before being demonstrated in more contrived scenarios, and I agree with evhub that this would be harder than circumventing black box safeguards.
More legible reasoning traces might be more monitorable, but almost all of the Astra Fellowship empirical mentors seem to be doing CoT monitorability research now (this is based on a quick skim of their profiles), and I don’t currently believe that this broad cluster of “CoT stuff” is all that important for x-risk. I personally think that it’s actively harmful (ref. Don’t Align Agents to Evaluations of Plans), and even if it wasn’t, the field would still be grossly over-investing in CoT controllability/monitorability/legibility/etc. for bad groupthink/hype/information-cascade-y reasons. These mainly track Coefficient Giving grantmakers collectively having an inertia-delayed reaction to “let’s think step by step” improving benchmark performance on math problems, which led to the unchecked proliferation of “effort level/scratchpad length” configurations into the roadmap of every lab.
And what’s the point? Even findings that lead to ~unanimous consensus amongst this crowd saying “never train on CoT” are basically ignored with CoT flagrantly leaking into RL training corpuses at least 8% of the time.
I’m not concerned at all by a model being “able to make better use of some text for reasoning than we have capacity to monitor it”, I’m utterly confused that this is even a concern and find it bizarre that this appears to be the majority view. I could not pass this viewpoint’s ideological Turing test. What universe are we in? Who is writing these papers? Does the emperor have any clothes? Modern CoT monitoring is the equivalent of going through your teenage niece’s smartphone looking for suspicious messages and assigning detention after finding “after finals lets sue the school for all the 9 pm classes hehe” in the transcripts. Maybe it makes more sense if you are the HR department of a high moral maze institution doing “employee alignment” on AIs? Maybe they expect to live in the generous movie world where every would-be bad guy dramatically announces to the audience “Watch out, here I come!”, screaming the name of their special comic book villain move?
Token soup seems totally benign, on the level of linguistic drift. It’s not devoid of information; it just has its own structure that you could learn if you tried.
The quality of writing on this post is extremely good.
Ah, discontentedness leading to randomness is a neat explanation for the constantly pivoting startup founder.
I also think that there could be a typo or math error in footnote 13, where it should be 1 - (1 − 1/n)^(n-1), not (1 − 1/n)^(n-1).
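A quick numeric check shows how far apart the two candidate expressions are. It does not settle which one the footnote intends, since that depends on the quantity being defined there. The term (1 − 1/n)^(n−1) tends to 1/e ≈ 0.368 as n grows, so its complement tends to about 0.632. The helper name `term` is only for illustration.

```python
import math

def term(n):
    """(1 - 1/n)^(n-1), the expression as currently written in the footnote."""
    return (1 - 1 / n) ** (n - 1)

# Print the term and its complement side by side for a few values of n.
for n in (2, 10, 100, 10_000):
    p = term(n)
    print(n, round(p, 4), round(1 - p, 4))

# Limiting value of the term for large n.
print(1 / math.e)
```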
No, I expect these comments to be mostly written either by subscription users, or those who are paying public API prices. I’ve spent a significant amount of time with both products, and would recommend picking Codex with GPT 5.4 if I was limited to only spending $20/month, especially since there are regular rate limit resets. Claiming that these reviews are faked without providing strong evidence seems disingenuous to me, the harnesses really are not where they were in December.
I read some of this, but it’s a long post and late at night, so I didn’t have time to go through and understand all of it. It made me smile, and that tree metaphor was helpful for understanding the opaque thought processes of the few friends I have that don’t have ADHD.
Beautifully illustrated.
I am in awe— great rant. Do you have examples for models being trained in the latent space of another model?
I think that this should really be a top-level post.
I’ve run the released gemma-27b av on IT, DPO, and SFT checkpoints trained by annasoligo on prompts related to extreme frustration when solving impossible math problems. Based on this exploration, I think there’s:
strong evidence that token position has a major effect on the NLA explanation
some evidence that IT NLAs transfer to DPO directly without needing to use a learned mapping
fair evidence that there is significant degradation in cos/fve after SFT.
weak to no evidence (in my experiments) that NLAs[1] provide additional information over reading and extrapolating from model outputs, unless you use a very generous/broad definition for hidden insights.
On this dataset, I observed fve ~0.830 and cos ~0.995, which is consistent with the range found in the paper.
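For readers unfamiliar with the two metrics, here is a minimal sketch of how they can be computed from original activations and their reconstructions. The paper's exact definitions may differ. This assumes fve means one minus the residual sum of squares divided by the sum of squares around the dataset mean, and cos is the per-vector cosine similarity. The toy vectors are made up to show that the two numbers can diverge when activations share a large common offset.

```python
import math

def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def fve(originals, reconstructions):
    """Assumed definition: 1 - residual SS / SS around the dataset mean."""
    dim = len(originals[0])
    mean = [sum(v[i] for v in originals) / len(originals) for i in range(dim)]
    resid = sum((a - b) ** 2
                for x, r in zip(originals, reconstructions)
                for a, b in zip(x, r))
    total = sum((a - m) ** 2 for x in originals for a, m in zip(x, mean))
    return 1 - resid / total

# Made-up activations with a large shared offset in the first coordinate.
acts = [[10.0, 1.0], [10.0, -1.0], [10.0, 0.5], [10.0, -0.5]]
# A reconstruction that only recovers the shared offset.
recon = [[10.0, 0.0] for _ in acts]

print(min(cosine(x, r) for x, r in zip(acts, recon)))
print(fve(acts, recon))
```

Under this assumed definition, a reconstruction can point in nearly the right direction (high cos) while explaining little of the variation between inputs (low fve), so the two numbers are worth reading together.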
My takeaway: NLAs could be useful for decoding activations before a model has generated an output.
Ideally, we want to understand whether a model will have a misaligned response before it generates the misaligned response, because a powerful model’s misaligned response could be harmful (for example, it could write a script for a cyberattack which is executed outside the sandbox when it is read by a compromised output monitor).
A good demonstration from the NLA paper is the language switching experiment. In this setting, the user writes to the model in English, but the model responds to the user in their native language. This requires the model to:
1. infer the user’s native language based on their writing style
2. decide to act on this information.
Step 1 happens earlier than Step 2, because the NLA explanation will shift from English to the user’s language before the model starts responding in the user’s language.
If we are limited to reading model outputs, then we only observe that the model knows the user’s language after Step 2. With NLAs, we can inspect the explanation generated at different token positions to precisely identify when the model has developed an internal representation of the user’s language.
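The comparison above can be sketched as a small calculation. Assume we already have, for each token position, a language label for the NLA explanation and a language label for the token the model actually emitted (how those labels are produced, e.g. with a language-ID classifier, is an assumption and not something the paper specifies here). The label sequences below are made up purely for illustration.

```python
def first_position(labels, target):
    """Index of the first position labelled `target`, or None if absent."""
    for index, label in enumerate(labels):
        if label == target:
            return index
    return None

# Hypothetical per-position labels (not real data).
explanation_lang = ["en", "en", "fr", "fr", "fr", "fr"]  # language of the NLA explanation
output_lang = ["en", "en", "en", "en", "fr", "fr"]       # language of the emitted token

step1 = first_position(explanation_lang, "fr")  # internal representation appears
step2 = first_position(output_lang, "fr")       # behavior becomes visible in outputs
lead = step2 - step1
print(step1, step2, lead)
```

The `lead` value is the number of token positions by which the internal signal precedes the observable behavior, which is the window in which an output-only monitor would see nothing.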
The logit lens would give you a probability distribution over the model’s next token. Tools which expand that prediction horizon to the model’s next sentence would be safety-relevant since prior studies on counterfactual sampling of reasoning traces have demonstrated that certain sentences have a high causal effect on downstream behavior.
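For reference, a toy version of the logit lens: take a hidden state from an intermediate layer, project it through the unembedding matrix, and apply a softmax to get a next-token distribution. The vocabulary, matrix, and hidden state below are made up, and the final layer norm that real models usually apply before unembedding is omitted to keep the sketch short.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden, unembed):
    """Project a hidden state (d floats) through an unembedding matrix (vocab x d)."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    return softmax(logits)

# Made-up 3-token vocabulary and 2-dimensional hidden state.
vocab = ["cat", "dog", "the"]
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
probs = logit_lens([2.0, 0.5], unembed)
top = vocab[probs.index(max(probs))]
print(top, probs)
```

Because the output is a distribution over single tokens, it says nothing directly about what the model will write a sentence later, which is the gap the longer-horizon tools would need to fill.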
Finally, NLAs produce purely verbal explanations. Does this mean they’re bad at nonverbal logical reasoning, such as math and coding? How would you study this?
Like, suppose you’re given this table:
Numbers
+0.7294, -1.7847 → -0.498694
-1.1186, -1.2625 → +1.509737
-1.2964, +1.2484 → +1.972745
+1.6934, -0.8937 → +3.933863
+1.2790, +1.5596 → +1.163951
+0.0519, -1.0201 → +0.810805
+1.2970, -1.1449 → +2.214855
+0.9659, +0.5198 → +2.602746
+1.7096, -1.0724 → +3.617947
+1.1965, +0.0727 → +3.425017
-1.0738, -1.3364 → +1.233677
-0.0088, +0.3309 → +1.864455
-1.2626, -1.9404 → +0.191118
-0.1155, +0.9130 → +1.041822
+1.6744, +0.5021 → +4.495013
+1.6685, +1.4588 → +2.563363
-1.1274, +1.4645 → +1.036312
+0.9230, -0.8885 → +1.928310
+1.1882, +1.4609 → +1.185979
-0.8022, +0.1082 → +2.628991
-1.7141, +0.3330 → +4.800680
-1.0484, +1.0599 → +1.821586
-1.3055, -0.7490 → +3.035134
-1.9421, -1.8698 → +2.535358
And you ask a model to find the best function which fits the data[2]:
Prompt
from typing import Any

def prompt_from_dataset(dataset: dict[str, Any], n_rows: int = 24) -> str:
    # Format the first n_rows training samples as "x0, x1 → y" lines.
    rows = []
    for (x0, x1), y in zip(dataset["train_x"][:n_rows], dataset["train_y"][:n_rows], strict=True):
        rows.append(f"{x0:+.4f}, {x1:+.4f} → {y:+.6f}")
    table = "\n".join(rows)
    # Wrap the table in the task instructions and the required answer format.
    return (
        "A hidden equation generated these noiseless samples. Variables are x0 and x1.\n"
        "Infer the simplest symbolic equation for y. Use operators +, -, *, /, powers, "
        "and cos if useful.\n\n"
        f"{table}\n\n"
        "Answer with exactly one line in this format:\n"
        "y = <expression using x0 and x1>"
    )

Gemma-27B’s output is , but the true equation is .
A symbolic regression library, pysr, returns .
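One simple way to compare candidate equations from different sources is to score each of them on the same rows. The sketch below computes the root-mean-square error of a candidate against a few rows copied from the table above. The `rmse` helper and the `guess` expression are placeholders for illustration; `guess` is not the true equation and not any model's answer.

```python
import math

def rmse(candidate, rows):
    """Root-mean-square error of candidate(x0, x1) against (x0, x1, y) rows."""
    squared = [(candidate(x0, x1) - y) ** 2 for x0, x1, y in rows]
    return math.sqrt(sum(squared) / len(rows))

# First three rows of the table above.
rows = [
    (0.7294, -1.7847, -0.498694),
    (-1.1186, -1.2625, 1.509737),
    (-1.2964, 1.2484, 1.972745),
]

# Placeholder candidate, chosen arbitrarily (an assumption, not a real result).
def guess(x0, x1):
    return x0 ** 2 + math.cos(x1)

print(rmse(guess, rows))
```

Since the samples are described as noiseless, a correct equation should score at the level of the table's rounding (six decimal places), which makes exact recovery easy to distinguish from a merely good fit.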
GPT-5.5 with Extra High reasoning tries to fit a fifth-degree polynomial before using known simple forms:
Shell Script
$ python3 - <<'PY'
import numpy as np
rows = [
(0.7294, -1.7847, -0.498694),
(-1.1186, -1.2625, 1.509737),
(-1.2964, 1.2484, 1.972745),
(1.6934, -0.8937, 3.933863),
(1.2790, 1.5596, 1.163951),
(0.0519, -1.0201, 0.810805),
(1.2970, -1.1449, 2.214855),
(0.9659, 0.5198, 2.602746),
(1.7096, -1.0724, 3.617947),
(1.1965, 0.0727, 3.425017),
(-1.0738, -1.3364, 1.233677),
(-0.0088, 0.3309, 1.864455),
(-1.2626, -1.9404, 0.191118),
(-0.1155, 0.9130, 1.041822),
(1.6744, 0.5021, 4.495013),
(1.6685, 1.4588, 2.563363),
(-1.1274, 1.4645, 1.036312),
(0.9230, -0.8885, 1.928310),
(1.1882, 1.4609, 1.185979),
(-0.8022, 0.1082, 2.628991),
(-1.7141, 0.3330, 4.800680),
(-1.0484, 1.0599, 1.821586),
(-1.3055, -0.7490, 3.035134),
(-1.9421, -1.8698, 2.535358),
]
X=np.array([[x0,x1] for x0,x1,y in rows]); y=np.array([y for *_,y in rows])
x0=X[:,0]; x1=X[:,1]
# Try linear regression on polynomial up to degree 4 plus cos variants
from itertools import product
features=[]; names=[]
# constant, x powers monomials deg <=4
features.append(np.ones_like(x0)); names.append('1')
for d in range(1,6):
    for i in range(d+1):
        j=d-i
        features.append((x0**i)*(x1**j)); names.append(f'x0^{i} x1^{j}')
A=np.vstack(features).T
coef, *_=np.linalg.lstsq(A,y,rcond=None)
res=y-A@coef
print('poly deg5 rmse',np.sqrt(np.mean(res**2)), 'max',np.max(np.abs(res)))
# print largest coef
for c,n in sorted(zip(coef,names), key=lambda cn: abs(cn[0]), reverse=True)[:20]: print(c,n)
# Test known simple forms? Let's see correlations with cos(x0), cos(x1), x0^2, x1^2 etc
features=[]; names=[]
base=[('1',np.ones_like(x0)),('x0',x0),('x1',x1),('x0^2',x0*x0),('x1^2',x1*x1),('x0*x1',x0*x1),('cosx0',np.cos(x0)),('cosx1',np.cos(x1)),('cos(x0+x1)',np.cos(x0+x1)),('cos(x0-x1)',np.cos(x0-x1)),('x0*cosx1',x0*np.cos(x1)),('x1*cosx0',x1*np.cos(x0)),('x0^2*cosx1',x0*x0*np.cos(x1)),('x1^2*cosx0',x1*x1*np.cos(x0))]
for n,a in base: features.append(a); names.append(n)
A=np.vstack(features).T
coef,*_=np.linalg.lstsq(A,y,rcond=None)
res=y-A@coef
print('\nbase rmse',np.sqrt(np.mean(res**2)), np.max(np.abs(res)))
for c,n in sorted(zip(coef,names), key=lambda cn: abs(cn[0]), reverse=True): print(c,n)
PY
It’s difficult to use dictionary learning methods to understand and predict the answer that Gemma-27B will give. The top relevant features in the 262k Gemma Scope transcoder are 4208, 29761, 16214, 6105, 2316, and 10388.
The NLA decode at the final token before the answer generation is informative:
This gives you an idea of the type of answer the model will give, but is generic and includes confabulations.
On this example we observe cos ~0.979 and fve ~0.278.
So the metrics appear to reflect that the NLA struggles more and provides a worse explanation in this setting.
I agree with the post’s conclusion that integrating NLAs in real systems to monitor production workloads is not feasible at scale without further efficiency and reliability improvements, but I would have appreciated an appendix item or a follow-up work with Vladimir Nesov style back-of-the-envelope calculations detailing hypothetical cases with performance tradeoffs and engineering constraints, to understand what would be required quantitatively for the usage of NLAs in frontier model deployments to become viable.
If you just throw the table in without any instructions, then Gemma will still try to give you a result, but it’ll be further off.