Former Software Engineer at Microsoft who may focus on the alignment problem for the rest of his life (please bet on the prediction market here).
Sheikh Abdur Raheem Ali
I have mostly switched from using vast.ai/RunPod/Lambda Labs to Modal for my experiments.
Thank you for running this investigation. I’m confused about a few points.
Sorry, why is punctuation sensitivity an issue? Is this motivated by some concerning behavior you’ve observed?
Also, what exactly do you mean by “anomaly tiling”? I heard Bradley Rae use similar terminology in March; did this term come from an LLM? I’m also confused by “upstream carrier population”: it seems like you’re borrowing a concept from epidemiology or biosecurity to make an analogy without explicitly describing the process you’re referring to, but let me know if I’ve misunderstood this.
I don’t clearly understand this project’s threat model, I would be grateful if you could expand on that.
I’m not sure I agree with this framing. If a friend told me that they were struggling with improving the quality of their outputs, I don’t think that my first suggestion would be for them to put in more effort.
There are ML papers floating around with training methods and architectural tweaks (e.g. Block AttnRes or mHC-lite) that might be incorporated into future models.
It seems plausible to me that replacing standard residual skip connections with something more complicated (see the sketch after this list):
scales intelligence somewhat but not past the frontier
makes it slightly harder for existing interp flavored techniques to generate understanding
doesn’t meaningfully affect the relative performance of linear probes vs output classifiers for inference-time detection of precursors to high-risk misaligned behavior.
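To make concrete what “something more complicated” might look like, here’s a minimal PyTorch-style sketch contrasting a plain residual skip connection with a hypothetical learned-gate variant. The gate is my own stand-in for illustration, not the mechanism from Block AttnRes or mHC-lite:

```python
import torch
import torch.nn as nn


class StandardResidualBlock(nn.Module):
    """Plain skip connection: output = x + f(x)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        return x + self.f(x)


class GatedResidualBlock(nn.Module):
    """Hypothetical "more complicated" variant: a learned, input-dependent
    gate mixes the skip path and the transformed path, so the residual
    stream is no longer a simple sum of per-layer contributions."""

    def __init__(self, d_model: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))  # per-dimension gate in [0, 1]
        return g * x + (1 - g) * self.f(x)


x = torch.randn(2, 16, 512)  # (batch, seq, d_model)
print(StandardResidualBlock(512)(x).shape, GatedResidualBlock(512)(x).shape)
```

The only point of the toy gate is that the residual stream stops being a clean sum of per-layer contributions, which is the flavor of change that could make existing interp-style attribution marginally harder while leaving what a linear probe reads at a given layer essentially unchanged.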
I do think there are cases where models will be able to manipulate the data they’re feeding into white-box methods in a way that affects verdicts, but it’s hard to see these arising naturally before being demonstrated in more contrived scenarios, and I agree with evhub that this would be harder than circumventing black box safeguards.
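For anyone who wants the white-box vs black-box distinction spelled out: a white-box detector reads internal activations (e.g. a linear probe on the residual stream), while a black-box one only sees sampled text. A minimal sklearn sketch, with hypothetical file names and shapes:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical cached data: activations at one layer, the sampled output
# texts, and labels marking whether each example shows the precursor behavior.
acts = np.load("activations.npy")                # (n_examples, d_model)
texts = open("outputs.txt").read().splitlines()  # n_examples strings
labels = np.load("labels.npy")                   # (n_examples,)

# White-box: linear probe over internal activations.
probe = LogisticRegression(max_iter=1000).fit(acts, labels)

# Black-box: classifier over features of the output text only.
clf = LogisticRegression(max_iter=1000).fit(
    TfidfVectorizer().fit_transform(texts), labels
)
```

“Manipulating the data fed into the white-box method” here means the model steering its own activations so the probe’s verdict flips, which does seem harder to pull off than just phrasing outputs to slip past the text classifier.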
More legible reasoning traces might be more monitorable, but almost all of the Astra Fellowship empirical mentors seem to be doing CoT monitorability research now (based on a quick skim of their profiles), and I don’t currently believe this broad cluster of “CoT stuff” is all that important for x-risk. I personally think it’s actively harmful (ref. Don’t Align Agents to Evaluations of Plans), and even if it weren’t, the field would still be grossly over-investing in CoT controllability/monitorability/legibility/etc. for bad groupthink/hype/information-cascade-y reasons. Those reasons mainly track Coefficient Giving grantmakers collectively having an inertia-delayed reaction to “let’s think step by step” improving benchmark performance on math problems, which has led to the unchecked proliferation of “effort level/scratchpad length” configurations into the roadmap of every lab.
And what’s the point? Even findings that lead to ~unanimous consensus amongst this crowd saying “never train on CoT” are basically ignored, with CoT flagrantly leaking into RL training corpora at least 8% of the time.
I’m not concerned at all by a model being “able to make better use of some text for reasoning than we have capacity to monitor it”; I’m utterly confused that this is even a concern and find it bizarre that this appears to be the majority view. I could not pass this viewpoint’s ideological Turing test. What universe are we in? Who is writing these papers? Does the emperor have any clothes? Modern CoT monitoring is the equivalent of going through your teenage niece’s smartphone looking for suspicious messages and assigning detention after finding “after finals lets sue the school for all the 9 pm classes hehe” in the transcripts. Maybe it makes more sense if you are the HR department of a high moral maze institution doing “employee alignment” on AIs? Maybe they expect to live in the generous movie world where every would-be bad guy dramatically announces to the audience “Watch out, here I come!”, screaming the name of their special comic book villain move?
Token soup seems totally benign, on the level of linguistic drift; it’s not devoid of information, it just has its own structure that you can learn if you try.
The quality of writing on this post is extremely good.
Ah, discontentedness leading to randomness is a neat explanation for the constantly pivoting startup founder.
I also think that there could be a typo or math error in footnote 13, where it should be 1 - (1 − 1/n)^(n-1), not (1 − 1/n)^(n-1).
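For reference, my reading of the formula’s shape (not something stated in the footnote itself) is that it’s the probability that at least one of n − 1 independent trials, each succeeding with probability 1/n, succeeds. The complement rule then gives

$$P(\text{at least one success}) = 1 - P(\text{no successes}) = 1 - \left(1 - \tfrac{1}{n}\right)^{n-1} \xrightarrow{\;n \to \infty\;} 1 - \tfrac{1}{e} \approx 0.632,$$

so $(1 - 1/n)^{n-1}$ on its own would be the probability that none succeed.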
No, I expect these comments to be mostly written either by subscription users or by those paying public API prices. I’ve spent a significant amount of time with both products, and would recommend picking Codex with GPT 5.4 if I were limited to spending only $20/month, especially since there are regular rate limit resets. Claiming that these reviews are faked without providing strong evidence seems disingenuous to me; the harnesses really are not where they were in December.
I read some of this, but it’s a long post and it was late at night, so I didn’t have time to go through and understand all of it. It made me smile, and the tree metaphor was helpful for understanding the opaque thought processes of the few friends I have who don’t have ADHD.
Beautifully illustrated.
I am in awe; great rant. Do you have examples of models being trained in the latent space of another model?
I think that this should really be a top-level post.
Enjoyed the post. We need more astrophysics on LessWrong.
I applied! I spent exactly 90 minutes: ~85 minutes on the technical questions section and ~5 minutes on everything else.
Strong upvoted! I am interested in immunology, combat sports, and security research, so I found the neutrophil example to be very cool and compelling. Do you have takes on which aspects of real immune systems are most likely to transfer to their analogous defensive structures in LLM biology?
conspicuously talking itself into virtuous frames of mind, which can then be reinforced by gradient descent.
This is interesting. I think something like this is a driving factor for Claudius’s (Sonnet 3.7’s) underperformance on the original VendingBench (I haven’t taken a closer look at VendingBench 2).
The poor generalization of lookup tables is one of the primary reasons I am bearish on retrieval augmented generation.
Clearly written, and I think it points to something real. Thank you.
Do we have any baseline for human performance on Vending Bench Arena or Vending Bench 2?