Having trouble getting Opus 4.7 to guess who I am from a few paragraphs of writing, even to the point of getting my name into a top-10-guesses list. But I was able to get GPT 4.5 to do this a year ago, so that capability might vary author-to-author and model-to-model.
That proposal seems like it’d mitigate the problem somewhat, but even then I think there are nonobvious paths to end up with significant pressure on the CoT. For example, I frequently will ask a question, then watch the CoT summaries to see how the model is approaching the problem, and stop the response and edit my prompt and retry if it’s going down a path I don’t like. If someone has the bright idea to penalize trajectories which resulted in the user cancelling the request partway through, you’ve now got pressure on the CoT.
I think there are a bunch of weird side channels like this, to the point where I’m not sure that companies know how to pay the associated safety tax even if they want to.
I don’t entirely understand how “never, EVER training on the chain of thought” is supposed to work in practice, because:
Models frequently reason outside of thinking blocks.
Reasoning outside of thinking blocks will obviously contribute to the training signal just like any other outside-the-thinking-blocks output.
Cognitive machinery the model develops outside of the thinking blocks can be used within the thinking blocks (and vice versa).
It’s a noble and I think valuable goal, but I find it questionable to treat it as something that must be executed perfectly or else it has no value.
Does “aligned” have to include modeling the entire ecosystem the agent is embedded in (including any other agents in that ecosystem) well enough that its actions won’t have any unanticipated consequences? Is it possible to have an aligned agent that is not the smartest agent in its environment? Is “aligned” a one-place word, or even an only-two-place word?
I think “Claude Code silently auto-updated overnight and the workflow which had been working for me stopped working” is a pretty common experience. On the claude.ai web side of things, the length of the reasoning blocks definitely shifted sometime in the last few weeks, in a way that is not subtle at all. I don’t know if either of those count as “nerfing the model”—strictly speaking they probably don’t—but they definitely both constitute nerfs to the experience of using the model.
I think that post is around 427 upvotes great, yeah. There are well over a hundred posts with between 350 and 500 karma, and this post seems fairly medianish relative to that list. I would not be surprised if some people upvoted it based on the title alone without reading the post, and I think that’s unfortunate, but it was also a pretty good post with a strong thesis backed up by concrete and specific evidence (the latter part of which is IMO often missing).
For reference, posts which had between 350 and 500 karma as of 2026-04-17:
How Does A Blind Model See The Earth? (498 karma)
Eliezer’s Unteachable Methods of Sanity (498 karma)
The ants and the grasshopper (498 karma)
How To Write Quickly While Maintaining Epistemic Rigor (495 karma)
Luck based medicine: my resentful story of becoming a medical miracle (495 karma)
Alignment Faking in Large Language Models (491 karma)
Focus on the places where you feel shocked everyone’s dropping the ball (490 karma)
I would have shit in that alley, too (488 karma)
New Endorsements for “If Anyone Builds It, Everyone Dies” (488 karma)
100 Tips for a Better Life (471 karma)
Significantly Enhancing Adult Intelligence With Gene Editing May Be Possible (469 karma)
The Lens That Sees Its Flaws (469 karma)
The Best Tacit Knowledge Videos on Every Subject (456 karma)
Ugh fields (456 karma)
Counter-theses on Sleep (455 karma)
Current AIs seem pretty misaligned to me (452 karma)
Accountability Sinks (452 karma)
What failure looks like (449 karma)
Things I Learned by Spending Five Thousand Hours In Non-EA Charities (448 karma)
It’s Probably Not Lithium (447 karma)
Generalizing From One Example (445 karma)
You Are Not Measuring What You Think You Are Measuring (442 karma)
Claude 4.5 Opus’ Soul Document (441 karma)
Steering GPT-2-XL by adding an activation vector (441 karma)
Bets, Bonds, and Kindergarteners (439 karma)
What Do We Mean By “Rationality”? (438 karma)
Transformers Represent Belief State Geometry in their Residual Stream (437 karma)
The hostile telepaths problem (436 karma)
The noncentral fallacy—the worst argument in the world? (435 karma)
Turning 20 in the probable pre-apocalypse (432 karma)
Failures in Kindness (432 karma)
How AI Takeover Might Happen in 2 Years (430 karma)
Douglas Hofstadter changes his mind on Deep Learning & AI risk (June 2023)? (430 karma)
HPMOR: The (Probably) Untold Lore (429 karma)
That Alien Message (428 karma)
GPTs are Predictors, not Imitators (427 karma)
Will Jesus Christ return in an election year? (427 karma)
How I got 4.2M YouTube views without making a single video (426 karma)
The Case Against AI Control Research (426 karma)
chinchilla’s wild implications (425 karma)
Expecting Short Inferential Distances (421 karma)
Applause Lights (418 karma)
It Looks Like You’re Trying To Take Over The World (418 karma)
Twelve Virtues of Rationality (418 karma)
(My understanding of) What Everyone in Technical Alignment is Doing and Why (414 karma)
the void (411 karma)
Survival without dignity (409 karma)
Reliable Sources: The Story of David Gerard (409 karma)
Bing Chat is blatantly, aggressively misaligned (408 karma)
Playing in the Creek (404 karma)
Dying Outside (404 karma)
Lies Told To Children (399 karma)
There is way too much serendipity (399 karma)
DeepMind alignment team opinions on AGI ruin arguments (397 karma)
Please don’t throw your mind away (394 karma)
Reflections on six months of fatherhood (394 karma)
How to Ignore Your Emotions (while also thinking you’re awesome at emotions) (392 karma)
Intellectual Hipsters and Meta-Contrarianism (392 karma)
A Fable of Science and Politics (387 karma)
Legible vs. Illegible AI Safety Problems (385 karma)
Reward is not the optimization target (385 karma)
Scope Insensitivity (385 karma)
What Money Cannot Buy (384 karma)
Working hurts less than procrastinating, we fear the twinge of starting (384 karma)
Did Claude 3 Opus align itself via gradient hacking? (383 karma)
Statement on AI Extinction—Signed by AGI Labs, Top Academics, and Many Other Notable Figures (383 karma)
MIRI announces new “Death With Dignity” strategy (382 karma)
My hour of memoryless lucidity (381 karma)
Social Dark Matter (380 karma)
Religion’s Claim to be Non-Disprovable (379 karma)
Lessons I’ve Learned from Self-Teaching (379 karma)
Review: Planecrash (378 karma)
Alignment remains a hard, unsolved problem (378 karma)
Notifications Received in 30 Minutes of Class (378 karma)
Noting an error in Inadequate Equilibria (377 karma)
AI Induced Psychosis: A shallow investigation (377 karma)
Anti-Aging: State of the Art (376 karma)
A Bear Case: My Predictions Regarding AI Progress (376 karma)
A Mechanistic Interpretability Analysis of Grokking (375 karma)
Counterarguments to the basic AI x-risk case (375 karma)
How it feels to have your mind hacked by an AI (374 karma)
Staring into the abyss as a core life skill (373 karma)
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover (373 karma)
How to have Polygenically Screened Children (372 karma)
Taboo “Outside View” (371 karma)
How AI Is Learning to Think in Secret (370 karma)
Hospitalization: A Review (370 karma)
A deep critique of AI 2027’s bad timeline models (369 karma)
Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment (369 karma)
To listen well, get curious (369 karma)
Accounting For College Costs (368 karma)
The Parable of Predict-O-Matic (367 karma)
Generalized Hangriness: A Standard Rationalist Stance Toward Emotions (367 karma)
What’s the short timeline plan? (367 karma)
Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety) (366 karma)
Thoughts on seed oil (366 karma)
LessWrong has been acquired by EA (366 karma)
Four ways learning Econ makes people dumber re: future AI (364 karma)
When Money Is Abundant, Knowledge Is The Real Wealth (364 karma)
My Objections to “We’re All Gonna Die with Eliezer Yudkowsky” (364 karma)
Tsuyoku Naritai! (I Want To Become Stronger) (364 karma)
How To Get Into Independent Research On Alignment/Agency (363 karma)
6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa (362 karma)
Recent AI model progress feels mostly like bullshit (361 karma)
Fucking Goddamn Basics of Rationalist Discourse (360 karma)
Universal Basic Income and Poverty (357 karma)
The Martial Art of Rationality (356 karma)
What Goes Without Saying (356 karma)
You don’t know how bad most things are nor precisely how they’re bad. (356 karma)
VDT: a solution to decision theory (356 karma)
AI found 12 of 12 OpenSSL zero-days (while curl cancelled its bug bounty) (356 karma)
[April Fools’ Day] Introducing Open Asteroid Impact (355 karma)
Paranoia: A Beginner’s Guide (353 karma)
What DALL-E 2 can and cannot do (353 karma)
Optimality is the tiger, and agents are its teeth (353 karma)
Childhoods of exceptional people (351 karma)
Self-Integrity and the Drowning Child (351 karma)
Shallow review of live agendas in alignment & safety (350 karma)
One slightly trollish answer is “someone figures out how to merge minds and this turns out to be a highly desirable thing to do, so most independent observer-moments are before the point where mind-merging is common”.
What are you picturing a “lesion study on GPT” looking like? Naively I imagine something like “train an SAE on the activations at some layer, then determine how often features activate together, turn that into a distance metric, do clustering/dimensionality reduction, then ablate clusters of features and see how behavior changes”. But I don’t know that I’d particularly expect that to show the GPT as being made of many more parts than I actually think it’s made of. But also I don’t have a super clear mental model of how many “parts” a GPT is “made of”, except at the raw mechanical level of layers / attention heads / MLPs / whatever (though I’d expect that ablating a particular layer of a transformer is more analogous to ablating one particular layer of all cortical columns than to lesioning one particular region of the brain).
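To make that concrete, here’s a minimal sketch of roughly what I’m picturing, in PyTorch-flavored Python. Everything here is a hypothetical stand-in (the sae object with encode/decode methods, the layer hook, the cluster IDs), not any particular library’s API:

```python
import torch

def coactivation_distance(feature_acts: torch.Tensor) -> torch.Tensor:
    """feature_acts: [n_tokens, n_features] SAE feature activations.
    Returns an [n_features, n_features] distance matrix based on how often
    pairs of features fire on the same token (1 - Jaccard similarity)."""
    fired = (feature_acts > 0).float()
    co = fired.T @ fired                      # pairwise co-occurrence counts
    freq = fired.sum(dim=0, keepdim=True)     # per-feature firing counts
    jaccard = co / (freq + freq.T - co + 1e-8)
    return 1.0 - jaccard

def make_lesion_hook(sae, cluster_feature_ids):
    """Forward hook that zeroes out one cluster of SAE features ("lesions" it)
    and returns the reconstructed activations. `sae` is a hypothetical object
    with encode()/decode() methods; real SAE libraries differ in the details."""
    def hook(module, inputs, output):
        acts = sae.encode(output)
        acts[:, cluster_feature_ids] = 0.0    # ablate this cluster of features
        return sae.decode(acts)
    return hook

# Sketch of usage: cluster features on the distance matrix (hierarchical
# clustering, UMAP + HDBSCAN, whatever), then lesion one cluster at a time
# and re-run behavioral evals, e.g.:
# model.transformer.h[layer_idx].register_forward_hook(make_lesion_hook(sae, ids))
```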
Yeah, I’m not sure English has a word or even phrase which crisply points to expressing displeasure through the performative withdrawal of previous support (e.g. “performative” in the phrase I used is denotationally correct but has the wrong connotations). Perhaps “conspicuous withdrawal of patronage”? Doesn’t carry any misleading connotations but feels pretty clunky.
If by “gain” you include things like “control of resources even when those resources aren’t clearly tied back to Anthropic”, then stealing cryptocurrency would qualify.
Ehhhhhhh. If I am a Netflix subscriber, and one of the executives says something I don’t like, and I make a post saying that I’m cancelling my subscription as a result, that’s entirely within my rights, but it’s definitely adversarial.
I find that surprising, but expect you to know better than I do there. As such, you’re likely right, especially if we use your definition of “lots of inference compute”.
IMO the actual qualitative step change was finding a way to turn vulnerabilities into exploits, which neither Opus nor Sonnet did, combined with Mythos doing the vulnerability and exploit analysis autonomously, without knowing in advance about the vulnerabilities, and with only very basic scaffolding.
Turning vulnerabilities into exploits is one of those o-ring type tasks where you have to be above the skill floor for all subtasks to end up with a working exploit. Concretely, let’s say a program is missing a bounds check on some stack-allocated variable, and as such you can write arbitrary data to the stack. You should assume that a sufficiently determined attacker can turn that into arbitrary code execution. However, turning a stack buffer overflow into arbitrary code execution is not trivial despite usually being possible. For example, an attacker might take the return-oriented programming approach. Whether Opus could construct a ROP chain would, I think, depend mostly on whether the training included lots of examples of using ropper or some similar tool, and whether the scaffolding made using that tool easy.
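For a sense of what “construct a ROP chain” looks like mechanically, here’s a toy sketch using pwntools. The binary name, padding offset, and call targets are all made up; this is the shape of the workflow, not a working exploit for anything:

```python
from pwn import ELF, ROP, process  # pwntools

# Hypothetical vulnerable binary and overflow offset; placeholders only.
elf = ELF("./vuln_service")
rop = ROP(elf)

# Suppose the missing bounds check lets us overwrite the saved return
# address after 72 bytes of padding. Chain gadgets to call existing code.
OFFSET = 72
rop.call("puts", [elf.got["puts"]])   # e.g. leak a libc address via puts@plt
rop.call("main", [])                  # then return to main for a second pass

payload = b"A" * OFFSET + rop.chain()

io = process("./vuln_service")
io.sendline(payload)
print(rop.dump())                     # human-readable view of the chain
```

Tools like ropper or pwntools do the gadget-hunting; the hard part is knowing which chain to build and adapting when mitigations (stack canaries, ASLR, NX) get in the way.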
There are a bunch of individual components like that, such that even if Opus can often execute each of the steps of transforming a probable vulnerability into a working exploit, the chance of failure compounds and even relatively small improvements in across-the-board robustness can translate to step changes in outcome.
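Toy numbers to illustrate the compounding (the step count and per-step reliabilities are made up):

```python
# If an exploit chain has 8 steps that must all succeed, a modest bump in
# per-step reliability produces a large jump in end-to-end success rate.
n_steps = 8
for per_step in (0.80, 0.90, 0.95):
    print(per_step, round(per_step ** n_steps, 3))
# 0.8 -> 0.168, 0.9 -> 0.43, 0.95 -> 0.663
```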
That said, I expect an appropriately-scaffolded Opus, maybe even Sonnet, could demonstrate that it was possible to do an out-of-bounds write to the stack for CVE-2026-4747 (the FreeBSD Kerberos thingy). And each of the individual steps needed to translate that into a working exploit is something for which an RL environment could be made to exist, if someone chose to make it exist, and could be open-sourced, if someone chose to do so. I expect (and hope) that nobody will, but I expect that the open-source community could replicate at least that fairly mechanical style of exploit generation within the next year if they chose to.
This seems like quite a large reputational risk for not all that much money. As far as I can tell the total demand for zero-days probably adds up to a couple billion dollars at current prices. Lowering the prices would increase demand some, but probably not enough to justify approximately any reputational risk to a company with an $800B+ valuation, especially one that’s trying to go public.
As a sanity check, the NSO group made about $250M/year selling zero-days-as-a-service, and ran into substantial legal pressure.
That sounds about right. A dedicated team of security experts can find a hole in anything, but not in everything (i.e. pre-Mythos, the bottleneck was going from “possible vuln” to “PoC”).
Can they detect the problem, or can they develop a working exploit for the problem? My understanding is that the former is much easier than the latter, and that modern systems make it quite obnoxious (but often not impossible for someone who knows what they’re doing) to go from OOB write to arbitrary code execution.
That’s a really good point. I do still predict that the vulnerabilities Mythos found will mostly turn out to be fairly simple things that could have been found by a reasonably skilled but not world-class programmer who knows the mechanics of common vulnerabilities and is also quite familiar with the specifics of this particular codebase. Which, to be clear, is a bar that zero humans clear for most projects. But at this time I can’t exclude the possibility that Mythos can autonomously find rowhammer-level problems.
Sure. I think it might be poor form to give an exhaustive list of such commits, but here’s a fairly representative sample:
ffmpeg 3e8bec78: buffer overflow
ffmpeg 55bf0e6c: use after free
ffmpeg 39e19693: buffer overflow
libxml2 538b2e38: integer overflow which allows out-of-bounds write
libxml2 edb5f22d: null pointer dereference
libpng 747dd022: null pointer dereference
libplist 6e03a1df: actually not directly pointers-are-hard-related: validate xml structure
Note that the folks doing this were prolific. None of these were particularly impressive in isolation; what’s impressive is the sheer volume of fixes. In pretty much every major library where you’d expect vulns to exist but not have been caught yet due to a lack of eyes, there are 2-5 fixes.
Ehh. A helpful, honest, harmless model is allowed to be good at Diplomacy and other social deception games, to the extent that it can distinguish games from reality.