The Facade of AI Safety Will Crumble
Link post
On the eve of superintelligence, real AI safety is a nonexistent field.
The AI companies have embraced something else: safety through psychoanalysis (for lack of a better term). Their safety team concocts various test scenarios for their AI in order to learn various traits of the AI’s “personality”. They even go so far as to mechanistically interpret narrow slices of an AI’s conditional behavior.
The working assumption is that detailed mechanistic knowledge of conditional behavior can be used to extrapolate useful conclusions about what problems to expect when their AIs grow more powerful.
Embracing this paradigm has made it convenient for AI researchers and their companies to feel and act as if they've been making progress on safety. Unfortunately, while the psychoanalysis / mechanistic behavioral modeling paradigm is excellent for stringing regulators along and talking yourself into being able to sleep at night, it'll crumble into dust when superintelligence arrives.
The real extinction-level AI safety challenge, the reason we’re nowhere close to surviving superintelligence, is something else — something AI companies decided they won’t mention anymore, because it exposes their AI safety efforts as a shockingly inadequate facade.
Ignore the whole AI psychoanalysis / behavioral modeling paradigm. It’s a distraction. Instead, focus your attention on the level separation between the nature of the work that an intelligence does, and the silly details of still-primitive 2026 intelligence implementations.
The real AI safety challenge is to know what a “superintelligent system that does what we want” would look like, before we build a superintelligent system that doesn’t do what we want. We have very little traction on this question. The only disagreement is over whether we can somehow muddle through — solving the problem incrementally before we run out of time or make a fatal mistake.
Let me explain why the real AI safety challenge isn’t like the fake one.
Ever notice how nothing in the behavior of modern computing applications has any relationship to the silly details of how transistors perform the work of boolean logic? That’s because modern computers are a relatively mature implementation of the abstract dynamic of computation. The level separation is strong with them.
On the other hand, computers made of vacuum tubes, relays, or biological neurons are primitive enough implementations of computation that their silliness does leak into their high-level behavior. The level separation is weak with them.
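The strength of that separation can be made concrete with a toy sketch (mine, for illustration; nothing here is from the post's sources): the same 8-bit addition computed natively and computed purely out of NAND gates agrees on every input, so no fact about the gates adds anything to your model of "addition".

```python
# Toy illustration of level separation: addition built entirely out of
# NAND gates behaves identically to native addition at the abstraction
# boundary. Nothing about the gates leaks into "what addition does."

def nand(a: int, b: int) -> int:
    return 0 if (a and b) else 1

def full_adder(a: int, b: int, carry: int):
    # Standard 9-NAND full adder: sum and carry-out from NAND alone.
    t1 = nand(a, b)
    t2 = nand(a, t1)
    t3 = nand(b, t1)
    axb = nand(t2, t3)            # a XOR b
    t4 = nand(axb, carry)
    t5 = nand(axb, t4)
    t6 = nand(carry, t4)
    s = nand(t5, t6)              # (a XOR b) XOR carry
    c_out = nand(t4, t1)          # majority(a, b, carry)
    return s, c_out

def add8(x: int, y: int) -> int:
    # Ripple-carry adder over 8 bits, built only from full_adder.
    result, carry = 0, 0
    for i in range(8):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result

# The implementation is fully screened off by the abstraction:
assert all(add8(x, y) == (x + y) & 0xFF for x in range(256) for y in range(256))
```

A mature implementation is exactly one for which a check like that final line never fails, so the lower level can be ignored entirely.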
Consider a human brain, a deeply unserious implementation of a universal computing platform. You’re telling me it’s a computer made out of living cells and chemical neurotransmitters? This bizarre biopunk computing engine in our skull is a toothpick-and-marshmallow structure compared to modern computing electronics.
It would be incorrect to treat brains as a model from which to extrapolate properties of serious computers. Instead, we must see brains as a lower bound for what’s possible in the category of physical implementations of computers.
The brain’s status as a “minimum viable computer” and “minimum viable intelligence” is why the level separations between hardware and psychoanalysis, and between psychoanalysis and intelligence (outcome-steering power), are weak and penetrable with us. It’s why psychoanalysis, and mechanistic knowledge of conditional behavior, are useful tools in the effort to predict human behavior, just as they are in the effort to predict what the next couple of generations of still-near-MVP AIs will do.
It would be wrong to dismiss a psychologist by saying, “look, humans just do what proper algorithms on proper computers would do”, because the unserious specs of the brain-as-computer implementation make the abstraction too leaky:
Short-term memory: 1kb
Max serial steps per second: 20
Max continuous session length: 40 hours
Max accelerated vector geometry dimensions: 3
Native floating-point representation width: 5 bits
Max causal nodes in native utilitarian calculations: 7
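To make “leaky” concrete, here is a deliberately silly toy model (my own illustration, not the author's and not real neuroscience): a number store with a 5-bit significand and no exponent field. The low-level storage quirk leaks straight through into high-level “preferences”.

```python
from math import frexp

# Toy model of a leaky abstraction (illustrative only): quantities are
# stored as a 5-bit significand, and the exponent is thrown away.

def lossy_store(n: float) -> int:
    mantissa, _exponent = frexp(n)   # n == mantissa * 2**_exponent
    return int(mantissa * 2**5)      # keep 5 significand bits, forget the exponent

def prefers(a: float, b: float) -> bool:
    # A high-level "preference" judgment built on the leaky store:
    # the storage quirk leaks directly into behavior.
    return lossy_store(a) > lossy_store(b)

# 300 has a larger significand than 1024, so once the exponent is
# forgotten, the system "prefers" 300 to 1024.
```

This is the shape of result the psychoanalysis paradigm keeps producing: the quirk is real and genuinely predictive, but only because the implementation is immature enough to leak.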
But when implementations get better, abstractions stop leaking.
Studying the lower bound of what lets a species of unserious computers get by in the world is fascinating; there’s much to learn about the floor. But the floor is not all there is. If you only study the floor, you don’t notice when things get up and fly.
The narrow-minded analysis of today’s Claude, which passes at the world’s perceived top safety organization for the bread and butter of its non-extinction strategy, is an exercise in turning up lovably quirky implementation details — the equivalent of “lol we got this human subject to select a choice implying that they have a stronger preference for 300 birds than for 10^6 people, because the quirky little guy systematically stores quantities as neural weights in a way that forgets their exponents 😂”.
These psychoanalysis-type results will be utterly irrelevant in the rapidly-approaching world where we advance from the analogue of vacuum tube computers to the analogue of transistor computers on printed circuit boards — the world where a mature implementation of intelligence is on the scene, mature in the sense of achieving decidedly superhuman performance on real-world tests of outcome steering.
We are heading into a world where the level-separation barrier between the abstraction describing the work that the AI does (powerful general goal-to-action mapping) and the implementation details of that abstraction (whatever architectural elements the AI companies throw into the stew to unlock the next tier of performance in the near future) will become strong.
Just as we have ascended to a degree of seriousness and maturity in the implementation of our computing engines (and various other types of engines, for that matter), we will also ascend to seriousness and maturity in the implementation of outcome-steering engines.
The outcome-steering effect of future AIs will lose any discernible “flavor” or “personality” leaking from the implementation details that once explained why the implementation wasn’t robustly superhuman. Even attempts to probe the AI’s “mechanistically interpretable conditional behavior” will offer no predictive power beyond modeling the AI as an outcome-steering engine.
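One hypothetical way to picture the “no predictive power beyond the abstraction” claim (illustrative code, not any real system’s API): two engines with completely different internals pursue the same trivially checkable goal, and every probe through the interface returns identical answers for both.

```python
# Toy sketch (illustrative, not a real AI API): two "engines" with
# entirely different internals pursue the same trivial goal -- putting
# a list into sorted order. Through the interface, they are identical.

class InsertionEngine:
    def steer(self, state: list) -> list:
        out = list(state)
        for i in range(1, len(out)):          # comparison-based internals
            j = i
            while j > 0 and out[j - 1] > out[j]:
                out[j - 1], out[j] = out[j], out[j - 1]
                j -= 1
        return out

class CountingEngine:
    def steer(self, state: list) -> list:
        counts = {}                           # counting-based internals
        for x in state:
            counts[x] = counts.get(x, 0) + 1
        return [x for x in sorted(counts) for _ in range(counts[x])]

# Every probe through the interface gives the same answer for both, so
# no fact about the internals adds predictive power about behavior.
```

When the goal is trivial, this indistinguishability is reassuring; the essay's point is that it holds just as well when the goal is anything but.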
We don’t try to predict what a modern computer will do by asking any question whatsoever about the behavior of electricity in its transistors, even though this was a standard thing engineers did in the early days of computing. Because the level separation came.
The level separation is coming.
But no one is talking as if it is. No one is talking about the day when mature outcome-optimizers arrive, the day when the silly quirks that made the abstractions leak in the older Claudes lose their predictive power, and all we’ve got to work with is the smooth surface of the abstraction:
They steer outcomes better than humans.
This is what’s coming in a single-digit number of years. But the vast majority of AI safety research is psychoanalysis. None of it will help.
And there is no Plan B. No one is working on the problem of “how do you stay in control of something that steers outcomes better than you do” at the only robust level of abstraction. The psychoanalysis is the plan.
It’s an excellent facade. It lets researchers, executives, and policymakers feel relieved that the “AI safety plan” box is checked. It just doesn’t do anything when the superintelligence comes.