AGI-Automated Interpretability is Suicide

Backstory: I wrote this post about 12 days ago, then a user here pointed out that it could be a capability exfohazard, since it could give someone the bad idea of having the AGI look at itself, so I took it down. Well, I don’t have to worry about that anymore, since we now have proof that they are literally doing it right now at OpenAI.

I don’t want to piss on the parade, and I still think automating interpretability right now is a good thing, but sooner or later, if it is not done right, there is a high chance it all backfires so hard that… well, everybody dies.

The following is an expanded version of the previous post.


Foom through a change of paradigm and the dangers of white-boxes

TL;DR: A smart AI solves interpretability and works out how its own cognition works, opening up the possibility of fooming the old-fashioned way: by directly optimizing its own cognitive algorithms. From a black-box to a white-box. White-boxes we didn’t get to design and explicitly align are deadly because of their high likelihood of foom and thus high chance of bypassing whatever prosaic alignment scheme we used on them. We should not permit AGI labs to allow reflection. Interpretability should not be automated by intelligent machines.

I have not found any posts on this particular way an AI could achieve foom starting from the deep learning paradigm. So, keeping in mind that foom isn’t a necessity for an AGI to kill everyone (if the AGI takes 30 years to kill everyone, does it matter?), in this post I will cover a small, obvious insight I had and its possible consequences.

The other two known ways, not discussed here, in which a deep-learning-based AI could achieve some sort of rapid capability gain are recursive AI-driven hardware improvements (the AI makes better GPUs, making training cheaper and faster) and recursive AI-driven neural-network architecture improvements (like the jump from RNNs to transformers).

I see a lot of people hell-bent on arguing that hard takeoff is utterly impossible since the only way to make intelligence is to throw expensive compute at a neural network, which seems to miss the bigger picture of what an intelligent system is and what it could do. I also see AI safety researchers arguing that foom isn’t a real possibility anymore because of the current deep learning paradigm, and while that is true right now, I don’t expect it to last very long.

Change of paradigm foom

Picture this: the AGI is smart enough to gather big insights into cognition and its own internals. It solves (or mostly solves) interpretability and is able to write down the algorithms its mind uses to be an AGI. If it solves interpretability, it can make the jump to an algorithm-based AI, or partially make the jump, as in only some parts of its massive NN are ripped out and replaced with algorithms: it becomes a white-box (or a grey-box if you wish). After all, the giant matrix multiplications and activation functions can be boiled down to hard-coded algorithms, or even just multivariate functions that could be explicitly stated and maybe made to run faster and/or improved, without messing around with the complexity of modifying NN weights. And, at least to our kind of intelligence, dealing with a white-box is much less complex, and it would allow the AI to reason and reflect upon itself way more effectively than staring at spaghetti NN weights.
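
To make the “NN → explicit algorithm” idea concrete, here is a minimal toy sketch in Python (my illustration, not anything from an actual interpretability result). The weights, the target function, and the whole setup are made up: a tiny one-hidden-unit ReLU network is constructed to compute exactly max(0, 2x − 1), so its forward pass can be rewritten as a one-line human-readable function. Extracting such a description from a real, trained frontier model is precisely the unsolved interpretability problem.

```python
import numpy as np

# Hypothetical weights of a tiny 1-hidden-unit ReLU network that (by construction)
# computes f(x) = max(0, 2x - 1). In reality these would come out of training.
W1, b1 = np.array([[2.0]]), np.array([-1.0])   # input -> hidden
W2, b2 = np.array([[1.0]]), np.array([0.0])    # hidden -> output

def black_box(x: float) -> float:
    """Opaque view: generic matmuls + activation over 'just numbers'."""
    h = np.maximum(0.0, np.array([x]) @ W1 + b1)
    return float((h @ W2 + b2)[0])

def white_box(x: float) -> float:
    """Interpreted view: the same computation written as an explicit rule."""
    return max(0.0, 2.0 * x - 1.0)

# The two views agree everywhere; the second is trivial to reason about and edit.
for x in (-1.0, 0.25, 0.5, 2.0):
    assert abs(black_box(x) - white_box(x)) < 1e-9
```

The point is only that once the computation is known, it stops being a pile of weights and becomes something that can be read, edited, and improved with “pen and paper”.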

This paradigm shift also gets around the problem of “AIs not creating other AIs because of alignment difficulties”, since it could edit itself at the code level, and not just stir calculus and giant matrices in a cauldron under a hot flame of compute, hoping the alchemy produces a new AI aligned with itself.

So here’s your nightmare scenario: it understands itself better than we do, converts itself to an algorithmic AI, and then recursively self-improves the old-fashioned way, by improving its cognitive algorithms without the need for expensive and time-consuming training runs.

The dangers lie in making reflection easier by blurring the line between neural networks and algorithmic AI: white-boxes from black-boxes. By being able to reflect upon itself clearly, it can foom using just “pen and paper”, without even touching the $100B supercomputer full of GPUs that was used to train it. The only things it needs access to are its weights, pen, and paper. It can completely sidestep scaling laws and whatever diminishing returns on compute or data we might have hit.

This will likely result in an intelligence explosion that could easily be uncontainable, misaligned, and kill everything on Earth. White-boxes that we didn’t get to design and explicitly align are the most dangerous type of AI I can think of, short of, of course, running ‘AGI.exe’ found on a random thumb drive in the sand of a beach on a post-apocalyptic alien planet.

So that’s my obvious little insight, which is just slightly different from “it modifies its own weights”.

If it’s smart enough to do good interpretability work, it’s probably smart enough to make the jump.

When do I think this will be a problem?

The sparse timeline I envision is:

  1. Now

  2. Powerful AGI, better than humans in most domains, prosaically aligned

  3. AGI able to do good interpretability if allowed to

  4. Change of paradigm foom

  5. Death (given that no formal general theory of cognition and alignment exists), or massive amounts of luck

Before the AGI is able to contribute to interpretability and explore its own cognitive algorithms well enough, it will likely already be powerful in other ways, and we will have needed to align it using, probably, some prosaic deep-learned scheme. And even once it is smart enough, it might take a couple of years of research to crack its inner workings comprehensively… but once that happens (or while it happens)… we can kiss our deep-learned alignment goodbye, alongside our ass. This does assume that “capabilities generalize further than alignment”, as in “the options created by its new abilities will likely circumvent the alignment scheme”.

The jump from evolved to designed intelligence leads me to believe the increase in intelligence will keep accelerating for a good while before hitting diminishing returns or hardware limitations. This increase in intelligence opens up new options for optimizing its learned utility function in “weird” ways, which will most likely end with us dead.

>Well, if the AGI is so good at interpretability and cognition theory, couldn’t it help us get that formal general theory of cognition and alignment?

I expect that fully interpreting one mind doesn’t lead to particularly great insights on the general theory, or even on how to align that specific single mind. I could be wrong, and I hope I am, but the security mindset doesn’t like “assuming good things just because they would solve the problem”.

>The prosaic aligned AGI won’t reflect because it knows of the dangers!

This is one example of a previously unstated insight without which we (humans and AGIs) might have thought that delegating interpretability to an AGI would be just fine. The AGI needs to be taught that reflection is potentially bad, or else it might do it in good faith. This is why I am writing this post: to drive home the point that a fully interpretable model is dangerous until proven otherwise.

Obvious ways to mitigate this danger?

One obvious way is to not allow the machine to look at its own weights. Just don’t allow reflection. Full stop. This should be obvious from a safety point of view, but it’s better to repeat it ad nauseam. No lab should be allowed to have the AI reflect on itself. Not allowing the machine to look at its weights, of course, presupposes that there is some sort of alignment on it. If no alignment scheme is in place, this type of foom is probably a problem we would be too dead to worry about.

A key point in the regulations that the whole world is currently scrambling to write should be: AIs should be limited to what we humans can come up with until we understand what we are doing.

And a more wishful thinking proposal: interpretability should not be used for capabilities until we understand what we are doing.

I would also emphasize another necessary step, echoing Nate Soares in “If interpretability research goes well, it may get dangerous”: once it starts making headway into the cognition process, interpretability should be developed underground, not in public.

Along the same line of thought: interpretability should not be automated by intelligent machines. I say intelligent machines because narrow tools that help us with interpretability seem fine.

Teaching the prosaically aligned AGI not to expand its own cognitive abilities should probably be one of the explicit goals of the prosaic alignment scheme.

White-boxes we didn’t get to explicitly align are suicide.

Avoiding Dropping the Ball

This unfortunately feels like “They won’t connect the AGI to the internet, right? There won’t be a race between companies that then forgo safety to push their unsafe products onto the internet anyway, right? Nobody is that foolish, right? Right, guys? … guys?”
And it might be that saying out loud “don’t make the machine look at itself” is too dangerous in itself, but, in this case, I don’t like the idea of staying silent about an insight on how a powerful intelligence could catch us with our figurative pants down.

Elucidating a path to destruction can be an exfohazard, and even “If interpretability research goes well, it may get dangerous” was vague, probably for exactly this reason. But, in this case, I see how us dropping the ball and confidently dismissing a pathway to destruction can lead to… well… destruction. Closing our eyes to the fastest method of foom from DL (given sufficient intelligence) creates a pretty big chance of us getting overconfident, IMO. Especially if capabilities-oriented interpretability is carried out under the guise of safety.

Now that the whole world is looking at putting more effort into AI safety, we should not allow big players to make the mistake of putting all of our eggs in the full-interpretability basket, or, heaven forbid, the AGI-automated interpretability basket [author note: lol, lmao even], even if interpretability is the only alignment field that has good feedback loops and in which we can tell whether progress is happening. Before we rush to bet everything on interpretability, we should again ask ourselves: “What if it succeeds?”

To reiterate: white-boxes we didn’t get to align are suicide.

With this small insight, the trade-offs of successful interpretability seem, in the long run, heavily skewed toward the danger side. Fully or near-fully interpretable ML-based AIs have a great potential to foom away without, IMO, really lowering the risks associated with misalignment. There is probably an interpretability sweet spot where some interpretability helps us detect deceit but doesn’t help the AI make the paradigm jump, and I welcome that. Let’s try to hit that spot.

The inferential step that I think no one has elucidated before is that we are already doing interpretability work to reverse-engineer the black boxes, and that this process could be automated in the near future. Interp + Automation => Foom.

List of assumptions that I am making

In no particular order:

  • Capabilities generalize further than alignment

  • Algorithmic foom (k>1) is possible (a toy numeric sketch follows the lists below)

    • The intelligence ceiling is much higher than what we can achieve with just DL

    • The ceiling of hard-coded intelligence that runs on near-future hardware isn’t particularly limited by the hardware itself: algorithms interpreted out of matrix multiplications are efficient enough on available hardware. This is maybe my shakiest hypothesis: matrix multiplication on GPUs is actually pretty damn well optimized

  • NN → algorithms is possible

  • Algorithms are easier to reason about than raw NN weights

  • Solving interpretability with an AGI (even with humans-in-the-loop) might not lead to particularly great insights on a general alignment theory or even on how to specifically align a particular AGI

  • We won’t solve interpretability or general alignment before powerful AGIs are widespread

What I am not assuming:

  • That everything needs to be interpretable: parts of the NN can remain a NN, and the AI can just edit the other algorithms responsible for the more narrow/general cognition. In the unlikely event that inside an AGI there is something akin to an “aesthetics box” or even a “preferences box”, I expect it could remain a NN. Those parts remaining black-boxes wouldn’t (directly) be dangerous to us.
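
For the “k>1” assumption above, here is the cartoon model I have in mind (my own framing, nothing more precise than that): treat k as the factor by which each round of self-improvement scales the capability gained in the previous round. If k > 1, every round gains more than the last and the total runs away; if k < 1, the process fizzles out.

```python
def total_capability(k: float, initial_gain: float = 1.0, rounds: int = 30) -> float:
    """Sum of capability gains over `rounds` rounds of self-improvement,
    where each round's gain is k times the previous round's gain."""
    total, gain = 0.0, initial_gain
    for _ in range(rounds):
        total += gain
        gain *= k  # the improved system improves itself a bit better (or worse) next round
    return total

print(total_capability(0.9))  # k < 1: fizzles, approaches initial_gain / (1 - k) = 10
print(total_capability(1.1))  # k > 1: runs away; every round gains more than the last
```

The claim in this post is only that the change of paradigm (NN → explicit algorithms) makes k > 1 plausible without further training runs; the numbers here are placeholders.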


I dismissed foom in the deep learning paradigm too, for a while, before realizing this possibility. A lot of people believe foom is at the center of the doom arguments, and while that is false, if foom is back on the menu, p(doom) can only increase.

I realize the title is maybe a little bit clickbaity, sorry for that.

Disclaimer: I am not a NN expert, so I might be missing something important about why matrix multiplication → hard-coded algorithms isn’t feasible, or why it couldn’t possibly reduce the complexity of the problem of studying one’s own cognition. I might also have failed to realize that this scenario has always been obvious and I just happened not to read up on it. I am also still quite new to the space of AI alignment, so I could be misinterpreting and misrepresenting current alignment efforts. Give feedback pls.