Here’s a 2022 Eliezer Yudkowsky tweet:
I find this confusing.
Here’s a question: are object-level facts about the world, like “tires are usually black”, encoded directly in the human-created AGI source code?
If yes, then (1) yo, be real, and (2) even if it happened, the source code would wind up with wildly unprecedented length and complexity, so much so that it becomes basically inscrutable, just because it’s a complicated world (e.g. there are lots of types of tires, not all of them are black, they stop looking black when they get muddy, etc. etc.). See Yuxi Liu’s ruthless “obituary” of Cyc.
If no, then the human-created source code must be defining a learning algorithm of some sort. And then that learning algorithm will figure out for itself that tires are usually black etc. Might this learning algorithm be simple and legible? Yes! But that was true for GPT-3 too, which Eliezer is clearly putting in the “inscrutable” category in this tweet.
So what is he talking about?
Here are some options!
Maybe the emphasis is on the words “matrices” and “floating point”? Like, if there were 175 billion arrows defining the interconnections of an unprecedentedly-enormous unlabeled Pearlian causal graph, would we feel better about “mortal humans” aligning that? If so, why? “Node 484280985 connects to Node 687664334 with weight 0.034” is still inscrutable! How does that help?
Maybe the emphasis is on “billion”, and e.g. if it were merely 175 million inscrutable floating-point numbers, then mortal humans could align it? Seems far-fetched to me. What about 175,000 inscrutable parameters? Maybe that would help, but there are a LOT of things like “tires are usually black, except when muddy, etc.”. Would that really fit in 175,000 parameters? No way! The average adult knows 20,000 words!
Maybe the emphasis is on “mortal humans”, and what Eliezer meant was that, of course tons of inscrutable parameters were always inevitably gonna be part of AGI, and of course mortal humans will not be able to align such a thing, and that’s why we’re doomed and should stop AI research until we invent smarter humans. But … I did actually have the impression that Eliezer thinks that the large numbers of inscrutable parameters were a bad decision, as opposed to an inevitability. Am I wrong?
Maybe the emphasis is on “inscrutable”, and he’s saying that 175 billion floating point numbers is fine per se, and the problem is that the field of interpretability to date has not developed to the point where we can, umm, scrute them?
Maybe this is just an off-the-cuff tweet from 2022 and I shouldn’t think too hard about it? Could be!
Or something else? I dunno.
Prior related discussions on this forum: Glass box learners want to be black box (Cole Wyeth, 2025) ; “Giant (In)scrutable Matrices: (Maybe) the Best of All Possible Worlds” (1a3orn, 2023) ; “Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc” (Wentworth, 2022) (including my comment on the latter).
Speaking for myself, dunno if this is exactly what Eliezer meant:
The general rule of thumb is that if you want to produce a secure, complex artifact (in any field, not just computer science), you accomplish this by restricting the methods of construction, not by generating an arbitrary artifact using arbitrary methods and then “securing” it later.
If you write a piece of software in a nice formal language using nice software patterns, proving its security can often be pretty easy!
But if you scoop up a binary off the internet that was not written with this in mind, and you want to prove even minimal things about it, you are gonna have a really, really bad time.[1]
So could there be methods that reliably generate “benign” [2] cognitive algorithms?[3] Yes, likely so!
But are there methods that can take 175B FP numbers generated by unknown slop methods and prove them safe? Much more doubtful.
In fact, it can often be basically completely impossible, even for simple problems!
For example, think of the Collatz Conjecture. It’s an extremely simple statement about an extremely simple system that could easily pop up in a “messy” computational system… and currently we can’t prove it, despite massive amounts of effort pouring into it over the years!
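For a sense of just how simple the system is, here’s a minimal Python sketch of the map in question (purely illustrative; the 10,000-step bound and the range checked are arbitrary choices of mine):

```python
def collatz_step(n: int) -> int:
    """One step of the Collatz map: halve n if even, else 3n + 1."""
    return n // 2 if n % 2 == 0 else 3 * n + 1

def reaches_one(n: int, max_steps: int = 10_000) -> bool:
    """Empirically check whether n hits 1 within max_steps iterations."""
    for _ in range(max_steps):
        if n == 1:
            return True
        n = collatz_step(n)
    return False  # inconclusive: it just hasn't reached 1 *yet*

# Checking small cases is trivial; proving it for *all* n is the open problem.
assert all(reaches_one(n) for n in range(1, 10_000))
```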
What is the solution? Restrict your methods so they never generate artifacts that have “generalized Collatz problems” in them!
As in, it’s tractable for modern humans to prove their “safety”
Probably not encoded as 175B floating point numbers...
This is sloppily presented and false as currently written, and in any case doesn’t support the argument it’s being used for.[1] As a sample illustration of “something” we can prove about it, for all sufficiently large n, at least n^0.84 integers between 1 and n eventually reach 1 once the algorithm is applied to them.[2] There have been other important results on this topic over the years,[3] though of course the conjecture itself hasn’t yet been resolved.
I’m sure there have to be examples out there of “simple” dynamical systems or related mathematical objects that we can’t prove anything interesting about,[4] but this one ain’t it.
Something something local validity is the key to sanity and civilization
Source
Example source
Though, as per the conservation of expected evidence, seeing that the example chosen in support of a thesis is incorrect results in a downwards update on both the validity of the thesis and on its relevance. Likely low in magnitude in this case.
Thanks for pointing out my imprecise statement there! What I meant, of course, is “we can’t prove the Collatz Conjecture” (which is a simple statement about a simple dynamical system), but I wrote something that doesn’t precisely say that, so apologies for that.
The main thing I intended to convey here is that the amount of effort that goes into proving simple things (including the things you mentioned that were in fact proven!) is often unintuitively high to people not familiar with this, and that this happens all over CS and math.
I found Connor’s text very helpful and illuminating!
…But yeah, I agree about sloppy wording.
Instead of “you want to prove even minimal things about it” I think he should have said “you want to prove certain important things about it”. Or actually, he could have even said “you want to have an informed guess about certain important things about it”. Maybe a better example would be “it doesn’t contain a backdoor”: it’s trivial if you’re writing the code yourself, hard for a binary blob you find on the internet. Having access to someone else’s source code helps but is not foolproof, especially at scale (e.g.).
Well, hmm, I guess it’s tautological that if you’re writing your own code, you can reliably not put backdoors in it. There’s no such thing as an “accidental backdoor”. If it’s accidental then you would call it a “security flaw” instead. But speaking of which, it’s also true that security flaws are much easier to detect or rule out if you’re writing the code yourself than if you find a binary blob on the internet.
Or the halting problem: it’s super-easy to write code that will definitely halt, but there are at least some binary blobs for which it is impossible in practice to know or even guess whether it will halt or not.
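To illustrate that contrast, here’s a toy example of my own (not from the original comment):

```python
def definitely_halts(n: int) -> int:
    """Halts for every input, by construction: the loop is bounded by n."""
    total = 0
    for i in range(n):
        total += i
    return total

def who_knows(n: int) -> int:
    """A Collatz-style loop: nobody has proven this halts for every n > 0."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps
```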
…Also, while we’re nitpicking, Connor wrote “175B FP numbers generated by unknown slop methods”. I would have instead said “175B FP numbers generated by gradient descent on internet text” or “175B FP numbers generated by a learning algorithm with no known security-related invariants” or something. (“Unknown” is false and “slop” seems to be just throwing shade (in this context).)
There is such a thing as an accidental backdoor: not properly escaping strings embedded in other strings, like SQL injection, or prompt injection.
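For concreteness, here’s a minimal illustrative sketch of that kind of escaping failure, using Python’s sqlite3 (the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")

def lookup_unsafe(name: str):
    # String interpolation: passing name = "x' OR '1'='1" returns every row,
    # because the input gets interpreted as SQL rather than as data.
    return conn.execute(f"SELECT * FROM users WHERE name = '{name}'").fetchall()

def lookup_safe(name: str):
    # Parameterized query: the input can never be parsed as SQL,
    # so this class of flaw is ruled out by construction.
    return conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()
```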
My impression is that if you walk up to a security researcher and say “hey what do you call the kind of thing where for example you’re not properly escaping strings embedded in other strings, like SQL injection?”, they probably wouldn’t say “oh that thing is called an accidental backdoor”, rather they would say “oh that thing is called a security vulnerability”.
(This is purely a terminology discussion, I’m sure we agree about how SQL injection works.)
I guess “backdoor” suggests access being exclusive to person who planted it, while “vulnerability” is something exploitable by everyone? Also, after thinking a bit more about it, I think you’re right that “backdoor” implies some intentionality, and perhaps accidental backdoor is an oxymoron.
And in particular, Collatz has been confirmed for all numbers up to 2^71. So if it turns up in a context of 64-bit integers, we know it holds.
As I understand it, the initial Yudkowskian conception of Friendly AI research[1] was for a small, math- and science-inclined team that’s been FAI-pilled to first figure out the Deep Math of reflective cognition (see the papers on Tiling Agents as an illustrative example: 1, 2). The point was to create a capability-augmenting recursive self-improvement procedure that preserves the initial goals and values hardcoded into a model (evidence: Web Archive screenshot of the SingInst webpage circa 2006). See also this:
Then you would figure out a way to encode human values into machine code directly, compute (a rough, imperfect approximation of) humanity’s CEV, and initialize a Seed AI with a ton of “hacky guardrails” (Eliezer’s own term) aimed at enacting it. Initially the AI would be pretty dumb, but:
we would know precisely what it’s trying to do, because we would have hardcoded its desires directly.
we would know precisely how it would develop, because our Deep Mathematical Knowledge about agency and self-improvement would have resulted in clear mathematical proofs of how it will preserve its goals (and thus its Friendliness) as it self-improved.
the hacky guardrails would ensure nothing breaks at the beginning, and as the model got better and its beliefs/actions/desires coherentized, the problems with the approximation of CEV would go away.
So the point is that we might not know the internals of the final version of the FAI; it might be “inscrutable.” But that’s ok, they said, because we’d know with the certainty of mathematical proof that its goals are nonetheless good.
From there on out, you relax, kick back, and plan the Singularity after-party.
Which will likely seem silly and wildly over-optimistic to observers in hindsight, and in my view should have seemed silly and wildly over-optimistic at the time too.
this was never going to work...
… without the help of an AI that is strong enough to significantly augment the proof research. which we have or nearly have now (may still be a little ways out, but no longer inconceivable). this seems like very much not a dead end, and is the sort of thing I’d expect even an AGI to think necessary in order to solve ASI alignment-to-that-AGI.
exactly what to prove might end up looking a bit different, of course.
Why do you think it was never going to work? Even if you think humans aren’t smart enough, intelligence enhancement seems pretty likely.
MIRI lost the Mandate of Heaven smh
When and why?
I think the emphasis is on inscrutable. If you didn’t already know how the deep learning tech tree would go, you could have figured out that Cyc-like hard-coding of “tires are black” &c. is not the way, but you might have hoped that the nature of the learning algorithm would naturally lend itself to a reasonably detailed understanding of the learned content: that the learning algorithm produces “concept” data-structures in this-and-such format which accomplish these-and-such cognitive tasks in this-and-such way, even if there are a billion concepts.
This is what I think he means:
The object-level facts are not written by or comprehensible to humans, no. What’s comprehensible is the algorithm the AI agent uses to form beliefs and make decisions based on those beliefs. Yudkowsky often compares gradient descent optimizing a model to evolution optimizing brains, so he seems to think that understanding the outer optimization algorithm is separate from understanding the inner algorithms of the neural network’s “mind”.
I think what he imagines as a non-inscrutable AI design is something vaguely like “This module takes in sense data and uses it to generate beliefs about the world which are represented as X and updated with algorithm Y, and algorithm Z generates actions, and they’re graded with a utility function represented as W, and we can prove theorems and do experiments with all these things in order to make confident claims about what the whole system will do.” (The true design would be way more complicated, but still comprehensible.)
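As a purely illustrative toy sketch of that kind of separable, inspectable design (the names X/Y/Z/W below are just the placeholders from the paragraph above, not anyone’s actual proposal):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ModularAgent:
    beliefs: dict                                  # belief representation "X"
    update: Callable[[dict, object], dict]         # belief-update algorithm "Y"
    propose_actions: Callable[[dict], List[str]]   # action generator "Z"
    utility: Callable[[dict, str], float]          # utility function "W"

    def step(self, observation) -> str:
        # Each stage is a separate component you can study in isolation.
        self.beliefs = self.update(self.beliefs, observation)
        candidates = self.propose_actions(self.beliefs)
        return max(candidates, key=lambda a: self.utility(self.beliefs, a))
```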
Building on what you said, pre-LLM agent foundations research appears to have made the following assumptions about what advanced AI systems would be like:
Decision-making processes and ontologies are separable. An AI system’s decision process can be isolated and connected to a different world-model, or vice versa.
The decision-making process is human-comprehensible and has a much shorter description length than the ontology.
As AI systems become more powerful, their decision processes approach a theoretically optimal decision theory that can also be succinctly expressed and understood by human researchers.
None of these assumptions ended up being true of LLMs. In an LLM, the world-model and decision process are mixed together in a single neural network instead of being separate entities. LLMs don’t come with decision-related concepts like “hypothesis” and “causality” pre-loaded; those concepts are learned over the course of training and are represented in the same messy, polysemantic way as any other learned concept. There’s no way to separate out the reasoning-related features to get a decision process you could plug into a different world-model. In addition, when LLMs are scaled up, their decision-making becomes more complex and inscrutable due to being distributed across the neural network. The LLM’s decision-making process doesn’t converge into a simple and human-comprehensible decision theory.
I think a natural way to think about this is à la AIXI, cleaving the system in two: prediction/world-modeling and values.
With this framing, I think most people would be fine with the world-model being inscrutable, if you can be confident the values are the right ones (and they need to be scrutable for this). I mean, for an ASI this kind of has to be the case: it will understand many things about the world that we don’t understand. But the values can be arbitrarily simple.
Kind of my hope for mechanistic interpretability is that we could find something isomorphic to AIXI (with inscrutable neural goop instead of Turing machines) and then do surgery on the sensory reward part. And that this is feasible because 1) the AIXI-like structure is scrutable, and 2) the values the AI has gotten from training probably will not be scrutable, but we can replace them with something that is, at least “structurally”, scrutable.
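For reference, the split being gestured at here is explicit in Hutter’s AIXI definition (roughly stated, with k the current step and m the horizon): the bracketed reward sum is the “values” part, and the weighted sum over programs q (the Solomonoff mixture) is the “world-model” part.

$$a_k := \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m} \bigl[ r_k + \cdots + r_m \bigr] \sum_{q \,:\, U(q, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}$$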
Be a little careful with this. It’s possible to make the AI do all sorts of strange things via unusual world models. E.g., a paperclip-maximizing AI can believe “everything you see is a simulation, but the simulators will make paperclips in the real world if you do X”.
If you’re confident that the world model is true, I think this isn’t a problem.
I hesitate to say “confident”. But I think you’re not gonna have world models emerging in LLMs that are wrapped in a “this is a simulation” layer... probably?
Also maybe even if they did, the procedure I’m describing, if it worked at all, would naively make them care about some simulated thing for its own sake. Not care about the simulated thing for instrumental reasons so it could get some other thing in the real world.
And so under this world model we feel doomy about any system which decides based on its values which parts of the world are salient and worth modeling in detail, and which defines its values in terms of learned bits of its world model?
Yeah, I think so, if I understand you correctly. Even on this view, you’re gonna have to interface with parts of the learned world-model, because that’s the ontology you’ll have to use when you specify the value function. Else I don’t see how you’d get it in a format compatible with the searchy parts of the model’s cognition.
So you’ll get two problems:
Maybe the model doesn’t model the things we care about
Even if it kind of models the things we care about, its conception of those concepts might be slightly different from ours. So suppose you pull off the surgery I’m describing: identify the things we care about in the model’s ontology (humans, kindness, corrigibility, etc.), stitch them together into a value function, and then implant this where the model’s learned value function would be. You still get tails-come-apart type stuff and you die.
My hope would be that you could identify some notion of corrigibility, and use that instead of trying to implant some true value function, because that could be a basin that’s stable under reflection and amplification. Although that unfortunately seems harder than the true value function route.
As (a) local AIXI researcher, I think 2 is the (very) hard part. Chronological Turing machines are an insultingly rich belief representation language, which seems to work against us here.
Yeah, I think the hope of us fully understanding a learned system was a fool’s errand, and the dream of full interpretability was never actually possible (because of the very complicated sequences of the world and the fact that indexical complexity means your brain needs to be even more complicated, and Shane Legg proved that for Turing-computable learners, you can only predict/act on complex sequences by being that complex yourself).
To be frank, I think @Shane_Legg’s paper predicted a lot of the reason why MIRI’s efforts didn’t work, because in practice, a computable theory of learning was just way more complicated than people thought at the time, and it turned out there was no clever shortcut, and the sorts of things that are easy to white-box are also the things that we can’t get because they aren’t computable by a Turing Machine.
More generally, one of the flaws in hindsight of early LW work, especially before 2012-2013, was not realizing that their attempts to relax the problem by introducing hypercomputers didn’t work to give us new ideas, and the relaxed problem had no relation to the real problem of making AI safe as AI progresses in this world, such that solutions for one problem fail to transfer to the other problem.
Here’s your citation, @Steven Byrnes, for the claim that for Turing-computable learners, you can only predict/act on complex sequences by being that complex yourself.
Is there an Elegant Universal Theory of Prediction? Shane Legg (2006):
https://arxiv.org/abs/cs/0606070
The immediate corollary is that as AI gets better, it’s inevitably going to get more and more complicated by default, and it’s not going to get any easier to interpret AIs; it will just get harder and harder to interpret the learned parts of the AI.
I think some of the optimism about scrutability might derive from reductionism. Like, if you’ve got a scrutable algorithm for maintaining a multilevel map, and you’ve got a scrutable model of the chemistry of a tire, you could pass through the multilevel model to find the higher-level description of the tire.
I suspect the crux here is whether or not you believe it’s possible to have a “simple” model of intelligence. Intuitively, the question here is something like, “Does intelligence ultimately boil down to some kind of fancy logic? Or does it boil down to some kind of fancy linear algebra?”
The “fancy logic” view has a long history. When I started working as a programmer, my coworkers were veterans of the 80s AI boom and the following “AI winter.” The key hope of those 80s expert systems was that you could encode knowledge using definitions and rules. This failed.
But the “fancy linear algebra” view pulled ahead long ago. In the 90s, researchers in computational linguistics, computer vision and classification realized that linear algebra worked far better than fancy collections of rules. Many of these subfields leaped ahead. There were dissenters: Cyc continued to struggle off in a corner somewhere, and the semantic web tried to badly reinvent Prolog. The dissenters failed.
The dream of Cyc-like systems is eternal, and each new generation reinvents it. But it has systematically lost on nearly every benchmark of intelligence.
Fundamentally, real world intelligence has a number of properties:
The input is a big pile of numbers. Images are a pile of numbers. Sound is a pile of numbers.
Processing that input requires weighing many different pieces of evidence in complex ways.
The output of intelligence is a probability distribution. This is most obvious for tasks like speech recognition (“Did they say X? Probably. But they might have said Y.”)
When you have a giant pile of numbers as input, a complex system for weighing those numbers, and a probability distribution as output, then your system is inevitably something very much like a giant matrix. (In practice, it turns out you need a bunch of smaller matrices connected by non-linearities.)
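To make the “smaller matrices connected by non-linearities” point concrete, here’s a toy sketch (illustrative only, with made-up sizes; not any particular production system):

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-layer network: a pile of numbers in, a probability distribution out.
W1 = rng.normal(size=(784, 128))   # first matrix ("giant" in real systems)
W2 = rng.normal(size=(128, 10))    # second matrix

def softmax(z):
    z = z - z.max()                # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict(pixels):
    """pixels: a length-784 pile of numbers, e.g. a flattened 28x28 image."""
    hidden = np.maximum(0, pixels @ W1)   # matrix multiply + ReLU non-linearity
    return softmax(hidden @ W2)           # output: a probability distribution

probs = predict(rng.normal(size=784))
assert probs.shape == (10,) and abs(probs.sum() - 1.0) < 1e-9
```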
Before 2022, it appeared to me that Yudkowsky was trapped in the same mirage that trapped the creators of Cyc and the Semantic Web and 80s expert systems.
But in 2025, Yudkowsky appears to believe that the current threat absolutely comes from giant inscrutable matrices. And as far as I can tell, he has become very pessimistic about any kind of robust “alignment”.
Personally, this is also my viewpoint: There is almost certainly no robust version of alignment, and even “approximate alignment” will come under vast strain if we develop superhuman systems with goals. So I would answer your question in the affirmative: As far as I can see, inscrutability was always inevitable.
I appreciate the clear argument as to why “fancy linear algebra” works better than “fancy logic”.
And I understand why things that work better tend to get selected.
I do challenge “inevitable” though. It doesn’t help us to survive.
If linear algebra probably kills everyone but logic probably doesn’t, tell everyone and agree to prefer to use the thing that works worse.
Thank you for your response!
To clarify, my argument is that:
1. Logic- and rule-based systems fell behind in the 90s. And I don’t see any way that they are ever likely to work, even if we had decades to work on them.
2. Systems with massive numbers of numeric parameters have worked exceptionally well, in many forms. Unfortunately, they’re opaque and unpredictable, and therefore unsafe.
3. Given these two assumptions, the only two safety strategies are: (a) A permanent, worldwide halt, almost certainly within the next 5-10 years. (b) Build something smarter and eventually more powerful than us, and hope it likes keeping humans as pets, and does a reasonable job of it.
I strongly support (3a). But this is a hard argument to make, because the key step of the argument is that “almost every successful AI algorithm of the past 30 years has been an opaque mass of numbers, and it has gotten worse with each generation.”
Anyway, thank you for giving me an opportunity to try to explain my argument a bit better!
Simple first-order learning algorithms have types of patterns they recognize, and meta-learning algorithms also have types of patterns they like.
In order to make a friendly or aligned AI, we will have to have some insight into what types of patterns we are going to have it recognize, and separately what types of things it is going to like or find salient.
There was a simple calculation protocol which generated GPT-3. The part that was not simple was translating that into predicting its preferences or perceptual landscape, and hence what it would do after it was turned on. And if you can’t predict how a parameter will respond to input, you can’t architect it one-shot.