FWIW I suspect the brain has fewer inductive biases than it looks like it does when you try to reverse engineer it from lesion studies.
If you did lesion studies on GPT it would look like it’s made from a lot more parts than it is.
Why the Job Market Still Looks Intact
I think the answer is more subtle than this. The key difference between the current AI automation wave and previous automation waves is that previous automation was done with purpose-built technologies specific to each task you wanted to automate. For example, if you want to automate screwing the lid onto a container, you build a special robot with special programming that just screws the lid onto that container. If you have a human role where a person does three things like that, then you build one machine to put the two halves of the container together, a second machine to fold the item into the container, and a third machine to screw the lid on, and the role is automated. That is, you deliberately guide the development to automate a whole role rather than randomly building machines to do individual tasks until you happen to automate a role by chance. But because the LLM is a general purpose technology whose performance on any given task is pseudorandom and not controlled by the firm using it (so no specialized datasets, usually), what you get instead are threshold effects: the LLM might be able to do 90% of a job, but the last 10% still needs to be delegated to a human until the LLM happens to advance to the point where it can do 100% of the job. This means it won't necessarily be obvious that a job is about to be automated until models cross that crucial final 5-10% threshold where they can do the entire role rather than just large fractions of it.
Basically, model a role or job as being made of n tasks, say twenty for a complex role. Each generation the LLM becomes able to do two more tasks in the role on average. The LLM might suddenly become able to do thousands and thousands of new individual tasks on each model release, but it's kind of a grab bag and not being targeted at automating any particular role (save perhaps AI development). So even though the LLM can do many new things, you still have to wait ten generations of releases before it actually automates all 20 tasks in the role and eliminates the job. By contrast, older automation would automate a few things at a time, but extremely specifically targeted to eliminate roles and reduce headcount.
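Here is a minimal sketch of that toy model in Python, assuming the deterministic version stated above (20 tasks, 2 new tasks covered per generation); the numbers are illustrative, not an empirical claim:

```python
def automation_timeline(n_tasks=20, tasks_per_generation=2):
    """Toy model from the paragraph above: a role is n_tasks tasks and each
    model generation adds tasks_per_generation of them to the LLM's repertoire.
    Capability coverage rises smoothly, but the job itself only disappears at
    the threshold where coverage hits 100%."""
    covered = 0
    for generation in range(1, n_tasks // tasks_per_generation + 1):
        covered += tasks_per_generation
        coverage = covered / n_tasks
        print(f"gen {generation:2d}: can do {coverage:4.0%} of the role, "
              f"job eliminated: {coverage >= 1.0}")

automation_timeline()
```

The visible signal (coverage) climbs every release while the economic signal (the job existing) stays flat until the very last line, which is the threshold effect in miniature.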
I mean, literally Stable Diffusion. If you mean in the brain I would have to refresh my memory but I vaguely remember looking at the audio encoding path and realizing it’s something like “These monomodal audio encoders are spatially close to the ear and then feed into multimodal encoders which take the encoded audio as an input and then further feed into e.g. Wernicke’s Area”, but that’s a very lossy memory and I probably have the details wrong.
But confusion is in the map, not in the territory.
Confusion can in fact be in the irreducible complexity, and therefore in the territory. "It is not possible to represent the 'organizing principle' of this network in fewer than 500 million parameters, which do not fit into any English statement or even any conceivably humanly readable series of English statements." Shannon entropy can be like that sometimes.
Rather, I would feel awful, because in your sketched-out world you just cannot realistically reach the level of understanding you would need to feel safe ceding the trump card of being the smartest kind of thing around.
I think there are achievable alignment paths that don't flow through precise mechanistic interpretability. I should write about some of them. But I also don't think what I'm saying precludes, as you say, having understanding of individual phenomena in the network. It's mostly an argument against there being a way more legible way you could have done this if people had just listened to you; that is probably not true and your ego has to let it go. You have to accept the constraints of the problem as they appear to present themselves.
Well, you don't have to do anything, but unless you have some kind of deep fundamental insight here your prior should be that successful alignment plans look more like relying on convergence properties than on aesthetically beautiful 'clean room' cognitive architecture designs. There might be some value in decomposing GPT into parts, but I would submit these parts are still going to form a system whose downstream consequences are very difficult to predict in the way I think people usually imply when they say these things. You know, they want it to be like a rocket launch where we can know in principle what coordinate position X, Y, Z we will be in at time t. I think the kinds of properties we can guarantee will be more like "we wind up somewhere in this general region in a tractable amount of time so long as an act of god does not derail us".
[Lightly Claude-cleaned Transcript of me talking]
So I know it’s beside the point of your post, and by no means the core thesis, but I can’t help but notice that in your prologue you write this:
“A serious, believable AI alignment agenda would be grounded in a deep mechanistic understanding of both intelligence and human values. Its masters of mind engineering would understand how every part of the human brain works and how the parts fit together to comprise what their ignorant predecessors would have thought of as a person. They would see the cognitive work done by each part and know how to write code that accomplishes the same work in pure form.”
I have to admit this bugs me. It bugs me specifically because it triggers my pet peeve of “if only we had done the previous AI paradigm better, we wouldn’t be in this mess.” The reason why this bugs me is it tells me that the speaker, the writer, the author has not really learned the core lessons of deep learning. They have not really gotten it. So I’m going to yap into my phone and try to explain — probably not for the last time; I’d like to hope it’s the last time, but I know better, I’ll probably have to explain this over and over.
I want to try to explain why I think this is just not a good mindset to be in, not a good way to think about things, and in fact why it focuses you on possibilities and solutions that do not exist. More importantly, it means you’ve failed to grasp important dimensions of alignment as a problem, because you’ve failed to grasp important dimensions of AI as a field.
I think we can separate AI into multiple eras and multiple paradigms. If you look at these paradigms, there’s a lot of discussion about AI where the warrant for taking a particular concept seriously is kind of buried under old lore that, if you then examine it, makes the position much more absurd or much less easily justifiable than if you were just encountering it fresh — never having been suggested to you by certain pieces of evidence at certain times.
I would say that AI as a concept gets started in the 50s with the MIT AI lab. The very first AI paradigm is just fiddling around. There is no paradigm. The early definition of AI would include many things that we would now just consider software — compilers, for example, were at one point considered AI research. Basically any form of automation of human reasoning or cognitive labor was considered AI. That’s a very broad definition, and it lasts for a while. My recollection — and I’m just yapping into my phone rather than consulting a book — is that this lasts maybe until the late 60s, early 70s, when you get the first real AI paradigm: grammar-based AI.
It’s also important to remember how naive the early AI pioneers were. There’s the famous statement from the Dartmouth conference where they say something like, “we think if you put a handful of dedicated students on this problem, we’ll have this whole AGI thing solved in six months.” Just wildly, naively optimistic, and for quite a number of years. You can find interviews from the 60s where AI researchers believe they’re going to have what we would now basically consider AGI within a single-digit number of years. It in fact contributed to the first wave of major automation panic in the 60s — but that’s a different subject and I’d have to do a bunch of research to really do it justice.
The point is that it took time to be disabused of the notion that we were going to have AGI in a couple of years because we had the computer. Why did people ever think this in the first place? You look at all the computing power needed to do deep learning, you look at the computational requirements to run even a good compiler, and these computers back then were tiny — literally kilobytes of RAM, minuscule CPU power, minuscule memory. How could they ever think they were on the verge of AGI?
The answer is that their reasoning went: the kinds of computations the computer can be programmed to do — math problems, calculus problems — are the hardest human cognitive abilities. The things the computer does so easily are the hardest things for a human to do. Therefore, the reasoning went, if we’re already starting from a baseline of the hardest things a human can do, it should be very easy to get to the easiest things — like walking.
And this is where the naive wild over-optimism comes from. What we eventually learned was that walking is very hard. Even piloting a little insect body is very hard. Replicating the behavior of an insect — the pathfinding, the proprioceptive awareness, the environmental awareness of an insect — is quite difficult. Especially on that kind of hardware, it’s basically impossible.
Once people started to realize this, they settled into the first real AI paradigm: grammar-based AI. What people figured was that you have these compilers — the Fortran compiler, the Lisp interpreter had been invented by then, along with some elaborations. Compilers seem to be capable of doing complex cognitive work. They can unroll a loop, they can do these intricate programming tasks that previously required a dedicated person to hand-specify all the behaviors. A compiler is capable of fairly complex translation between a high-level program and the detailed behaviors the machine should do to implement that behavior efficiently — behaviors that previously would have had to be hand-specified by a programmer.
For anyone unfamiliar with compilers: the way a compiler basically works, as a vast oversimplification, is that it has a series of rules in what’s called a context-free grammar. The thing that distinguishes a context-free grammar from a natural grammar is that you are never reliant on context outside the statement itself for the meaning of the statement — or at least, any context you need, like a variable name, is formally available to the compiler. Statements in a context-free grammar have no ambiguity; there is always an unambiguous, final string you can arrive at. You never have to decide between two ambiguous interpretations based on context.
The thought process was: we have these compilers, and they seem capable of using a series of formal language steps to take high-level intentions from a person and translate them into behaviors. They even have, at least the appearance of, autonomy. Compilers are capable of thinking of ways to express the behavior of high-level code that the programmer might not even have thought of. There’s a sense of genuine cognitive autonomy from the programmer — you’re able to get out more than you’re putting in. I think there’s a metaphor like “some brains are like fishing, you put one idea in and you get two ideas out.” That seems like it was kind of the core intuition behind formal grammar AI: that a compiler follows individually understandable rules and yet produces behaviors that express what the programmer meant through ways the programmer would not have thought of themselves. You start to feel the machine becoming autonomous, which is very attractive.
This also lined up with the theories of thinkers like Noam Chomsky. The entire concept of the context-free grammar as distinct from the natural grammar is, as I understand it, a Chomsky concept. So it's really the Chomsky era of AI. This is the era of systems like EURISKO. You also have computer algebra systems — Maxima being the classic example. A computer algebra system is the kind of thing that now we'd just consider software, but at the time it was considered AI.
This is one of the things John McCarthy famously complained about when he said, “if it starts to work, they stop calling it AI.” When they were developing systems like Maxima, those were considered AI. And what they were, were systems where you could give it an algebra expression and it would do the cognitive labor of reducing it to its final form using a series of production rules — which is everything a compiler does, as I was trying to explain. A compiler starts with a statement expressed in a formal grammar, applies a series of production rules — which you can think of as heuristics — and the grammar specification basically tells you: given this state of the expression, what is the next state I should transition to? You go through any number of steps until you reach a terminal, a state from which there are no more production rules to apply. It’s the final answer. When you’re doing algebra and you take a complex expression and reduce it to its simplest form using a series of steps, that’s basically what this is: applying production rules within a formal grammar to reduce it to a terminal state.
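To make the production-rule picture concrete, here is a toy string-rewriting sketch in Python. It is nothing like Maxima's real internals, just an illustration of "apply rules until no rule applies":

```python
# Toy illustration of production rules in a formal grammar: rewrite the
# expression until it reaches a terminal state where no rule matches.
RULES = [
    ("(x+0)", "x"),
    ("(0+x)", "x"),
    ("(x*1)", "x"),
    ("(x*0)", "0"),
]

def reduce_to_terminal(expr):
    while True:
        for pattern, replacement in RULES:
            if pattern in expr:
                expr = expr.replace(pattern, replacement, 1)
                break
        else:
            return expr  # terminal: no production rule applies, this is the answer

print(reduce_to_terminal("((x+0)*1)"))  # -> "x" after two rule applications
```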
I’m not saying these systems were useless, especially the more practically focused ones like Maxima. But in terms of delivering autonomous, interesting thinking AI, they’re pretty lackluster. I think the closest we got, arguably, was EURISKO, and I’m kind of inclined to think that EURISKO is sort of fake. I don’t really believe most of that story.
The formal grammar paradigm has a couple of problems. I think the core problem is articulated fairly well by Allen Newell in the final lecture he gives before he dies. The core problem is something like: let’s ignore the problem of the production rules for a minute. Let’s say your production rules are perfect — you have a perfect set of problem-solving heuristics that can take you from a starting symbolic program state to a final problem solution. It doesn’t matter how brilliant your problem-solving heuristics are if you can’t even start the problem off in the right state.
To give a concrete example I use all the time: you want to go downstairs and fetch a jug of milk from the fridge. This is a task that essentially any person can do. Even people who score as mentally disabled on an IQ test can generally go down the stairs and grab a jug of milk from the fridge. It’s so basic we don’t even think of it as difficult. But then think about how you’d get a robot to do that autonomously — not programming it step by step to do one exact mechanical motion, but saying “hey, go grab me a jug of milk” and having it walk down the stairs, walk to the fridge, open the fridge, recognize the milk jug, grab it, and walk back. It’s completely intractable. It’s not just that the problem-solving heuristics can’t do it — the formal grammar approach of taking a formal symbol set and applying transformations to it cannot do this thing even in principle. There is no humanly conceivable set of problem-solving heuristics that is going to let you, starting from a raw bitmap of a room or hallway or stairs, autonomously identify the relevant features of the problem at each stage and accomplish the task. Not happening. And it’s not that it’s not happening because you’re not good enough. It’s not happening because the whole paradigm has no way to even conceive of how it would do this.
I could go into all kinds of reasons why problem-solving heuristics based on a formal grammar are just going to be intractable, but I do think Allen Newell has it exactly right. The fundamental problem is not just that this thing isn't good enough — it really cannot be good enough even in principle. Even if you have the production rules part perfect, the paradigm still has no way, even in principle, to do this extremely important thing that you would always want your AI to do and that humans empirically can do. And you can't appeal to the task itself being fundamentally impossible; humans do it, so clearly there is a way to do it.
I really like the way Allen Newell phrases this when he says that the purpose of cognitive architecture as a field is to try to answer the question: how can the human mind occur in the physical universe? He threw that out as an articulation of the core question in his final lecture. I think it’s brilliant. We can now ask a different but closely related question: how can GPT occur in the physical universe? The difference is that this question is much more tractable.
So formal grammar AI didn’t work, and yet it was pursued for a very long time — arguably even as recently as the 90s, there were people genuinely still working on it. It never really died culturally or academically. I think the reason it never died academically is that it’s just aesthetically satisfying. Looking back on it, I think Dreyfus comparing it to alchemy was completely appropriate. It’s basically the Philosopher’s Stone — this very nice feel-good thing that it would be really cool if you could do. It’s an appealing myth, an attractive object in latent space that draws people towards it but from which they can’t escape. It’s an illusion. I honestly do not think formal grammar-based AI is a thing permitted by our universe to exist, at least not in the kind of way its creators envisioned it.
So what else can you do? The next paradigm is something like Victor Glushkov’s genetic algorithms. The idea there is probably quite similar to deep learning, but deep learning implements it in a way that is actually practically implementable. The way genetic algorithms are supposed to work is that you implement a cost function — what we today call a loss function — and you’re going to use random mutations on some discrete symbolic representation of the problem or solution. The cost function tells you if you are getting closer or farther from the solution, which means your problem needs to be at least differentiable in the sense that there’s a clear, objective way to score the performance of a solution and the scoring can be granular enough that you can know if you’re getting closer or farther based on small changes.
The first big problem you run into is that random mutations and discrete programs do not mix together well. How do you make a program representation where you can do these kinds of mutations? You need mutations that have a regular structure so they don’t just destroy your programs, or you need a form of program representation that works well under the presence of random mutations. That’s just really hard to do with discrete programs. I don’t think anyone ever really cracked it.
The other problem, which is related, is the credit assignment problem. You know, one good idea is: what if we constrain our mutations to the parts of the program that are not working? If we know roughly where the error is, we can constrain our mutations to that part instead of breaking random stuff that is functioning. That’s a great idea and it will definitely narrow your search space. But how do you do that? Unless you have some way to take the cost function and calculate the gradient of change with respect to the program representation, there’s no way to find the part of the program you need to modify. So what you end up doing is random mutations, and the search space is just way too wide.
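A minimal sketch of that recipe, assuming a (1+1) hill climber rather than a full population-based genetic algorithm, and a made-up string "program" as the discrete representation. The point it illustrates is that the cost function only says "closer or farther"; it never says which part to mutate:

```python
import random

TARGET = "print hello world"   # stand-in for a desired program behavior
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def cost(candidate):
    # The cost function can score a candidate, but it carries no gradient:
    # it cannot tell you which position is wrong, only how wrong you are overall.
    return sum(a != b for a, b in zip(candidate, TARGET))

def mutate(candidate):
    i = random.randrange(len(candidate))   # mutation site chosen blindly
    return candidate[:i] + random.choice(ALPHABET) + candidate[i + 1:]

def evolve(steps=20000):
    current = "".join(random.choice(ALPHABET) for _ in TARGET)
    for _ in range(steps):
        child = mutate(current)
        if cost(child) <= cost(current):   # keep only if we got no farther away
            current = child
    return current, cost(current)

print(evolve())
```

Even on a 17-character toy this takes thousands of blind mutations to converge; scale the representation up to a real program and the search space blows up, which is the credit assignment problem in miniature.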
Based on the intractability of this particular approach, a lot of people concluded that AGI was just not possible. There used to be a very common story that went something like: we can’t do AGI because human intelligence is the product of a huge program search undergone by evolution, and the way evolution did it was by throwing the equivalent of zettaflops of CPU processing power at it — amounts of compute we’ll just never have access to. Therefore, we’re not going to have AGI anytime this century, if ever, because you would basically have to recapitulate all of evolution to get something comparable to a human brain. And we know this because we tried the Glushkov thing and it did not work. I think you can see how that prediction turned out. But it was plausible at the time.
The other thing people started doing that was actually quite practical was expert systems. The way an expert system works is basically that you have a knowledge base and a decision tree. Where you get the decision tree is you take an actual human expert who knows how to do a task — say, flying an airplane — and you formally represent the problem state in a way legible to the decision tree. You just copy what a human would do at each state. These things often didn’t generalize very well, but if you did enough hours of human instruction and put the system into enough situations with a human instructor and recorded enough data and put it into a large enough decision tree with a large enough state space and had even a slight compressive mechanism for generalization — this was enough to do certain tasks, or at least start to approximate them, even if it would then catastrophically fail in an unanticipated situation.
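A sketch of the expert-system shape described above, with a knowledge base of recorded expert decisions keyed on a formalized problem state; the states and actions here are invented for illustration:

```python
# Knowledge base: what the human expert did in each formally represented state.
KNOWLEDGE_BASE = {
    ("cruise", "engine_fire"):      "shut off fuel, run the fire checklist",
    ("cruise", "depressurization"): "don masks, descend to 10,000 ft",
    ("landing", "gear_unsafe"):     "go around, recycle the gear",
}

def advise(phase, fault):
    action = KNOWLEDGE_BASE.get((phase, fault))
    if action is None:
        # The classic failure mode: a state the expert was never recorded in,
        # where the system has nothing sensible to say.
        return "NO RULE FOUND"
    return action

print(advise("cruise", "engine_fire"))   # copied expert behavior
print(advise("takeoff", "bird_strike"))  # unanticipated situation, catastrophic shrug
```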
And the thing is, this reminds one a lot of deep learning. I’m not saying deep learning is literally just a giant decision tree — I think the generalization properties of deep learning are too good for that. But deep learning does in fact have bizarre catastrophic failures out of distribution and is very reliant on having training examples for a particular thing. This story sounds very familiar. The expert system was also famously inscrutable. You’d make one, and you could ask how it accomplishes a task, and the interpretability chain would look like: at this state it does this, at this state it does this, at this state it does this. And if you want to know why it does that? Good luck. This story, again, sounds very familiar.
So then you have the next paradigm — expert systems are maybe the 90s — and then in the 2000s you get early statistical learning: Solomonoff-type things, boosting. Boosting trees is a clever method to take weak classifiers and combine them into stronger classifiers. If you throw enough tiny little classifiers together with uncorrelated errors, you get a strong enough signal to make decisions and do classification. There are certain problems you can do fairly well with boosting.
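A sketch of the "combine weak classifiers" intuition, using a plain majority-vote ensemble rather than AdaBoost proper, with simulated classifiers whose errors are independent:

```python
import random

def weak_classifier(truth, accuracy=0.55):
    # Barely better than a coin flip, with errors independent across classifiers.
    return truth if random.random() < accuracy else 1 - truth

def ensemble_predict(truth, n_classifiers=501):
    votes = sum(weak_classifier(truth) for _ in range(n_classifiers))
    return 1 if votes > n_classifiers / 2 else 0

trials, correct = 2000, 0
for _ in range(trials):
    truth = random.randint(0, 1)
    correct += ensemble_predict(truth) == truth

print(f"ensemble accuracy: {correct / trials:.1%}")  # far above any single member's 55%
```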
And then there’s 2012, you get AlexNet.
There's a talk I really like from Alan Kay called "Programming and Scaling" where he points out that if you take all the code for a modern desktop system — say, Microsoft Windows and Microsoft Office — that's something like 400 million lines of code. If you stacked all that code as printed paper, it would be as tall as the Empire State Building. The provocative question he asks is: do you really need all that code just to specify Microsoft Word and Microsoft Windows? That seems like a lot of code for not that much functionality.
And I agree with him. Alan Kay’s theory for why it requires so much code is that it’s essentially malpractice on the part of software engineers — that software engineers work with such terrible paradigms, their abstractions are so bad, that 400 million lines is just what it takes to express it with their poor understanding. If we had a better ontology, a better kind of abstraction, we could express it much more compactly.
I agreed, and for a long time I just accepted this as the case — this was also my hypothesis. What I finally realized after looking at deep learning was that I was wrong.
Here’s the thing about something like Microsoft Office. Alan Kay will always complain that he had word processing and this and that and the other thing in some 50,000 or 100,000 lines of code — orders of magnitude less code. And here’s the thing: no, he didn’t. I’m quite certain that if you look into the details, what Alan Kay wrote was a system. The way it got its compactness was by asking the user to do certain things — you will format your document like this, when you want to do this kind of thing you will do this, you may only use this feature in these circumstances. What Alan Kay’s software expected from the user was that they would be willing to learn and master a system and derive a principled understanding of when they are and are not allowed to do things based on the rules of the system. Those rules are what allow the system to be so compact.
You can see this in TeX, for example. The original TeX typesetting system can do a great deal of what Microsoft Word can do. It’s somewhere between 15,000 and 150,000 lines of code — don’t quote me on that, but orders of magnitude less than Microsoft Word. And it can do all this stuff: professional quality typesetting, documents ready to be published as a math textbook or professional academic book, arguably better than anything else of its kind at the time. And the way TeX achieves this quality is by being a system. TeX has rules. Fussy rules. TeX demands that you, the user, learn how to format your document, how to make your document conform to what TeX needs as a system.
Here’s the thing: users hate that. Despise it. Users hate systems. The last thing users want is to learn the rules of some system and make their work conform to it.
The reason why Microsoft Word is so many lines of code and so much work is not malpractice — it would only be malpractice if your goal was to make a system. Alan Kay is right that if your goal is to make a system and you wind up with Microsoft Word, you are a terrible software engineer. But he’s simply mistaken about what the purpose of something like Microsoft Word is. The purpose is to be a virtual reality — a simulacrum of an 80s desk job. The purpose is to not learn a system. Microsoft Word tries to be as flexible as possible. You can put thoughts wherever you want, use any kind of formatting, do any kind of whatever, at any point in the program. It goes out of its way to avoid modes. If you want to insert a spreadsheet into a Word document anywhere, Microsoft Word says “yeah, just do it.”
It’s not a system. It’s a simulacrum of an 80s desk job, and because of that the code bloat is immense, because what it actually has to do is try to capture all the possible behaviors in every context that you could theoretically do with a piece of paper. Microsoft Word and PDF formats are extremely bloated, incomprehensible, and basically insane. The open Microsoft Word document specification is basically just a dump of the internal structures the Microsoft Word software uses to represent a document, which are of course insane — because Microsoft Word is not a system. The implied data structure is schizophrenic: it’s a mishmash of wrapped pieces of media inside wrapped pieces of media, with properties, and they’re recursive, and they can contain other ones. This is not a system.
For that reason, you wind up with 400 million lines of code. And what you’ll notice about 400 million lines of code is — hey, that’s about the size of the smallest GPT models. You know, 400 million parameters. If you were maximally efficient with your representation, if you could specify it in terms of the behavior of all the rest of the program and compress a line of code down on average to about one floating point number, you wind up with about the size of a small GPT-2 type network. I don’t think that’s an accident. I think these things wind up the size that they are for very similar reasons, because they have to capture this endless library of possible behaviors that are unbounded in complexity and legion in number.
I think that’s a necessary feature of an AI system, not an incidental one. I don’t think there is a clean, compressed, crisp representation. Or at least, to the extent there is a clean crisp representation of the underlying mechanics, I think that clean crisp implementation is: gradient search over an architecture that implements a predictive objective. That’s it. Because the innards are just this giant series of ad hoc rules, pieces of lore and knowledge and facts and statistics, integrated with the program logic in a way that’s intrinsically difficult to separate out, because you are modeling arbitrary behaviors in the environment and it just takes a lot of representation space to do that.
And if the expert system — just a decision tree and a database — winds up basically uninterpretable and inscrutable, you better believe that the 400-million-line Microsoft Office binary blob is too. Or the 400-million-parameter GPT-2 model that you get if you insist on making a simulacrum of the corpus of English text. These things have this level of complexity because it’s necessary complexity, and the relative uninterpretability comes from that complexity. They are inscrutable because they are giant libraries of ad hoc behaviors to model various phenomena.
Because most of the world is actually complication. This is another thing Alan Kay talks about — the complexity curve versus the complication curve. If you have physics brain, you model the world as being mostly fundamental complexity with low Kolmogorov complexity, and you expect some kind of hyperefficient Solomonoff induction procedure to work on it. But if you have biology brain or history brain, you realize that the complication curve of the outcomes implied by the rules of the cellular automaton that is our reality is vastly, vastly bigger than the fundamental underlying complexity of the basic rules of that automaton.
Another way to put this, if you’re skeptical: the actual program size of the universe is not just the standard model. It is the standard model plus the gigantic seed state after the Big Bang. If you think of it like that, you realize the size of this program is huge. And so it’s not surprising that the model you need to model it is huge, and that this model quickly becomes very difficult to interpret due to its complexity.
This also applies when you go back to thinking about distinct regions of the brain. When we were doing cognitive science, a very common approach was to take a series of ideas for modules — you have a module for memory, a module for motor actions or procedures, one for this, one for that — and wire them together into a schematic and say, “this is how cognition works.” This is the cognitive architecture approach, which reaches its zenith in something like the ACT-R model — where you have production rules that produce tokens, by the way. And if you’re influenced by this “regions of the brain” perspective, you are thinking in terms of grammar AI. Even if you say “no, no, I didn’t want to implement grammar AI, I want to implement it as a bunch of statistical learning models that produce motor tokens” — uh huh. Yeah, exactly. And let me guess, you’re going to hook up these modules like the cognitive architecture schematic? Well, buddy.
At the time we were doing cognitive architecture, the only thing we knew about intelligence was that humans have it. If we take the brain and look at natural injuries — we’re largely not willing to deliberately cause injuries just to learn what they do, but we can take natural lesions and say: a lesion here causes this capability to be disrupted, and one here is associated with these capabilities being disrupted. Therefore, this region must cause these capabilities. That’s a fair enough inference. But because your only known working example is this hugely complex thing —
Imagine if we had GPT as a black box and didn’t know anything about it. You could have some fMRI-style heat map of activations in GPT during different things it does, and you’d say, “oh, over here is animals, over here is this, over here is that.” Then you start knocking out parts and say, “ah, this region does this thing, and that region does that thing, and therefore these must be a series of parts that go together.” You would probably be very confused. This would probably not bring you any closer to understanding the actual generating function of GPT.
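As a toy version of that thought experiment, here is a sketch (with a random, untrained network standing in for GPT and arbitrary input distributions standing in for "tasks") of how lesioning blocks of units produces a tidy-looking lesion-to-capability map even when nothing was built from discrete parts:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(64, 16)), rng.normal(size=(256, 64))

def network(x, lesion=None):
    h = np.tanh(W1 @ x)
    if lesion is not None:
        h[lesion] = 0.0  # "lesion": zero out a contiguous block of hidden units
    return np.tanh(W2 @ h)

# Arbitrary input distributions standing in for different capabilities.
tasks = {name: rng.normal(loc=loc, size=(32, 16))
         for name, loc in [("faces", -2.0), ("tools", 0.0), ("places", 2.0)]}

for start in range(0, 64, 16):
    lesion = slice(start, start + 16)
    for name, inputs in tasks.items():
        shift = np.mean([np.linalg.norm(network(x) - network(x, lesion)) for x in inputs])
        print(f"lesion units {start:2d}-{start + 15:2d}: '{name}' output shift {shift:.2f}")
```

Some blocks will hit some "tasks" harder than others purely by accident of the random weights, which is exactly the kind of pattern a lesion study would read back as a parts list.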
I get this suspicion when I think about the brain and its regions. Are they actually, meaningfully, like a parts list? Like a series of gears that go together to make the machine move? Or is it more like a very rough set of inductive biases that then convergently reaches that shape as it learns? I have no idea. I assume there must be some kind of architecture schematic, especially because there are formative periods — and formative periods imply an architecture, kind of like the latent diffusion model where you train a VQVAE and then train a model on top of it. Training multimodal encoders on top of single-modality encoders seems like the kind of thing you would do in a brain, so I can see something like that.
But just looking at the architecture of the brain — which you can do on Google Scholar — you learn, for example, about Wernicke’s area and Broca’s area. Wernicke’s area appears to be an encoder-decoder language model. If you look at the positioning of Wernicke’s area and what other parts of the brain are around it, you realize it seems to be perfectly positioned to take projections from the single and multimodal encoders in the other parts of the brain. So presumably Wernicke’s area would be a multimodal language encoding model that takes inputs from all the other modalities, and then sends the encoded idea to Broca’s area, which translates it into motor commands. It is a quite legible architecture, at least to me.
I think if you did actually understand it, you would basically understand each individual region in about as much detail as you understand a GPT model. You’d understand its objective, you’d understand how it feeds into other models. You wouldn’t really understand how it “works” beyond that, because the answer to that question is: like, not how things work. Things don’t — I don’t know how to explain to you. I don’t think there is like a master algorithm that these things learn. I don’t think there was some magic one weird trick that, if you could just pull it out of the network, would make it a thousand times more efficient. I don’t think that’s what’s going on.
The thing with latent diffusion, for example, is that it turns out to be very efficient to organize your diffusion model in the latent space of a different model and then learn to represent concepts in that pre-existing latent space. I would not be surprised if the brain uses that kind of trick all the time, and that the default is to train models in the latent space of another model. So it’s not just a CLIP — it’s a latent CLIP. You have raw inputs that get encoded, then a model that takes the encoded versions and does further processing to make a multimodal encoding, which is then passed on to some other network that eventually gets projected into Wernicke’s area, and so on.
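A minimal sketch of "train one network in the frozen latent space of another", in the latent-diffusion spirit described above; the shapes, the data, and the reconstruction objective are all invented for illustration:

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained single-modality encoder (e.g. an audio or image encoder).
encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 16))
encoder.requires_grad_(False)  # frozen: the second model never touches raw inputs

# The downstream model lives entirely in the encoder's 16-dim latent space.
latent_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
opt = torch.optim.Adam(latent_model.parameters(), lr=1e-3)

for step in range(100):
    x = torch.randn(32, 784)            # stand-in for raw sensory input
    with torch.no_grad():
        z = encoder(x)                  # project into the pre-existing latent space
    loss = nn.functional.mse_loss(latent_model(z), z)  # toy objective over latents
    opt.zero_grad(); loss.backward(); opt.step()
```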
The things that you would find if you took apart the brain and separated it into regions — if you look at fMRI studies on which we base claims about “a region for something” — often what’s being tested is something like a recognition test: if you show someone a face, what part of the brain lights up? And you test on maybe three things, and you say “oh, this part of the brain is associated with recognizing faces doing this and that, therefore this is the face-recognizing region.” You have to ask yourself: is it the face-recognizing region of the brain, or is recognizing faces just one of the three things anyone happened to test? It’s not like there are that many fMRI brain studies. There’s a limited number of investigations into what part of what is encoded where.
There’s a study out there where they show people Pokémon and find a particular region of the brain where Pokémon get encoded. And if you said, “ah yes, this is the Pokémon region, dedicated to Pokémon” — obviously there are no Pokémon in the ancestral environment, and obviously that would be imbecile reasoning. So there’s a level of skepticism you need when reading studies that say “this is the region of the brain dedicated to this.” Is it dedicated to that, or is that just one of the things it processes?
I think the brain is quite legible if you interpret it as a series of relatively general-purpose networks that are wired together to be trained in the latent space of other networks. It’s a fairly legible architecture if you interpret it that way, in my opinion.
And so. What I'm trying to say is: there is no royal road to understanding. There's no magic. There's no "ah yes, if we just had a superior science of how the brain really works" — nope. This is how it really works. The way it really, really works is: while you're doing things, you have experiences, and these experiences are encoded in some kind of context window. I don't know exactly how the brain's context window works, but depending on how you want to calculate how many tokens the brain produces per second in the cognitive architecture sense, I personally choose to believe that the brain's context window is somewhere between one and three days' worth of experience. The last time I did the napkin math it came out to something like 4.75 million tokens of context — maybe it was 7 million, I don't remember the exact number, but I remember it was more tokens than Claude will process in a context, though still a single-digit number of millions. At some point you'll hit that threshold, and then you'll be able to hold as many experiences in short-term memory as a human can.
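One hedged way that napkin math could go (the tokens-per-second rate is the load-bearing assumption, and is chosen here just so the numbers land near the figures mentioned above):

```python
# Assumed rate of "experience tokens"; ~27/s makes three waking days land near
# the ~4.75M figure above. The true rate, if the question is even well posed,
# is anyone's guess.
tokens_per_second = 27
waking_hours_per_day = 16

for days in (1, 2, 3):
    tokens = tokens_per_second * waking_hours_per_day * 3600 * days
    print(f"{days} day(s) of waking experience ~ {tokens / 1e6:.1f} million tokens")
```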
Then the next thing you do: things that you don’t need right away, things that don’t need to be in context, you do compacting on. How does compacting work? Instead of just throwing out the stuff you don’t need, you kind of send it to the hippocampus to be sorted — either it gets tagged as high salience and you need to remember it, or it fades away on a fairly predictable curve, the classic forgetting curve. And that’s good enough to give you what feels like seamless recall of your day.
But the problem is, just like with GPT, this is not quite real learning. It’s in-context learning, but it’s not getting baked into the weights. It’s not getting fully integrated into the rest of your epistemology, the rest of your knowledge. This is an approach that doesn’t really fully scale. So while you’re asleep, you take those memories that have made it from short-term memory into the hippocampus, and you migrate them into long-term memory by training the cortex with them — training the prefrontal cortex.
And when you do this, it’s slow. We can actually watch this: we happen to know that the hippocampus will send the same memory over and over and over to learn all the crap from it. What that implies is that if you had to do this in real time, it would be unacceptably slow, in the same way that GPT weight updates are unacceptably slow during inference. The way you fix it is by amortizing — you schedule the updates for later, and you do some form of active learning to decide what things to offload from the hippocampus into long-term memory. There is no trick for fast learning. The same slow updates in GPT weights are the same slow updates in human weights. The trick is just that you don’t notice them because you’re mostly updating while you’re asleep. The things you do in the meantime are stopgaps — the human brain architecture equivalent of things like RAG, like vector retrieval.
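A sketch of that amortization pattern (a salience-tagged buffer during the day, expensive consolidation deferred to a "sleep" phase); the class names, salience scores, and replay budget are all invented for illustration:

```python
import heapq

class Agent:
    def __init__(self):
        self.buffer = []     # (negative salience, experience): the hippocampus stand-in
        self.long_term = []  # stand-in for slow weight updates in cortex

    def experience(self, event, salience):
        heapq.heappush(self.buffer, (-salience, event))  # cheap, done online

    def sleep(self, replay_budget=3):
        # Expensive consolidation happens offline; in the real system each
        # memory would be replayed many times rather than copied over once.
        for _ in range(min(replay_budget, len(self.buffer))):
            _, event = heapq.heappop(self.buffer)
            self.long_term.append(event)
        self.buffer.clear()  # whatever didn't make the cut fades (the forgetting curve)

agent = Agent()
for event, salience in [("burnt toast", 0.2), ("near miss in traffic", 0.9),
                        ("new coworker's name", 0.6), ("boring meeting", 0.1)]:
    agent.experience(event, salience)
agent.sleep()
print(agent.long_term)  # the high-salience items made it into the "weights"
```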
The hippocampus, by the way, actually does something more complicated than simple vector retrieval. It’s closer to something like: you give the hippocampus a query, it takes your memories and synthesizes them into an implied future state, and then prompts the prefrontal cortex with it in order to get the prefrontal cortex to do something like tree search to find a path that moves the agent to that outcome. This prompt also just happens to come with the relevant memories you queried for.
And if you ask what algorithm the hippocampus implements — we actually happen to know this one. The hippocampus is trained through next-token prediction, like GPT. It is trained using dopamine reward tagging, and based on the strength of the reward tagging and emotional tagging in memories, it learns to predict future reward tokens in streams of experience. Interestingly, my understanding is that the hippocampus is one of the only networks trained with next-token prediction.
The longer I think about it, the more it makes sense. When I was thinking about how you’d make a memory system with good sparse indexing, I kept concluding that realistically you need the hippocampus to perform some kind of generally intelligent behavior in order to make a really good index — it needs contextual intelligence to understand “this is the kind of thing you would recall later.” When I thought about how to do that with an AI agent, I just ended up concluding that the easiest thing would be to have GPT write tags for the memories, because you just want to apply your full general intelligence to it. Well, if that’s just the easiest way to do it, it would make total sense for the hippocampus to be trained with next-token prediction.
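A sketch of that "let the general model index its own memories" idea; `complete` is a stand-in for whatever LLM call you have available, and the prompt, tags, and storage scheme are invented for illustration:

```python
def tag_memory(complete, memory_text):
    prompt = ("List 3-5 short comma-separated tags a future version of me would "
              f"plausibly search for when this memory is relevant:\n\n{memory_text}\n\nTags:")
    tags = [t.strip().lower() for t in complete(prompt).split(",")]
    return {"text": memory_text, "tags": tags}

index = []

def store(complete, memory_text):
    index.append(tag_memory(complete, memory_text))

def recall(query_tag):
    return [m["text"] for m in index if query_tag.lower() in m["tags"]]

# Stub standing in for a real model call so the sketch runs end to end.
def fake_complete(prompt):
    return "groceries, fridge, errands"

store(fake_complete, "Went downstairs and the milk in the fridge was expired.")
print(recall("fridge"))
```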
Does that help you with AI alignment? Not really, not very much. But if you were to take apart the other regions of the brain, it’s like: mono-modal audio encoder. You look at something like the posterior superior temporal sulcus, and if you read about it and look at what gets damaged when it’s lesioned, what it’s hooked up to, what other regions it projects into and what projects into it — you can really easily point at these and say, “oh, that’s a multimodal video encoder.” By the way, the video encoder in humans is one of the unique parts of the human brain. You have a very big prefrontal cortex and a seemingly unique video encoder. Other animals like rats seem to have an image encoder — something like a latent CLIP — but not a video encoder. Interesting to think about how that works.
Again, these parts are not like — look, I just don’t understand what you expect to find. Of course it’s made out of stuff. What else, how else would it work? Of course there’s a part where you have an encoder and then you train another network in the latent space of that model. Well, if that’s how you organize things — and of course that’s how you organize things, duh, that’s the most efficient way to organize a brain. The thing with latent diffusion is that it turns out to be very efficient to organize your diffusion model in the latent space of a different model. I would not be surprised if the brain uses that kind of trick all the time and that the default is to train models in the latent space of another model where possible.
So it’s not just a CLIP, it’s a latent CLIP. You have raw inputs, those get encoded, then you have a model that takes the encoded versions and does further processing to make a multimodal encoding, which is then passed on to some other network that eventually gets projected into Wernicke’s area, and so on. The things you would find if you took apart the brain into separate regions — I think it’s a quite legible architecture if you just interpret it as a series of relatively general-purpose networks wired together to be trained in the latent space of other networks.
And the trick is that there is no trick. The way “general intelligence” works is that you are a narrow intelligence with limited out-of-distribution generalization, and this is obscured from you by the fact that while you are asleep, your brain is rearranging itself to try to meet whatever challenges it thinks you’re going to face the next day.
This is why, for example, if you’re trying to learn a really motor-heavy action video game, like a really intense first-person shooter, and you’re drilling the button sequences over and over and it’s just not clicking — and then you go to sleep, do memory consolidation, wake up, and suddenly you’re nailing it. What’s actually going on is that the motor actions that were previously too slow, too conscious, not quite clicking as in-context learning — the brain said “this needs to be a real weight update” and prioritized moving those to the front of the queue. Now they’re actually in the prefrontal cortex as motor programs that can be executed immediately and are integrated into the rest of the intuitive motor knowledge. You’re not magically generalizing out of distribution. You updated your weights. You generalized out of distribution by updating the model. I know, incredible concept. But there it is.
EDIT: Viktor Glushkov apparently did not invent genetic algorithms, but did early precursor work to them as an approach. And people act like LLM confabulations aren't a thing humans do. :p
I stand by my basic point in Varieties Of Doom that these models don’t plan very much yet, and as soon as we start having them do planning and acting over longer time horizons we’ll see natural instrumental behavior emerge. We could also see this emerge from e.g. continuous learning that lets them hold a plan in implicit context over much longer action trajectories.
What baby gate protects you from Claude subtly misspecifying all your unit tests? If you have to carefully check them all, that negates the benefit of automating the work. This applies to most complex intellectual work, e.g. literature review. It's kind of like saying "what if you just had a general baby gate, so that people never have to grow up?" Well, they don't really make baby gates like that, or at least the people you have to baby gate like that are not economically productive.
More generally, if you want an autonomous agent it must be self monitoring and self evaluating. Humans, or at least the kind of humans you want as employees, do not need to be carefully externally vetted for each thing they do to ensure they do it properly. Rewards coming from the environment, as they do in most formal RL models, is an academic convenience. An actually autonomous agent has to be able to ontologize reward over the computable environment in a general way that doesn’t require some other mind to come in and correct it all the time. If you don’t have that, you’re not getting meaningful autonomy.
Apparently not even Xi thinks it’s a good idea!
https://www.ft.com/content/c4e81a67-cd5b-48b4-9749-92ecf116313d
I think the Internet has in fact been a prelude to the attitudes adaptive for the martial shifts, but mostly because the failure of e.g. social media to produce good discourse has revealed that a lot of naive implicit models about democratization being good have been falsified. Democracy in fact turns out to be bad, giving people what they want turns out to be bad. I expect the elite class in democratic republics to get spitefully misanthropic because they are forced to live with the consequences of normal people's decisions in a way that e.g. Chinese elites aren't.
Of course, LLMs will help with cyber defense as well. But even if the offense-defense balance from AI favors defense, that won’t matter in the short term! As Bruce Schneier pointed out, the red team will take the lead.
Did he point that out? I agree to be clear, and I would expect Schneier to agree because he’s a smart dude, but I scanned this article several times and even did a full read through and I don’t see where he says that he expects offense to overtake defense in the short term.
This is in principle a thing that Nick Bostrom could have believed while writing Superintelligence but the rest of the book kind of makes it incompatible with Occam’s Razor. It’s possible he meant the issues with translating concepts into discrete program representations as the central difficulty and then whether we would be able to make use of such a representation as a noncentral difficulty. (It’s Bostrom, he’s a pretty smart dude, this wouldn’t surprise me, it might even be in the text somewhere but I’m not reading the whole thing again). But even if that’s the case the central consistently repeated version of the value loading problem in Bostrom 2014 centers on how it’s simply not rigorously imaginable how you would get the relevant representations in the first place.
It’s important to remember also that Bostrom’s primary hypothesis in Superintelligence is that AGI will be produced by recursive self improvement such that it’s genuinely not clear you will have a series of functional non superintelligent AIs with usable representations before you have a superintelligent one. The book very much takes the EY “human level is a weird threshold to expect AI progress to stop at” thesis as the default.
Clearly! I’m a little reluctant to rephrase it until I have a version that I know conveys what I actually meant, but one that would be very semantically close to the original would be:
“—Contra Bostrom 2014 it is possible to get high quality, nuanced representations of concepts like “happiness” at training initialization. The problem of representing happiness and similar ideas in a computer will not be first solved by the world model of a superintelligent or otherwise incorrigible AI, as in the example Bostrom gives on page 147 of the 2017 paperback under the section “Malignant Failure Modes”: “But wait! This is not what we meant! Surely if the AI is superintelligent, it must understand that when we asked it to make us happy, we didn’t mean that it should reduce us to a perpetually repeating recording of a drugged-out digitized mental episode!”—The AI may indeed understand that this is not what we meant. However, its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal.”
Part of why I didn’t write it that way in the first place is it would make it a lot bulkier than the other bullet points, so I trimmed it down.
I want to flag that thinking you have a representation that could be used in principle to do the right thing is not the same thing as believing it will “Just Work”. If you do a naive RL process on neural embeddings or LLMs evaluators you will definitely get bad results. I do not believe in “alignment by default” and push back on such things frequently whenever they’re brought up. What has happened is that the problem has gone from “not clear how you would do this even in principle, basically literally impossible with current knowledge” to merely tricky.
Let’s think phrase by phrase and analyze myself in the third person.
First let’s extract the two sentences for comparison:
JDP: Contra Bostrom 2014 AIs will in fact probably understand what we mean by the goals we give them before they are superintelligent.
Bostrom: The AI may indeed understand that this is not what we meant. However, its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal.
An argument from ethos: JDP is an extremely scrupulous author and would not plainly contradict himself in the same sentence. Therefore this is either a typo or my first interpretation is wrong somehow.
Context: JDP has clarified it is not a typo.
Modus Tollens: If “understand” means the same thing in both sentences they would be in contradiction. Therefore understand must mean something different between them.
Context: After Bostrom’s statement about understanding, he says that the AI’s final goal is to make us happy, not to do what the programmers meant.
Association: The phrase “not to do what the programmers meant” is the only other thing that JDP’s instance of the word “understand” could be bound to in the text given.
Context: JDP says “before they are superintelligent”, which doesn’t seem to have a clear referent in the Bostrom quote given. Whatever he’s talking about must appear in the full passage, and I should probably look that up before commenting, and maybe point out that he hasn’t given quite enough context in that bullet and may want to consider rephrasing it.
Reference: Ah I see, JDP has posted the full thing into this thread. I now see that the relevant section starts with:
“But wait! This is not what we meant! Surely if the AI is superintelligent, it must understand that when we asked it to make us happy, we didn’t mean that it should reduce us to a perpetually repeating recording of a drugged-out digitized mental episode!”
Association: Bostrom uses the frame “understand” in the original text for the question from his imagined reader. This implies that JDP saying “AIs will probably understand what we mean” must be in relation to this question.
Modus Tollens: But wait, Bostrom already answers this question by saying the AI will understand but not care, and JDP quotes this, so if JDP meant the same thing Bostrom means he would be contradicting himself, which we assume he is not doing, therefore he must be interpreting this question differently.
Inference: JDP is probably answering the original hypothetical reader's question as "Why wouldn't the AI behave as though it understands? Or why wouldn't the AI's motivation system understand what we meant by the goal?"
Context: Bostrom answers (implicitly) that this is because the AI’s epistemology is developed later than its motivation system. By the time the AI is in a position to understand this its goal slot is fixed.
Association: JDP says that subsequent developments have disproved this answer's validity. So JDP believes either that the goal slot will not be fixed at superintelligence or that the epistemology does not have to be developed later than the motivation system.
Modus Tollens: If JDP said that the goal slot will not be fixed at superintelligence, he would be wrong, therefore since we are assuming JDP is not wrong this is not what he means.
Context: JDP also says “before superintelligence”, implying he agrees with Bostrom that the goal slot is fixed by the time the AI system is superintelligent.
Process of Elimination: Therefore JDP means that the epistemology does not have to be developed later than the motivation system.
Modus Tollens: But wait. Logically the final superintelligent epistemology must be developed alongside the superintelligence if we’re using neural gradient methods. Therefore since we are assuming JDP is not wrong this must not quite be what he means.
Occam’s Razor: Theoretically it could be made of different models, one of which is a superintelligent epistemology, but epistemology is made of parts and the full system is presumably necessary to be “superintelligent”.
Context: JDP says that “AIs will in fact probably understand what we mean by the goals we give them before they are superintelligent”, this implies the existence of non superintelligent epistemologies which understand what we mean.
Inference: If there are non superintelligent epistemologies which are sufficient to understand us, and JDP believes that the motivation system can be made to understand us before we develop a superintelligent epistemology, then JDP must mean that Bostrom is wrong because there are or will be sufficient neural representations of our goals that can be used to specify the goal slot before we develop the superintelligent epistemology.
This is correct, though that particular chain of logic doesn't actually imply the "before superintelligence" part, since there is a space between embryo and superintelligent where it could theoretically come to understand. I argue, in the 13 steps above, why I think Bostrom implicitly rejects this or thinks it must be irrelevant. But I think it's important context that to me this doesn't come out as 13 steps or a bunch of sys2 reasoning; I just look at the thing and see the implication, and then have to do a bunch of sys2 reasoning to articulate it if someone asks. To me it doesn't feel like a hard thing from the inside, so I wouldn't expect it to be hard for someone else either. From my perspective it basically came across as bad faith, because I literally could not imagine that someone wouldn't understand what I'm talking about until several people went "no, I don't get it"; that's how basic it feels from the inside here. I now understand that no, this actually isn't obvious, and the hostile tone above was frustration from not knowing that yet.
Describing it as a “misunderstanding” is tantamount to saying that if you make a syntax error when writing some code, the proper way to describe it is the computer “misunderstanding” you.
Honestly, maybe it would make more sense to say that the cognitive error here is using a compiler for a context-free grammar as the reference class for your intuitions, as opposed to a mind that understands natural language. The former is not expected to understand you when what you say doesn't fully match what you mean; the latter very much is, and the latter is the only kind of thing that's going to have the proper referents for concepts like "happiness".
ChatGPT still thinks I am wrong, so let’s think step by step. Bostrom says (i.e. leads the reader to understand through his gestalt speech, not that he literally says this in one passage) that, in the default case:
When you specify your final goal, it is wrong.
It is wrong because it is a discrete program representation of a nuanced concept like “happiness” that does not fully capture what we think happiness is.
Eventually you will have a world model with a correct understanding of happiness, because the AI is superintelligent.
This representation of happiness in the superintelligent world model “understands us” and would presumably produce better results if we could point at that understanding instead.
The fact we don’t do this to begin with heavily implies, almost as a necessary consequence really, that the representation of happiness which is a correct understanding of what we meant was not available at the time we specified what happiness is.
In a way all I am saying is that when you specify the program that will train your superintelligent AI, in Bostrom 2014 the AI’s superintelligent understanding is not available before you train it.
The final goal representation is part of the program that you write before the AI exists.
If you had a non superintelligent corrigible AI that builds a world model with a correct specification of happiness in it, you would use that specification.
If you had a correct specification of happiness, it would not be wrong.
Therefore Bostrom does not expect us to do this, because then the default would not be that your specification is wrong. Bostrom expects by default that our specification is wrong.
If Bostrom does not expect us to do this, that implies he does not expect us to build an AI that builds a correct representation of happiness until it is incorrigible or otherwise not able to be used to specify happiness for our superintelligent AI.
The default way an AI becomes incorrigible is by becoming more powerful than us.
Therefore Bostrom expects we will not have an AI that correctly understands concepts like happiness until after it is already superintelligent.
Claude says:
Habryka is right here. The bullet point misrepresents Bostrom’s position.
The bullet says “Contra Bostrom 2014 AIs will in fact probably understand what we mean by the goals we give them before they are superintelligent”—presented as correcting something Bostrom got wrong. But Bostrom’s actual quote explicitly says the AI does understand what we meant (“The AI may indeed understand that this is not what we meant”). The problem in Bostrom’s framing isn’t lack of understanding, it’s misalignment between what we meant and what we coded.
Gemini 3 says similar:
Analysis
Habryka is technically correct regarding the text. Bostrom’s “Orthogonality Thesis” specifically separates intelligence (understanding) from goals (motivation). Bostrom explicitly argued that a superintelligence could have perfect understanding of human culture and intentions but still be motivated solely to maximize paperclips if that is what its utility function dictates. The failure mode Bostrom describes is not “oops, I misunderstood you,” but “I understood you perfectly, but my utility function rewards literal obedience, not intended meaning.”
I will take this to mean you share similar flawed generalization/reading strategies. I struggle to put the cognitive error here into words, but it seems to me like an inability to connect the act of specifying a wrong representation of utility with the phrase ‘lack of understanding’, or an odd literalist interpretation whereby the fact that Bostrom argues in general for a separation between motivations and intelligence (the orthogonality thesis) means that I am somehow misinterpreting him when I say that the mesagoal inferred from the objective function, before understanding of language, is a “misunderstanding” of the intent of the objective function. This is a very strange and very pedantic use of “understand”. “Oh but you see, Bostrom is saying that the thing you actually wrote means this, which it understood perfectly.”
No.
If I say something by which I clearly mean one thing, and that thing was in principle straightforwardly inferrable from what I said (as is occurring right now), and the thing which is inferred instead is straightforwardly absurd by the norms of language and society, that is called a misunderstanding, a failure to understand. If you specify a wrong, incomplete objective to the AI and it internalizes that wrong, incomplete objective as opposed to what you meant, then it (the training/AI building system as a whole) misunderstood you, even if it understands your code to represent the goal just fine. That is to say, you want some way for the AI or AI building system to understand, by which we mean correctly infer the meaning and indirect consequences of the meaning of, what you wrote at initialization; you want it to infer the correct goal at the point where a mesagoal is internalized. This process can be rightfully called UNDERSTANDING, and when an AI system fails at this it has FAILED TO UNDERSTAND YOU at the point in time which mattered, even if later there is some epistemology that understands in principle what was meant by the goal but is motivated by the mistaken version it internalized when the mesagoal was formed.
But also, as I said earlier, Bostrom states this many times; we have a lot more to go off of than the one line I quoted there. Here he is on page 171, in the section “Motivation Selection Methods”:
Problems for the direct consequentialist approach are similar to those for the direct rule-based approach. This is true even if the AI is intended to serve some apparently simple purpose such as implementing a version of classical utilitarianism. For instance, the goal “Maximize the expectation of the balance of pleasure over pain in the world” may appear simple. Yet expressing it in computer code would involve, among other things, specifying how to recognize pleasure and pain. Doing this reliably might require solving an array of persistent problems in the philosophy of mind—even just to obtain a correct account expressed in a natural language, an account which would then, somehow, have to be translated into a programming language.
A small error in either the philosophical account or its translation into code could have catastrophic consequences. Consider an AI that has hedonism as its final goal, and which would therefore like to tile the universe with “hedonium” (matter organized in a configuration that is optimal for the generation of pleasurable experience). To this end, the AI might produce computronium (matter organized in a configuration that is optimal for computation) and use it to implement digital minds in states of euphoria. In order to maximize efficiency, the AI omits from the implementation any mental faculties that are not essential for the experience of pleasure, and exploits any computational shortcuts that according to its definition of pleasure do not vitiate the generation of pleasure. For instance, the AI might confine its simulation to reward circuitry, eliding faculties such as memory, sensory perception, executive function, and language; it might simulate minds at a relatively coarse-grained level of functionality, omitting lower-level neuronal processes; it might replace commonly repeated computations with calls to a lookup table; or it might put in place some arrangement whereby multiple minds would share most parts of their underlying computational machinery (their “supervenience bases” in philosophical parlance). Such tricks could greatly increase the quantity of pleasure producible…
This part makes it very clear that what Bostrom means by “code” is, centrally, some discrete program representation (i.e. a traditional programming language, like python, as opposed to some continuous program representation like a neural net embedding).
Bostrom expands on this point on page 227 in the section “The Value-Loading Problem”:
We can use this framework of a utility-maximizing agent to consider the predicament of a future seed-AI programmer who intends to solve the control problem by endowing the AI with a final goal that corresponds to some plausible human notion of a worthwhile outcome. The programmer has some particular human value in mind that he would like the AI to promote. To be concrete, let us say that it is happiness. (Similar issues would arise if we were interested in justice, freedom, glory, human rights, democracy, ecological balance, or self-development.) In terms of the expected utility framework, the programmer is thus looking for a utility function that assigns utility to possible worlds in proportion to the amount of happiness they contain. But how could he express such a utility function in computer code? Computer languages do not contain terms such as “happiness” as primitives. If such a term is to be used, it must first be defined. It is not enough to define it in terms of other high-level human concepts—“happiness is enjoyment of the potentialities inherent in our human nature” or some such philosophical paraphrase. The definition must bottom out in terms that appear in the AI’s programming language, and ultimately in primitives such as mathematical operators and addresses pointing to the contents of individual memory registers. When one considers the problem from this perspective, one can begin to appreciate the difficulty of the programmer’s task.
Here Bostrom is saying that it is not even rigorously imaginable how you would translate the concept of “happiness” into discrete program code. In 2014, when the book was published, that was correct: it wasn’t rigorously imaginable. That’s why being able to pretrain neural nets which understand the concept, in the kind of way where they simply wouldn’t make mistakes like “tile the universe with smiley faces”, and which can be used as part of a goal specification, is a big deal.
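To make the distinction concrete, here is a minimal toy sketch of my own (nothing from Bostrom or from the post; the function names, the hash-seeded 512-dimensional stub encoder, and the smiley-face proxy are all made up for illustration). The first objective bottoms out in discrete primitives the programmer writes by hand; the second points at a learned, continuous representation of the concept, which only becomes available once you can pretrain a model that has the concept at all. In a real system `encode` would be a pretrained neural net rather than a stub.

```python
import hashlib
import numpy as np

# Hypothetical sketch: discrete hand-coded goal vs. goal pointed at a learned
# (continuous) representation of the concept. Names and numbers are made up.

# --- Discrete specification: the goal must bottom out in primitives the
# --- programmer can write down, so any simplification is a wrong proxy.
def handcoded_happiness_score(world_state: dict) -> float:
    # e.g. "count the smiling faces" -- exactly the kind of proxy that gets
    # perversely instantiated as "tile the universe with smiley faces".
    return float(world_state.get("smiling_faces", 0))

# --- Continuous specification: score outcomes by similarity to the model's
# --- own representation of the concept rather than a hand-written predicate.
def encode(text: str) -> np.ndarray:
    # Stand-in for a pretrained encoder: a deterministic unit vector derived
    # from a hash, so the sketch runs without any model weights.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "little")
    rng = np.random.default_rng(seed)
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

concept_of_happiness = encode("the people involved are genuinely happy")

def learned_happiness_score(world_description: str) -> float:
    # With a real pretrained encoder this similarity would reflect meaning;
    # with the stub above the number is arbitrary.
    return float(encode(world_description) @ concept_of_happiness)
```

The point of the sketch is only where the representation lives: in the first case the concept has to be spelled out in code the programmer writes before training, in the second the goal specification can defer to a representation learned from data.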
With this in mind let’s return to the section I quoted the line in my post from, which says:
Defining a final goal in terms of human expressions of satisfaction or approval does not seem promising. Let us bypass the behaviorism and specify a final goal that refers directly to a positive phenomenal state, such as happiness or subjective well-being. This suggestion requires that the programmers are able to define a computational representation of the concept of happiness in the seed AI. This is itself a difficult problem, but we set it to one side for now (we will return to it in Chapter 12). Let us suppose that the programmers can somehow get the AI to have the goal of making us happy. We then get:
Final goal: “Make us happy”
Perverse instantiation: Implant electrodes into the pleasure centers of our brains
The perverse instantiations we mention are only meant as illustrations. There may be other ways of perversely instantiating the stated final goal, ways that enable a greater degree of realization of the goal and which are therefore preferred (by the agent whose final goals they are—not by the programmers who gave the agent these goals). For example, if the goal is to maximize our pleasure, then the electrode method is relatively inefficient. A more plausible way would start with the superintelligence “uploading” our minds to a computer (through high-fidelity brain emulation). The AI could then administer the digital equivalent of a drug to make us ecstatically happy and record a one-minute episode of the resulting experience. It could then put this bliss loop on perpetual repeat and run it on fast computers. Provided that the resulting digital minds counted as “us,” this outcome would give us much more pleasure than electrodes implanted in biological brains, and would therefore be preferred by an AI with the stated final goal.
“But wait! This is not what we meant! Surely if the AI is superintelligent, it must understand that when we asked it to make us happy, we didn’t mean that it should reduce us to a perpetually repeating recording of a drugged-out digitized mental episode!”—The AI may indeed understand that this is not what we meant. However, its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal. Therefore, the AI will care about what we meant only instrumentally. For instance, the AI might place an instrumental value on…
What Bostrom is saying is that one of, if not the first, impossible problem(s) you encounter is having any angle of attack on representing our goals inside the computer in the kind of way which generalizes even at a human level, such that you can point an optimization process at it. Obviously a superintelligent AI would understand what we had meant by the initial objective, but it’s going to proceed according to either the mesagoal it internalizes or the literal code sitting in its objective function slot, because the part of the AI which motivates it is not controlled by the part of the AI, developed later in training, which understands what you meant in principle after acquiring language. The system which translates your words or ideas into the motivation specification must understand you at the point where you turn that translated concept into an optimization objective, i.e. at the start of training or at some point where the AI is still corrigible and you can therefore insert objectives and training goals into it.
Your bullet point says nothing about corrigibility.
My post says that a superintelligent AI is a superplanner which develops instrumental goals by planning far into the future. The more intelligent the AI is, the farther into the future it can effectively plan, and therefore the less corrigible it is. Therefore by the time you encounter this bullet point it should already be implied that superintelligence and the corrigibility of the AI are tightly coupled, which is also an assumption clearly made in Bostrom 2014, so I don’t really understand why you don’t understand.
Reinforcement learning is not the same kind of thing as pretraining because it involves training on your own randomly sampled rollouts, and RL is, generally speaking, more self-reinforcing and biased than other neural net training methods. It’s more likely to get stuck in local maxima (it’s infamous for getting stuck in local maxima, in fact) and doesn’t have quite the same convergence properties as pretraining on a giant dataset.
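As a toy sketch of why that matters (my own illustration; the three-armed bandit, the reward numbers, and the learning rates are all made up), compare a pretraining-style update, whose gradient comes from a fixed target distribution, with a REINFORCE-style update, whose gradient only ever comes from the actions the current policy happens to sample. The second loop's training signal is a function of the model's own behavior, which is where the self-reinforcement and the pull toward local maxima come from.

```python
import numpy as np

rng = np.random.default_rng(0)
REWARD = np.array([1.0, 0.9, 0.1])      # arm 0 is best, arm 1 nearly as good

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Pretraining-style: gradient descent toward a fixed target distribution.
# The signal does not depend on what the model currently prefers.
target = np.array([0.8, 0.15, 0.05])
logits_pt = np.zeros(3)
for _ in range(2000):
    logits_pt += 0.1 * (target - softmax(logits_pt))   # cross-entropy gradient

# RL-style (REINFORCE, no baseline): the model samples its own "rollouts",
# so arms it rarely samples rarely get updated and early preferences compound.
logits_rl = np.array([0.0, 2.0, 0.0])   # mild initial bias toward arm 1
sample_counts = np.zeros(3)
for _ in range(2000):
    probs = softmax(logits_rl)
    a = rng.choice(3, p=probs)
    sample_counts[a] += 1
    grad = -probs
    grad[a] += 1.0                       # d log pi(a) / d logits
    logits_rl += 0.1 * REWARD[a] * grad  # reinforce whatever was sampled

print("pretraining-style probs:", softmax(logits_pt).round(3))
print("RL-style probs:         ", softmax(logits_rl).round(3))
print("RL samples per arm:     ", sample_counts)
```

The specific numbers aren't the point; the structural difference is that the fixed-data loop keeps getting told about the whole target distribution, while the rollout loop only learns about whatever it still samples.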
For what it’s worth, as someone who has spent a lot of time doing literary criticism of LLM outputs, see e.g.:
https://minihf.com/posts/2025-06-07-commentary-on-janus-prophecies/
I think there’s a selection effect where the people who “go crazy” (or um, go crazy) from talking to LLMs are much more likely to be loud about it than the people who do it and wind up not crazy. I agree that it’s easy to take LLM outputs as being more meaningful and important than they are, but for me at least this was a temporary thing that eventually wore off after I would chase down what seemed like a lead and wind up disappointed. Ultimately my advice for avoiding this would be similar to my advice on navigating the period of the LessWrong diaspora when postrat was popular: Insist that insight be concrete, specific, and actionable. Text that makes you feel things is not (necessarily) text that lets you do things. It’s much easier to produce a text that makes you feel something new, or that sounds like it’s an insight, or that has the cadence and rhythm of insight which is nevertheless not useful. An example from LLaMa 2 70B:
This text is intelligent in that a Markov chain could not write it, and it convincingly has the rhythm and cadence of insight; it sounds like how I might write when I’m writing down a chain of thought. But it’s ultimately not specific enough to be useful, and might even be nonsense. Nevertheless I spent quite a few hours over various sessions coming back to this passage and pondering its potential meaning, using it as a prompt for inspiration. None of the ideas it ever inspired were useful, so I conclude the passage is not useful. I think there’s a certain amount of this that’s just inevitable while you calibrate yourself on what kinds of things the machine says are or are not useful, what kinds of things are or are not meaningful, etc. But this is very similar to the postrat case, where there were a lot of gurus walking around saying a lot of things of varying levels of usefulness, but the useful things were almost always specific, concrete, and actionable (e.g. “focus on what you want to see more of”). I think on some level this is disappointing, because insisting things be that way usually reveals how little low hanging fruit is left for you to pick on whatever you’re thinking about, and the esoteric (falsely) promises the possibility of an angle or idea you haven’t thought of yet. Most things which produce the feeling of insight, or have the rhythm and cadence of insight, are wrong.
But some are useful. Hopefully if you spend enough time trying you’ll be able to tell the difference.