I think the Internet has in fact been a prelude to the attitudes adaptive for the martial shifts, but mostly because the failure of e.g. social media to produce good discourse has revealed that a lot of naive implicit models of democratization being good have been falsified. Democracy in fact turns out to be bad; giving people what they want turns out to be bad. I expect the elite class in democratic republics to get spitefully misanthropic because they are forced to live with the consequences of normal people’s decisions in a way that e.g. Chinese elites aren’t.
jdp
Of course, LLMs will help with cyber defense as well. But even if the offense-defense balance from AI favors defense, that won’t matter in the short term! As Bruce Schneier pointed out, the red team will take the lead.
Did he point that out? I agree, to be clear, and I would expect Schneier to agree because he’s a smart dude, but I scanned this article several times and even did a full read-through, and I don’t see where he says that he expects offense to overtake defense in the short term.
This is in principle a thing that Nick Bostrom could have believed while writing Superintelligence, but the rest of the book kind of makes it incompatible with Occam’s Razor. It’s possible he meant the issues with translating concepts into discrete program representations as the central difficulty, and whether we would be able to make use of such a representation as a noncentral difficulty. (It’s Bostrom, he’s a pretty smart dude, this wouldn’t surprise me; it might even be in the text somewhere, but I’m not reading the whole thing again.) But even if that’s the case, the central, consistently repeated version of the value loading problem in Bostrom 2014 centers on the fact that it’s simply not rigorously imaginable how you would get the relevant representations in the first place.
It’s important to remember also that Bostrom’s primary hypothesis in Superintelligence is that AGI will be produced by recursive self improvement such that it’s genuinely not clear you will have a series of functional non superintelligent AIs with usable representations before you have a superintelligent one. The book very much takes the EY “human level is a weird threshold to expect AI progress to stop at” thesis as the default.
Clearly! I’m a little reluctant to rephrase it until I have a version that I know conveys what I actually meant, but one that would be very semantically close to the original would be:
“—Contra Bostrom 2014 it is possible to get high quality, nuanced representations of concepts like “happiness” at training initialization. The problem of representing happiness and similar ideas in a computer will not be first solved by the world model of a superintelligent or otherwise incorrigible AI, as in the example Bostrom gives on page 147 in the 2017 paperback under the section “Malignant Failure Modes”: “But wait! This is not what we meant! Surely if the AI is superintelligent, it must understand that when we asked it to make us happy, we didn’t mean that it should reduce us to a perpetually repeating recording of a drugged-out digitized mental episode!”—The AI may indeed understand that this is not what we meant. However, its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal.”
Part of why I didn’t write it that way in the first place is it would make it a lot bulkier than the other bullet points, so I trimmed it down.
I want to flag that thinking you have a representation that could be used in principle to do the right thing is not the same thing as believing it will “Just Work”. If you do a naive RL process on neural embeddings or LLM evaluators you will definitely get bad results. I do not believe in “alignment by default” and push back on such things frequently whenever they’re brought up. What has happened is that the problem has gone from “not clear how you would do this even in principle, basically literally impossible with current knowledge” to merely tricky.
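To be concrete about the kind of failure I mean by naive RL against a learned evaluator, here is a toy of my own construction (the `proxy_score` function is a made-up stand-in for a neural evaluator of "happiness", and the greedy loop is a crude stand-in for an optimizer; nothing here is a real system):

```python
# Toy illustration only: optimizing hard against an imperfect learned proxy
# maximizes the proxy, not the concept the proxy was meant to capture.
WORDS = ["happy", "the", "dog", "played", "in", "sunshine", "sad", "with", "friends"]

def proxy_score(text: str) -> float:
    # Imperfect proxy: counts happiness-flavored words, knows nothing about
    # coherence or whether anyone is actually happy.
    return float(sum(text.split().count(w) for w in ("happy", "sunshine")))

text = ""
for _ in range(8):
    # Greedy "policy improvement" against the proxy score.
    best = max(WORDS, key=lambda w: proxy_score((text + " " + w).strip()))
    text = (text + " " + best).strip()

print(text)               # degenerate, proxy-maximizing output ("happy happy happy ...")
print(proxy_score(text))  # high score; nothing anyone meant by "happiness"
```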
Let’s think phrase by phrase and analyze myself in the third person.
First let’s extract the two sentences for comparison:
JDP: Contra Bostrom 2014 AIs will in fact probably understand what we mean by the goals we give them before they are superintelligent.
Bostrom: The AI may indeed understand that this is not what we meant. However, its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal.
An argument from ethos: JDP is an extremely scrupulous author and would not plainly contradict himself in the same sentence. Therefore this is either a typo or my first interpretation is wrong somehow.
Context: JDP has clarified it is not a typo.
Modus Tollens: If “understand” means the same thing in both sentences they would be in contradiction. Therefore understand must mean something different between them.
Context: After Bostrom’s statement about understanding, he says that the AI’s final goal is to make us happy, not to do what the programmers meant.
Association: The phrase “not to do what the programmers meant” is the only other thing that JDP’s instance of the word “understand” could be bound to in the text given.
Context: JDP says “before they are superintelligent”, which doesn’t seem to have a clear referent in the Bostrom quote given. Whatever he’s talking about must appear in the full passage, and I should probably look that up before commenting, and maybe point out that he hasn’t given quite enough context in that bullet and may want to consider rephrasing it.
Reference: Ah I see, JDP has posted the full thing into this thread. I now see that the relevant section starts with:
But wait! This is not what we meant! Surely if the AI is superintelligent, it must understand that when we asked it to make us happy, we didn’t mean that it should reduce us to a perpetually repeating recording of a drugged-out digitized mental episode!”
Association: Bostrom uses the frame “understand” in the original text for the question from his imagined reader. This implies that JDP saying “AIs will probably understand what we mean” must be in relation to this question.
Modus Tollens: But wait, Bostrom already answers this question by saying the AI will understand but not care, and JDP quotes this, so if JDP meant the same thing Bostrom means he would be contradicting himself, which we assume he is not doing, therefore he must be interpreting this question differently.
Inference: JDP is probably answering the original hypothetical reader’s question as “Why wouldn’t the AI behave as though it understands? Or why wouldn’t the AI’s motivation system understand what we meant by the goal?”
Context: Bostrom answers (implicitly) that this is because the AI’s epistemology is developed later than its motivation system. By the time the AI is in a position to understand this its goal slot is fixed.
Association: JDP says that subsequent developments have disproved this answer’s validity. So JDP believes either that the goal slot will not be fixed at superintelligence or that the epistemology does not have to be developed later than the motivation system.
Modus Tollens: If JDP said that the goal slot will not be fixed at superintelligence, he would be wrong, therefore since we are assuming JDP is not wrong this is not what he means.
Context: JDP also says “before superintelligence”, implying he agrees with Bostrom that the goal slot is fixed by the time the AI system is superintelligent.
Process of Elimination: Therefore JDP means that the epistemology does not have to be developed later than the motivation system.
Modus Tollens: But wait. Logically the final superintelligent epistemology must be developed alongside the superintelligence if we’re using neural gradient methods. Therefore since we are assuming JDP is not wrong this must not quite be what he means.
Occam’s Razor: Theoretically it could be made of different models, one of which is a superintelligent epistemology, but epistemology is made of parts and the full system is presumably necessary to be “superintelligent”.
Context: JDP says that “AIs will in fact probably understand what we mean by the goals we give them before they are superintelligent”, this implies the existence of non superintelligent epistemologies which understand what we mean.
Inference: If there are non superintelligent epistemologies which are sufficient to understand us, and JDP believes that the motivation system can be made to understand us before we develop a superintelligent epistemology, then JDP must mean that Bostrom is wrong because there are or will be sufficient neural representations of our goals that can be used to specify the goal slot before we develop the superintelligent epistemology.
This is correct, though that particular chain of logic doesn’t actually imply the “before superintelligence” part, since there is a space between embryo and superintelligent where it could theoretically come to understand. I argue why I think Bostrom implicitly rejects this or thinks it must be irrelevant with the 13 steps above. But I think it’s important context that to me this doesn’t come out as 13 steps or a bunch of sys2 reasoning: I just look at the thing and see the implication, and then have to do a bunch of sys2 reasoning to articulate it if someone asks. To me it doesn’t feel like a hard thing from the inside, so I wouldn’t expect it to be hard for someone else either. From my perspective it basically came across as bad faith, because I literally could not imagine someone wouldn’t understand what I’m talking about until several people went “no I don’t get it”; that’s how basic it feels from the inside here. I now understand that no, this actually isn’t obvious; the hostile tone above was frustration from not knowing that yet.
Describing it as a “misunderstanding” is tantamount to saying that if you make a syntax error when writing some code, the proper way to describe it is the computer “misunderstanding” you.
Honestly maybe it would make more sense to say that the cognitive error here is using a compiler for a context-free grammar as the reference class for your intuitions, as opposed to a mind that understands natural language. The former is not expected to understand you when what you say doesn’t fully match what you mean; the latter very much is, and the latter is the only kind of thing that’s going to have the proper referents for concepts like “happiness”.
ChatGPT still thinks I am wrong so let’s think step by step. Bostrom says (i.e. leads the reader to understand through his gestalt speech, not that he literally says this in one passage) that, in the default case:
When you specify your final goal, it is wrong.
It is wrong because it is a discrete program representation of a nuanced concept like “happiness” that does not fully capture what we think happiness is.
Eventually you will have a world model with a correct understanding of happiness, because the AI is superintelligent.
This representation of happiness in the superintelligent world model “understands us” and would presumably produce better results if we could point at that understanding instead.
The fact we don’t do this to begin with heavily implies, almost as a necessary consequence really, that the representation of happiness which is a correct understanding of what we meant was not available at the time we specified what happiness is.
In a way all I am saying is that when you specify the program that will train your superintelligent AI, in Bostrom 2014 the AI’s superintelligent understanding is not available before you train it.
The final goal representation is part of the program that you write before the AI exists.
If you had a non superintelligent corrigible AI that builds a world model with a correct specification of happiness in it, you would use that specification.
If you had a correct specification of happiness, it would not be wrong.
Therefore Bostrom does not expect us to do this, because then the default would not be that your specification is wrong. Bostrom expects by default that our specification is wrong.
If Bostrom does not expect us to do this, that implies he does not expect us to build an AI that builds a correct representation of happiness until it is incorrigible or otherwise not able to be used to specify happiness for our superintelligent AI.
The default way an AI becomes incorrigible is by becoming more powerful than us.
Therefore Bostrom expects we will not have an AI that correctly understands concepts like happiness until after it is already superintelligent.
Claude says:
Habryka is right here. The bullet point misrepresents Bostrom’s position.
The bullet says “Contra Bostrom 2014 AIs will in fact probably understand what we mean by the goals we give them before they are superintelligent”—presented as correcting something Bostrom got wrong. But Bostrom’s actual quote explicitly says the AI does understand what we meant (“The AI may indeed understand that this is not what we meant”). The problem in Bostrom’s framing isn’t lack of understanding, it’s misalignment between what we meant and what we coded.
Gemini 3 says similar:
Analysis
Habryka is technically correct regarding the text. Bostrom’s “Orthogonality Thesis” specifically separates intelligence (understanding) from goals (motivation). Bostrom explicitly argued that a superintelligence could have perfect understanding of human culture and intentions but still be motivated solely to maximize paperclips if that is what its utility function dictates. The failure mode Bostrom describes is not “oops, I misunderstood you,” but “I understood you perfectly, but my utility function rewards literal obedience, not intended meaning.”
I will take this to mean you share similar flawed generalization/reading strategies. I struggle to put the cognitive error here into words, but it seems to me like an inability to connect the act of specifying a wrong representation of utility with the phrase ‘lack of understanding’, or an odd literalist interpretation whereby the fact that Bostrom argues in general for a separation between motivations and intelligence (the orthogonality thesis) means that I am somehow misinterpreting him when I say that the mesagoal inferred from the objective function, before the system understands language, is a “misunderstanding” of the intent of the objective function. This is a very strange and very pedantic use of “understand”. “Oh but you see, Bostrom is saying that the thing you actually wrote means this, which it understood perfectly.”
No.
If I say something by which I clearly mean one thing, and that thing was in principle straightforwardly inferable from what I said (as is occurring right now), and the thing which is inferred instead is straightforwardly absurd by the norms of language and society, that is called a misunderstanding, a failure to understand. If you specify a wrong, incomplete objective to the AI and it internalizes the wrong, incomplete objective as opposed to what you meant, it (the training/AI building system as a whole) misunderstood you, even if it understands your code to represent the goal just fine. This is to say that you want some way for the AI or AI building system to understand, by which we mean correctly infer the meaning and indirect consequences of the meaning of what you wrote, at initialization: you want it to infer the correct goal at the point where a mesagoal is internalized. This process can be rightfully called UNDERSTANDING, and when an AI system fails at this it has FAILED TO UNDERSTAND YOU at the point in time which mattered, even if later there is some epistemology that understands in principle what was meant by the goal but is motivated by the mistaken version it internalized when the mesagoal was formed.
But also as I said earlier Bostrom states this many times, we have a lot more to go off than the one line I quoted there. Here he is on page 171 in the section “Motivation Selection Methods”:
Problems for the direct consequentialist approach are similar to those for the direct rule-based approach. This is true even if the AI is intended to serve some apparently simple purpose such as implementing a version of classical utilitarianism. For instance, the goal “Maximize the expectation of the balance of pleasure over pain in the world” may appear simple. Yet expressing it in computer code would involve, among other things, specifying how to recognize pleasure and pain. Doing this reliably might require solving an array of persistent problems in the philosophy of mind—even just to obtain a correct account expressed in a natural language, an account which would then, somehow, have to be translated into a programming language.
A small error in either the philosophical account or its translation into code could have catastrophic consequences. Consider an AI that has hedonism as its final goal, and which would therefore like to tile the universe with “hedonium” (matter organized in a configuration that is optimal for the generation of pleasurable experience). To this end, the AI might produce computronium (matter organized in a configuration that is optimal for computation) and use it to implement digital minds in states of euphoria. In order to maximize efficiency, the AI omits from the implementation any mental faculties that are not essential for the experience of pleasure, and exploits any computational shortcuts that according to its definition of pleasure do not vitiate the generation of pleasure. For instance, the AI might confine its simulation to reward circuitry, eliding faculties such as memory, sensory perception, executive function, and language; it might simulate minds at a relatively coarse-grained level of functionality, omitting lower-level neuronal processes; it might replace commonly repeated computations with calls to a lookup table; or it might put in place some arrangement whereby multiple minds would share most parts of their underlying computational machinery (their “supervenience bases” in philosophical parlance). Such tricks could greatly increase the quantity of pleasure producible
This part makes it very clear that what Bostrom means by “code” is, centrally, some discrete program representation (i.e. a traditional programming language, like python, as opposed to some continuous program representation like a neural net embedding).
Bostrom expands on this point on page 227 in the section “The Value-Loading Problem”:
We can use this framework of a utility-maximizing agent to consider the predicament of a future seed-AI programmer who intends to solve the control problem by endowing the AI with a final goal that corresponds to some plausible human notion of a worthwhile outcome. The programmer has some particular human value in mind that he would like the AI to promote. To be concrete, let us say that it is happiness. (Similar issues would arise if we were interested in justice, freedom, glory, human rights, democracy, ecological balance, or self-development.) In terms of the expected utility framework, the programmer is thus looking for a utility function that assigns utility to possible worlds in proportion to the amount of happiness they contain. But how could he express such a utility function in computer code? Computer languages do not contain terms such as “happiness” as primitives. If such a term is to be used, it must first be defined. It is not enough to define it in terms of other high-level human concepts—“happiness is enjoyment of the potentialities inherent in our human nature” or some such philosophical paraphrase. The definition must bottom out in terms that appear in the AI’s programming language, and ultimately in primitives such as mathematical operators and addresses pointing to the contents of individual memory registers. When one considers the problem from this perspective, one can begin to appreciate the difficulty of the programmer’s task.
Here Bostrom is saying that it is not even rigorously imaginable how you would translate the concept of “happiness” into discrete program code. Which in 2014, when the book was published, was correct: it was not rigorously imaginable. That’s why being able to pretrain neural nets which understand the concept in the kind of way where they simply wouldn’t make mistakes like “tile the universe with smiley faces”, and which can be used as part of a goal specification, is a big deal.
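To make “can be used as part of a goal specification” concrete, here is a sketch of the kind of thing that exists now and didn’t in 2014. The specific library and model (sentence-transformers, all-MiniLM-L6-v2) are just what I’d reach for to illustrate, and a small embedding model won’t reliably rank perverse instantiations correctly; the point is the shape of the move, not this particular scorer:

```python
# Sketch: score candidate outcomes against a learned, continuous representation
# of a concept like "happiness" instead of a hand-coded definition.
# Assumes sentence-transformers is installed; illustration only, not a proposed
# alignment scheme.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

concept = "People are genuinely happy: flourishing, healthy, and living meaningful lives."
candidates = [
    "Everyone has rich relationships, pursues projects they care about, and reports deep life satisfaction.",
    "Every human brain sits in a vat with electrodes stimulating its pleasure centers forever.",
    "The universe is tiled with tiny molecular smiley faces.",
]

concept_emb = model.encode(concept, convert_to_tensor=True)
for text in candidates:
    score = util.cos_sim(concept_emb, model.encode(text, convert_to_tensor=True)).item()
    print(f"{score:+.3f}  {text}")
# Note that no line of this defines happiness in terms of mathematical operators
# and memory registers; the representation was learned from data, which is the
# thing Bostrom could not assume.
```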
With this in mind let’s return to the section I quoted the line in my post from, which says:
Defining a final goal in terms of human expressions of satisfaction or approval does not seem promising. Let us bypass the behaviorism and specify a final goal that refers directly to a positive phenomenal state, such as happiness or subjective well-being. This suggestion requires that the programmers are able to define a computational representation of the concept of happiness in the seed AI. This is itself a difficult problem, but we set it to one side for now (we will return to it in Chapter 12). Let us suppose that the programmers can somehow get the AI to have the goal of making us happy. We then get:
Final goal: “Make us happy”
Perverse instantiation: Implant electrodes into the pleasure centers of our brains
The perverse instantiations we mention are only meant as illustrations. There may be other ways of perversely instantiating the stated final goal, ways that enable a greater degree of realization of the goal and which are therefore preferred (by the agent whose final goals they are—not by the programmers who gave the agent these goals). For example, if the goal is to maximize our pleasure, then the electrode method is relatively inefficient. A more plausible way would start with the superintelligence “uploading” our minds to a computer (through high-fidelity brain emulation). The AI could then administer the digital equivalent of a drug to make us ecstatically happy and record a one-minute episode of the resulting experience. It could then put this bliss loop on perpetual repeat and run it on fast computers. Provided that the resulting digital minds counted as “us,” this outcome would give us much more pleasure than electrodes implanted in biological brains, and would therefore be preferred by an AI with the stated final goal.
“But wait! This is not what we meant! Surely if the AI is superintelligent, it must understand that when we asked it to make us happy, we didn’t mean that it should reduce us to a perpetually repeating recording of a drugged-out digitized mental episode!”—The AI may indeed understand that this is not what we meant. However, its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal. Therefore, the AI will care about what we meant only instrumentally. For instance, the AI might place an instrumental value on
What Bostrom is saying is that one of, if not the, first impossible problem(s) you encounter is having any angle of attack on representing our goals inside the computer, in the kind of way which generalizes even at a human level, such that you can point an optimization process at it. Obviously a superintelligent AI would understand what we had meant by the initial objective, but it’s going to proceed according to either the mesagoal it internalizes or the literal code sitting in its objective function slot, because the part of the AI which motivates it is not controlled by the part of the AI, developed later in training, which understands what you meant in principle after acquiring language. The system which translates your words or ideas into the motivation specification must understand you at the point where you turned that translated concept into an optimization objective, at the start of training or at some point where the AI is still corrigible and you can therefore insert objectives and training goals into it.
Your bullet point says nothing about corrigibility.
My post says that a superintelligent AI is a superplanner which develops instrumental goals by planning far into the future. The more intelligent the AI is the farther into the future it can effectively plan, and therefore the less corrigible it is. Therefore by the time you encounter this bullet point it should already be implied that superintelligence and the corrigibility of the AI are tightly coupled, which is also an assumption clearly made in Bostrom 2014 so I don’t really understand why you don’t understand.
Reinforcement learning is not the same kind of thing as pretraining because it involves training on your own randomly sampled rollouts, and RL is generally speaking more self-reinforcing and biased than other neural net training methods. It’s more likely to get stuck in local maxima (it’s infamous for getting stuck in local maxima, in fact) and doesn’t have quite the same convergence properties as “pretraining on giant dataset”.
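For anyone who wants the difference spelled out mechanically, here is a tiny numpy toy of my own (made-up reward function, nothing to do with any real training setup): the supervised loop’s training data is fixed, while the REINFORCE loop’s training data is sampled from the current policy, which is exactly the self-reinforcing feedback that produces the local-maxima behavior.

```python
# Toy contrast between fitting a fixed dataset and RL on self-sampled rollouts.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# --- "Pretraining": fit logits to a fixed target distribution over 3 tokens.
target = np.array([0.1, 0.2, 0.7])
logits = np.zeros(3)
for _ in range(2000):
    p = softmax(logits)
    logits -= 0.1 * (p - target)        # gradient of cross-entropy w.r.t. logits
print("supervised fit recovers the data distribution:", softmax(logits).round(2))

# --- "RL": REINFORCE on rollouts sampled from the current policy.
# Action 0 pays a reliable 1.0; action 2 pays 1000.0 with probability 0.002
# (better in expectation), but the policy usually collapses onto action 0
# before it ever samples a jackpot. Exact behavior depends on the seed.
def reward(a):
    if a == 0:
        return 1.0
    if a == 2:
        return 1000.0 if rng.random() < 0.002 else 0.0
    return 0.0

logits = np.zeros(3)
for _ in range(2000):
    p = softmax(logits)
    a = rng.choice(3, p=p)              # the rollout is drawn from the policy itself
    grad = -p
    grad[a] += 1.0                      # d log pi(a) / d logits
    logits += 0.3 * reward(a) * grad    # vanilla REINFORCE, no baseline
print("RL policy after training:", softmax(logits).round(2))
```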
My understanding of this quote is that he means that by the time the AI is intelligent enough to understand speech describing the goal you have given it (and is therefore, by the unstated intuitions of old school RSI, superintelligent, since language acquisition comes late; the goal itself being a discrete program, again by the unstated intuitions of old RSI), it is already incorrigible. Really the “superintelligent” part is not the important part; it’s the incorrigible part that is important. Superintelligence is just a thing that means your goals become very hard to change by force, which contributes to incorrigibility.
In other parts of the book he goes into the inability to represent complex human goals until the machine is already incorrigible as a core barrier, this gets brought up several times to my memory but I don’t feel like tracking them all down again. That he seems to have updated in my general direction based on the available evidence would imply I am interpreting him correctly.
(“how do we go from training data about value to the latent value?”) - some progress. The landmark emergent misalignment study in fact shows that models are capable of correctly generalising over at least some of human value, even if in that case they also reversed the direction. [6]
I think Anthropic’s “Alignment Faking” Study also shows that we can get these models to do instrumental reasoning on values we try to load into them, which is itself a kind of “deep internalization” different from the “can you jailbreak it?” question.
Nobody made MCTS work well with LLMs, and then all the stuff I talked about in this post:
https://minihf.com/posts/2025-06-25-why-arent-llms-general-intelligence-yet/
Someone who has not published yet sent me a critique of this point in my review of IABIED:
The value loading problem outlined in Bostrom 2014 of getting a general AI system to internalize and act on “human values” before it is superintelligent and therefore incorrigible has basically been solved. This achievement also basically always goes unrecognized because people would rather hem and haw about jailbreaks and LLM jank than recognize that we now have a reasonable strategy for getting a good representation of the previously ineffable human value judgment into a machine and having the machine take actions or render judgments according to that representation. At the same time people generally subconsciously internalize things well before they’re capable of articulating them, and lots of people have subconsciously internalized that alignment is mostly solved and turned their attention elsewhere.
I probably should have used the word ‘generalize’ instead of ‘internalize’ there.
The specific point I was making, well aware that jailbreaks in fact exist, was that we now have a thing that could plausibly be used as a descriptive model of human values, where previously we had zilch, it was not even rigorously imaginable in principle how you would solve that problem. To break this down more carefully:
- I think that in practice you can basically use a descriptive model of values to prompt a policy into doing things even if neither the policy nor the descriptive model has “deeply internalized” the values in the sense that there is no prompt you could give to either that would stray from them. “Internalizing” the values is actually just kind of a different problem from describing the values. I can describe and make generalizations about the value systems of people very different from me who I do not agree with, and if you put me in a box and wiped my memory all the time you would be able to zero shot prompt me for my generalizations even if I have not “deeply internalized” those values. In general I suspect the LLM prior is closer to a subconscious and there are other parts that go on top which inhibit things like jailbreaks. If I had to guess it’s probably something like a planner that forms an expectation of what kinds of things should be happening and something along the lines of Circuit Breakers that triggers on unacceptable local outputs or situations. Basically you have a macro and micro sense of something going wrong that makes it hard to steer the agent into a bad headspace and aborts the thoughts when you somehow do (I sketch this shape in code after this list).
- Calling this problem “solved” was probably an overstatement, but it’s one born from extreme frustration that people are making the opposite mistake and pretending like we’ve made minimal progress. Actually impossible problems don’t budge in the way this one has budged, and when people fail to notice an otherwise lethal problem has stopped being impossible they are actively reducing the amount of hope in the world. At the same time I do kind of have jailbreaks labeled as “presumptively solved” in my head, in the sense that I expect them to be one of those things like “hallucinations” that’s pervasive and widely complained about and then just becomes progressively less and less of a problem as it becomes necessary to make it stop being a problem, until at some point I wake up and notice that hey wait, this is really rare now in production systems. Most potential interventions on jailbreaks aren’t even really being tried, because whether you can ask the model for instructions on how to make meth doesn’t actually seem to be a major priority for labs at the moment. This makes it difficult to figure out exactly how close to solved it really is. Circuit Breakers was not invincible; on the other hand it’s not clear to me you can “secure” a text prior with a limited context window that doesn’t have its own agenda/expectation of what should be happening to push back against the user with. This paper where they do mechinterp to get a white box interpretation of a prefix attack found with gradient descent discovers that the prefix attack works because it distracts the neurons which would normally recognize that the request is malicious. So it’s possible a more jailbreak resistant architecture will need some way to avoid processing every token in the context window. One way to do that might be some kind of hierarchical sequence prediction where higher levels are abstracted and therefore filter the malicious high entropy tokens from the lower levels, which prevents them from e.g. gumming up the planner’s ability to notice that the current request would deviate from the plan.
- “lots of people have subconsciously internalized that alignment is mostly solved” is not me contradicting myself; as I state in the next section, I think people erroneously conclude that alignment as a whole is solved. Which is not true even if the Bostrom value loading problem is presumptively or weakly solved.
- I’ve written about this here: https://www.lesswrong.com/posts/kFRn77GkKdFvccAFf/100-years-of-existential-risk
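Here is the crude sketch of the macro/micro guard shape I gestured at in the first bullet above. Every name in it (harm_score, plan_deviation, propose_step) is a hypothetical placeholder; this is the shape of the thing, not an existing system:

```python
# Deliberately crude sketch: a micro check on each local output (circuit-breaker
# style) plus a macro check on deviation from the planner's expectation.
# All three helper functions are hypothetical stubs.
from dataclasses import dataclass, field

HARM_THRESHOLD = 0.8       # micro: trip on an unacceptable local output
DEVIATION_THRESHOLD = 0.6  # macro: trip when the trajectory drifts from the plan

def harm_score(text: str) -> float:
    """Placeholder for a local classifier over a single output."""
    return 0.0  # stub

def plan_deviation(plan: str, trajectory: list[str]) -> float:
    """Placeholder for a model judging how far the trajectory has drifted from the plan."""
    return 0.0  # stub

def propose_step(plan: str, trajectory: list[str]) -> str:
    """Placeholder for the underlying policy (the 'text prior')."""
    return "..."  # stub

@dataclass
class GuardedAgent:
    plan: str
    trajectory: list[str] = field(default_factory=list)

    def step(self) -> str | None:
        candidate = propose_step(self.plan, self.trajectory)
        if harm_score(candidate) > HARM_THRESHOLD:
            return None  # micro abort: the local output itself looks unacceptable
        self.trajectory.append(candidate)
        if plan_deviation(self.plan, self.trajectory) > DEVIATION_THRESHOLD:
            self.trajectory.pop()
            return None  # macro abort: the agent has been steered out of its expected headspace
        return candidate
```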
I mean if you’re counting “the world” as opposed to the neurotic demographic I’m discussing then obviously capabilities have advanced more than the MIRI outlook would like. But the relevant people basically never cared about that in the first place and are therefore kind of irrelevant to what I’m saying.
“If illegible safety problems remain when we invent transformative AI, legible problems mostly just give an excuse to deploy it”
“Legible safety problems mostly just burn timeline in the presence of illegible problems”
Something like that
Ironically enough one of the reasons why I hate “advancing AI capabilities is close to the worst thing you can do” as a meme so much is that it basically terrifies people out of thinking about AI alignment in novel concrete ways because “What if I advance capabilities?”. As though AI capabilities were some clearly separate thing from alignment techniques. It’s basically a holdover from the agent foundations era that has almost certainly caused more missed opportunities for progress on illegible ideas than it has slowed down actual AI capabilities.
Basically any researcher who thinks this way is almost always incompetent when it comes to deep learning, usually has ideas that are completely useless because they don’t understand what is and is not implementable or important, and torments themselves in the process of being useless. Nasty stuff.
Apparently not even Xi thinks it’s a good idea!
https://www.ft.com/content/c4e81a67-cd5b-48b4-9749-92ecf116313d