My understanding: JDP holds that when the training process chisels a wrong goal into an AI because we gave it a wrong training objective (e.g., “maximize smiles” while we want “maximize eudaimonia”), this event could be validly described as the AI “misunderstanding” us.
So when JDP says that “AIs will in fact probably understand what we mean by the goals we give them before they are superintelligent”, and claims that this counters this Bostrom quote...
“The AI may indeed understand that this is not what we meant. However, its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal.”

… what JDP means to refer to is the “its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal” part, not the “the AI may indeed understand that this is not what we meant” part. (Pretend the latter part doesn’t exist.)
Reasoning: The fact that the AI’s goal ended up at “maximize happiness” after being trained against the “maximize happiness” objective, instead of at whatever the programmers intended by the “maximize happiness” objective, implies that there was a moment earlier in training when the AI “misunderstood” that goal (in the sense of “misunderstand” described in my first paragraph).
JDP then holds that this won’t happen, contrary to that part of Bostrom’s statement: that training on “naïve” pointers to eudaimonia like “maximize smiles” and such will Just Work, that the SGD will point AIs at eudaimonia (or at corrigibility or whatever we meant).[1] Or, in JDP’s parlance, that the AI will “understand” what we meant by “maximize smiles” well before it’s superintelligent.
If you think that this use of “misunderstand” is wildly idiosyncratic, or that JDP picked a really bad Bostrom quote to make his point, I agree.
(Assuming I am also not misunderstanding everything, there sure is a lot of misunderstanding around.)
[1] Plus/minus some caveats and additional bells and whistles like e.g. early stopping, I believe.
I want to flag that thinking you have a representation that could in principle be used to do the right thing is not the same as believing it will “Just Work”. If you do a naive RL process on neural embeddings or LLM evaluators, you will definitely get bad results. I do not believe in “alignment by default” and push back on such things frequently whenever they’re brought up. What has happened is that the problem has gone from “not clear how you would do this even in principle, basically literally impossible with current knowledge” to merely tricky.
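To make the “naive RL process” failure mode concrete, here is a deliberately minimal toy sketch (mine, not JDP’s; the “evaluator” is a stand-in linear scorer and every name in it is hypothetical). Greedy, unregularized optimization against a frozen learned score drifts arbitrarily far from anything the score was ever meaningful for:

```python
# Toy sketch (hypothetical, not from the discussion above): naively optimizing
# a state vector against a frozen "happiness" scorer, with nothing anchoring
# the state to anything plausible.
import numpy as np

rng = np.random.default_rng(0)
DIM = 32

# Stand-in for a frozen learned scorer, e.g. a linear probe / reward head.
happiness_probe = rng.standard_normal(DIM)

def proxy_reward(state: np.ndarray) -> float:
    """Frozen evaluator: an unnormalized linear score (think: a logit)."""
    return float(state @ happiness_probe)

def distance_from_typical(state: np.ndarray) -> float:
    """Crude proxy for how far the state has drifted off-distribution."""
    return float(np.linalg.norm(state))

# Naive optimization: greedy hill-climbing on the proxy, with no KL penalty,
# no reference distribution, no ground-truth feedback.
state = rng.standard_normal(DIM)
for _ in range(2000):
    candidate = state + 0.1 * rng.standard_normal(DIM)
    if proxy_reward(candidate) > proxy_reward(state):
        state = candidate

print("proxy reward:         ", round(proxy_reward(state), 2))
print("distance from typical:", round(distance_from_typical(state), 2))  # vs ~5.7 for a fresh draw
```

The toy is only meant to illustrate the distinction JDP is drawing: the representation (the probe) can be perfectly serviceable while unconstrained argmaxing against it still produces garbage, so “we have a usable pointer” does not collapse into “alignment by default”.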
not the “the AI may indeed understand that this is not what we meant” part. (Pretend the latter part doesn’t exist.)

Ok, but the latter part does exist! I can’t ignore it. Like, it’s a sentence that seems almost explicitly designed to clarify that Bostrom thinks the AI will understand what we mean. So clearly, Bostrom is not saying “the AI will not understand what we mean”. Maybe he is making some other error in the book (about whether, when the AI understands the way it does, it has to be corrigible, or about whether “happiness” is a confused kind of model of what an AI might want to optimize), but clearly that sentence is an atrocious sentence for demonstrating that “Bostrom said that the AI will not understand what we mean”. Like, he literally said the opposite right there, in the quote!
(JDP, you’re welcome to chime in and demonstrate that your writing was actually perfectly clear and that I’m just also failing basic reading comprehension.)
So clearly, Bostrom is not saying “the AI will not understand what we mean”

Consider the AI at two different points in time: AI-when-embryo early in training and AI-when-superintelligence at the end.
The quote involves Bostrom (a) literally saying that AI-when-superintelligence will understand what we meant,[1] (b) making a statement which logically implies, as an antecedent, that “AI-when-embryo won’t understand what we meant”.[2] Therefore, you can logically infer from this quote that Bostrom believes that the statement “AIs will in fact probably understand what we mean by the goals we give them before they are superintelligent” is false.
JDP, in my understanding, assumes that the reader would do just that: automatically zero in on (b), infer the antecedent from it, and dismiss (a) as irrelevant context.
I love it when blog posts have lil’ tricksy logic puzzles in them.

[1] “The AI may indeed understand that this is not what we meant.”

[2] “However, [AI-when-superintelligence’s] final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal[, because AI-when-embryo “misunderstood” that code’s intent.]”
Yep. This is correct, though that particular chain of logic doesn’t actually imply the “before superintelligence” part, since there is a gap between embryo and superintelligence where it could theoretically come to understand. The 13 steps above are my argument for why I think Bostrom implicitly rejects this possibility or thinks it must be irrelevant. But I think it’s important context that, to me, this doesn’t come out as 13 steps or a bunch of sys2 reasoning: I just look at the thing and see the implication, and then have to do a bunch of sys2 reasoning to articulate it if someone asks. To me it doesn’t feel like a hard thing from the inside, so I wouldn’t expect it to be hard for someone else either. From my perspective it basically came across as bad faith, because I literally could not imagine that someone wouldn’t understand what I was talking about until several people went “no, I don’t get it”; that’s how basic it feels from the inside here. I now understand that, no, this actually isn’t obvious; the hostile tone above was frustration from not knowing that yet.
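To spell out the conceded step and the missing one, in notation that is mine rather than the thread’s: write $U(t)$ for “at time $t$ the AI ‘understands’ what we meant, in the goal-tracking sense used above”, with $t_{\text{embryo}} < t_{\text{SI}}$. The chain gives $\neg U(t_{\text{embryo}})$, but

$$\neg U(t_{\text{embryo}}) \;\not\Longrightarrow\; \forall t < t_{\text{SI}}:\ \neg U(t),$$

since nothing in it rules out some intermediate $t$ with $t_{\text{embryo}} < t < t_{\text{SI}}$ at which $U(t)$ holds, which is exactly the “space between embryo and superintelligence” being pointed at.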
I see! Understandable, but yep, I think you misjudged the inferential distance there a fair bit.
clearly that sentence is an atrocious sentence for demonstrating that “Bostrom said that the AI will not understand what we mean”

Clearly! I’m a little reluctant to rephrase it until I have a version that I know conveys what I actually meant, but one that would be very semantically close to the original would be:
“—Contra Bostrom 2014 it is possible to get high quality, nuanced representations of concepts like “happiness” at training initialization. The problem of representing happiness and similar ideas in a computer will not be first solved by the world model of a superintelligent or otherwise incorrigible AI, as in the example Bostrom gives on page 147 in the 2017 paperback under the section “Malignant Failure Modes”: “But wait! This is not what we meant! Surely if the AI is superintelligent, it must understand that when we asked it to make us happy, we didn’t mean that it should reduce us to a perpetually repeating recording of a drugged-out digitized mental episode!”—The AI may indeed understand that this is not what we meant. However, its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal.”
Part of why I didn’t write it that way in the first place is it would make it a lot bulkier than the other bullet points, so I trimmed it down.