(JDP, you’re welcome to chime in and demonstrate that your writing was actually perfectly clear and that I’m just also failing basic reading comprehension.)
So clearly, Bostrom is not saying “the AI will not understand what we mean.”
Consider the AI at two different points in time, AI-when-embryo early in training and AI-when-superintelligence at the end.
The quote involves Bostrom (a) literally saying that AI-when-superintelligence will understand what we meant,[1] (b) making a statement which logically implies, as an antecedent, that “AI-when-embryo won’t understand what we meant”.[2] Therefore, you can logically infer from this quote that Bostrom believes that the statement “AIs will in fact probably understand what we mean by the goals we give them before they are superintelligent” is false.
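Spelled out schematically (a sketch of this inference in notation that appears nowhere in Bostrom or the original bullet): let $U_{\text{embryo}}$ and $U_{\text{super}}$ stand for “AI-when-embryo understands what we meant” and “AI-when-superintelligence understands what we meant,” and let $S$ be Bostrom’s “its final goal is to make us happy, not to do what the programmers meant” sentence.

$$
\begin{aligned}
\text{(a)}&\quad U_{\text{super}} \\
\text{(b)}&\quad S \text{ is asserted, and } S \text{ only makes sense given } \lnot U_{\text{embryo}} \\
\Rightarrow&\quad \lnot U_{\text{embryo}}
\end{aligned}
$$

That last step is what lets the quote be read as Bostrom denying that the AI understands what we mean before it is superintelligent.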
JDP, in my understanding, assumes that the reader would do just that: automatically zero in on (b), infer the antecedent from it, and dismiss (a) as irrelevant context. I love it when blog posts have lil’ tricksy logic puzzles in them.
Clearly, that sentence is an atrocious one for demonstrating that “Bostrom said that the AI will not understand what we mean.”
“The AI may indeed understand that this is not what we meant.”

“However, [AI-when-superintelligence’s] final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal[, because AI-when-embryo “misunderstood” that code’s intent.]”
This is correct, though that particular chain of logic doesn’t actually imply the “before superintelligence” part, since there is a space between embryo and superintelligence where it could theoretically come to understand. I argue why I think Bostrom implicitly rejects this, or thinks it must be irrelevant, in the 13 steps above. But I think it’s important context that, to me, this doesn’t come out as 13 steps or a bunch of sys2 reasoning: I just look at the thing and see the implication, and then have to do a bunch of sys2 reasoning to articulate it if someone asks. It doesn’t feel like a hard thing from the inside, so I wouldn’t expect it to be hard for someone else either. From my perspective it basically came across as bad faith, because I literally could not imagine that someone wouldn’t understand what I was talking about until several people went “no, I don’t get it”; that’s how basic it feels from the inside here. I now understand that it actually isn’t obvious; the hostile tone above was frustration from not knowing that yet.
I see! Understandable, but yep, I think you misjudged the inferential distance there a fair bit.
Clearly! I’m a little reluctant to rephrase it until I have a version that I know conveys what I actually meant, but one that would be very semantically close to the original would be:
“—Contra Bostrom 2014 it is possible to get high quality, nuanced representations of concepts like “happiness” at training initialization. The problem of representing happiness and similar ideas in a computer will not be first solved by the world model of a superintelligent or otherwise incorrigible AI, as in the example Bostrom gives on page 147 in the 2017 paperback under the section “Malignant Failure Modes”: “But wait! This is not what we meant! Surely if the AI is superintelligent, it must understand that when we asked it to make us happy, we didn’t mean that it should reduce us to a perpetually repeating recording of a drugged-out digitized mental episode!”—The AI may indeed understand that this is not what we meant. However, its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal.””
Part of why I didn’t write it that way in the first place is it would make it a lot bulkier than the other bullet points, so I trimmed it down.
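For concreteness, here is a minimal sketch, not part of the original exchange and only one possible reading of the claim, of what “a nuanced representation of a concept like happiness, available without anyone hand-coding it” could cash out to: asking an off-the-shelf pretrained model (here the Hugging Face transformers zero-shot-classification pipeline, an assumption of this sketch rather than anything JDP names) whether an outcome matches the ordinary meaning of making someone happy.

```python
# Minimal sketch, not from the original discussion: one way a "representation of
# happiness" can exist in a computer without anyone hand-coding a definition of it.
# Assumes the `transformers` library (and its default zero-shot NLI model) is available.
from transformers import pipeline

classifier = pipeline("zero-shot-classification")

# Bostrom's wireheading outcome, paraphrased from the quoted passage.
outcome = (
    "Humanity is reduced to a perpetually repeating recording of a "
    "drugged-out digitized mental episode."
)

labels = [
    "what people ordinarily mean by making someone happy",
    "a degenerate proxy for happiness, such as wireheading",
]

# The pretrained model scores how well each label describes the outcome;
# no explicit definition of "happiness" was ever written down here.
result = classifier(outcome, candidate_labels=labels)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{score:.2f}  {label}")
```

Whether scores like these actually track the distinction Bostrom cares about is exactly what the disagreement above is over; the snippet is only meant to make “representing happiness in a computer” concrete.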