This process can be rightfully called UNDERSTANDING, and when an AI system fails at this it has FAILED TO UNDERSTAND YOU.
No, the rightful way to describe what happens is that the training process generates an AI system with unintended functionality due to your failure to specify the training objective correctly. Describing it as a “misunderstanding” is tantamount to saying that if you make a syntax error when writing some code, the proper way to describe it is the computer “misunderstanding” you.
I mean, you can say that; it’s an okay way to describe things colloquially or metaphorically. But I contest that it’s in any way standard language. You’re using idiosyncratic terminology and should in no way be surprised when people misunderstand (ha) you.
Honestly, if you went to modern-day LLMs and they, specialists in reading comprehension, misunderstood you, that ought to update you in the direction of “I did a bad job phrasing this”, not “it’s everyone else who’s wrong”.
(FYI, I understood what you meant in your initial reply to Habryka without this follow-up explanation, and I still thought you were phrasing it in an obviously confusing way.)
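To make the point in my first paragraph concrete, here’s a minimal toy sketch (all names and the “nice-word” proxy are made up purely for illustration): the objective actually written down differs from the intended one, and a dumb optimizer exploits the gap without anything in the loop ever parsing the designer’s intent, let alone misunderstanding it.

```python
# Toy illustration (my framing, not anyone's actual setup): the "designer"
# wants concise, on-topic output, but the objective they actually write down
# only scores the presence of nice-sounding words. A greedy hill-climbing
# "trainer" then produces degenerate output. Nothing here ever parses or
# "understands" the designer's intent; the objective is simply mis-specified.
import random

NICE_WORDS = ["great", "happy", "wonderful"]

def intended_quality(text: str) -> int:
    """What the designer actually wanted: short and on-topic (toy stand-in metric)."""
    return int("the report is finished" in text) - len(text.split()) // 20

def specified_objective(text: str) -> int:
    """What the designer actually wrote down: count nice-sounding words."""
    words = text.split()
    return sum(words.count(w) for w in NICE_WORDS)

def hill_climb(steps: int = 2000) -> str:
    """Greedy 'training': mutate the text, keep mutations that raise the written-down score."""
    text = "the report is finished"
    for _ in range(steps):
        candidate = text + " " + random.choice(NICE_WORDS)
        if specified_objective(candidate) > specified_objective(text):
            text = candidate
    return text

result = hill_climb()
print("specified score:", specified_objective(result))
print("intended quality:", intended_quality(result))
# High specified score, terrible intended quality: unintended functionality,
# with no step anywhere that could meaningfully "misunderstand" anything.
```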
Describing it as a “misunderstanding” is tantamount to saying that if you make a syntax error when writing some code, the proper way to describe it is the computer “misunderstanding” you.
Honestly, maybe it would make more sense to say that the cognitive error here is taking a compiler for a context-free grammar as the reference class for your intuitions, as opposed to a mind that understands natural language. The former is not expected to understand you when what you say doesn’t fully match what you mean; the latter very much is, and the latter is the only kind of thing that’s going to have the proper referents for concepts like “happiness”.
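To be concrete about the former reference class, here’s a minimal illustration with Python’s own parser standing in for any strict-grammar system:

```python
# Python's parser as a stand-in for the "compiler for a context-free grammar"
# reference class: it rejects input the moment the text fails to match the
# grammar, however obvious the intent is to a human reader.
import ast

almost_python = "print('hello world'"   # missing one closing parenthesis

try:
    ast.parse(almost_python)
except SyntaxError as e:
    # No guessing at what was meant -- just a report that the grammar wasn't matched.
    print("SyntaxError:", e.msg)
```

A mind that understands natural language would, of course, just close the parenthesis for you.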
I mean, no mind really exists at the time the “misunderstanding” is starting to happen, no? Unless you want to call a randomly initialized NN (i.e., basically a random program) a “mind”… Which wouldn’t necessarily be an invalid frame to use. But I don’t think it’s the obviously correct frame either, and so I don’t think that people who use a mechanistic frame by default are unambiguously in error.
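A minimal sketch of what I mean by “basically a random program” (toy NumPy example, sizes arbitrary):

```python
# An untrained two-layer network maps inputs to outputs that reflect nothing
# but the random draw of its weights -- there is no referent for "happiness"
# or anything else in there yet.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 8))   # randomly initialized weights
W2 = rng.normal(size=(8, 4))

def random_net(x: np.ndarray) -> np.ndarray:
    return np.maximum(x @ W1, 0.0) @ W2   # ReLU MLP, no training

x = rng.normal(size=(3, 16))    # arbitrary "inputs"
print(random_net(x))            # arbitrary outputs: a pure noise function
```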
I note that in your step-by-step explanation, the last bullet is:
Therefore Bostrom expects we will not have an AI that correctly understands concepts like intelligence until after it is already superintelligent.
That is straightforwardly correct. But “there exists no AI that understands” is importantly different from “there exists an AI which misunderstands”.
Another questionable frame here is characterizing the relationship between an AI and the SGD/the training process shaping it as some sort of communication process (?), such that the AI ending up misshapen can be described as it “misunderstanding” something.
And the training process itself never becomes a mind; it starts and ends as a discrete program. So if you mean to say that it “misunderstood” something, I think that’s a type error, or at best a metaphor.
(I guess it may still be valid from a point of view where you frame SGD updates as Bayesian updates, or something along those lines? But that’s also a non-standard frame.)
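To gesture at the frame I mean (notation mine, purely illustrative): each SGD step descends a negative-log-likelihood loss,

$$
\theta_{t+1} = \theta_t - \eta\, \nabla_\theta\big[-\log p(\mathcal{D}_t \mid \theta_t)\big],
$$

and the point it converges toward, with a regularizer standing in for a prior, roughly coincides with the MAP estimate

$$
\theta_{\text{MAP}} = \arg\max_\theta \big[\log p(\mathcal{D} \mid \theta) + \log p(\theta)\big],
$$

so one can squint at training as the data “updating” the model, which is about the only reading on which “misunderstanding” isn’t a type error.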
in practice, we seem to train the world model and understanding machine first and the policy only much later, as a thin patch on top of the world model. this is not guaranteed to stay true but seems pretty durable so far. thus, the relevant heuristics are about base models, not about randomly initialized neural networks.
separately, I do think randomly initialized neural networks have some strong baseline of fuzziness and conceptual corrigibility, which is in a sense what it means to have a traversable loss landscape.
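a rough sketch of the first point, in code (toy pytorch, my own architecture choices, not anyone's actual recipe):

```python
# "world model first, policy as a thin patch on top": a base network is
# trained on a generic predictive objective, then frozen, and only a small
# head is trained afterward on the task objective. this mirrors the
# pretrain-then-finetune pattern, nothing more.
import torch
import torch.nn as nn

base = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))  # "world model"
policy_head = nn.Linear(64, 4)                                          # thin "policy"

# stage 1: (pre)train the base on some generic predictive objective...
# (omitted here; imagine next-token prediction or similar)

# stage 2: freeze the base, train only the small head on the task objective.
for p in base.parameters():
    p.requires_grad_(False)

opt = torch.optim.SGD(policy_head.parameters(), lr=1e-2)
x, target = torch.randn(8, 32), torch.randint(0, 4, (8,))
loss = nn.functional.cross_entropy(policy_head(base(x)), target)
loss.backward()
opt.step()
```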