I don’t think I grok the distinction here: (1) just seems to me like a particular case of (2).
If the optimum of the AI-happiness concept is found in opiates (which it can only attain in deployment) (1), that’s just because, all along, what was producing its apparently correct behavior was a balance of other goals (or even just one), and this balance shifts in deployment (2).
Said another way: the difference between the AI pursuing one clean alt-happy concept, as opposed to a thousand shards of desire, seems only quantitative. Both are different instances of the same high-level failure: the AI got a wrong goal.
Analogously, “get an instance of the AI’s concept to match the concept you hoped to transmit” and “figure out how to get the AI to care about one concept in particular” seem like the same problem to solve: getting the right goal into the AI.
Maybe in this post you are indeed only pointing at this quantitative difference: sometimes we get “a clean goal” but the wrong one, and sometimes we don’t even know how to get a clean goal, in which case that has to be fixed “before” we can even start trying to get the right goal. If that were the case, I’d be skeptical that this is a useful distinction, since it seems like a fuzzy spectrum that doesn’t carve reality at its joints. As further intuition: what would it even mean to “figure out how to get the AI to care about one concept in particular”, without already knowing how to instill a concrete goal into the AI? The only thing I can think of is “we can ensure the AI ends up with one clean goal, but not that it’s the correct one”, but I think what constitutes “one clean goal” as opposed to a thousand shards of desire is ill-defined, because it is ontology-subjective.
But I’m not sure this is it, because I also can’t parse the second part of the post. You point at not noticing the distinction as a cause of people not worrying enough about goal misgeneralization in general, or, better said, believing that (1) will be solved, and with it (2). But clearly what’s causing people to ignore these issues (as you exemplify it) is not missing this distinction, but missing a basic fact about goals: the difference between the AI knowing something and the AI caring about something. I don’t see how the failure to see this basic fact is intertwined with your distinction.