In both cases, the AI behaves (during training) in a way that looks a lot like trying to make people happy. Then the AI described in (1) is unfriendly because it was optimizing the wrong concept of “happiness”, one that lined up with yours when the AI was weak, but that diverges in various edge-cases that matter when the AI is strong. By contrast, the AI described in (2) was never even really trying to pursue happiness; it had a mixture of goals that merely correlated with the training objective, and that balanced out right around where you wanted them to balance out in training, but deployment (and the corresponding capabilities-increases) threw the balance off.
I don’t quite understand the distinction you’re drawing here.
In both cases the AI was never trying to pursue happiness. In both cases it was pursuing something else, shmappiness, that correlated strongly with causing happiness in the training environment but not in the deployment environment. In both cases strength matters for making this disastrous, since a stronger AI will find more disastrous ways of pursuing shmappiness. It’s just that it is pursuing different varieties of shmappiness in the two cases.
I don’t have a view on whether “goal misgeneralisation” as a term is optimal for this kind of thing.
I feel like I don’t understand how this model explains the biggest mystery: experiences sometimes move your beliefs in the opposite direction from the one they should.
Shouldn’t your experience still be less terrifying than you expected it to be, because you’re combining your dogs-are-terrifying-at-level-10 prior with the raw evidence (however constricted that channel is), so your update should still be against dogs being terrifying at level 10 (maybe level 9.9?)?
Maybe the answer is the thing smountjoy said below in response to your caption: that we don’t have gradations in our beliefs about things (dogs are either terrifying or not), and then you have another example of dogs being terrifying to update with. FWIW that sounds unlikely to me. In my experience, people do tend to have gradations in how evil Republicans are or how terrifying dogs are. Though maybe that gets disabled in these cases, which seems like it would explain it.
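To make the "level 9.9" intuition above concrete, here is a minimal sketch of the kind of update I have in mind. It assumes (purely for illustration, none of this is from the post) that the belief about how terrifying dogs are is a Gaussian, that the experience arrives as a single noisy Gaussian observation, and that the specific numbers and the `gaussian_update` helper are hypothetical.

```python
def gaussian_update(prior_mean, prior_var, obs, obs_var):
    """Standard conjugate update for a Gaussian prior and one Gaussian observation.

    The posterior mean is a precision-weighted average of the prior mean
    and the observation, so it always lies between the two.
    """
    prior_precision = 1.0 / prior_var
    obs_precision = 1.0 / obs_var
    post_var = 1.0 / (prior_precision + obs_precision)
    post_mean = post_var * (prior_precision * prior_mean + obs_precision * obs)
    return post_mean, post_var


# Hypothetical numbers: a confident "dogs are terrifying at level 10" prior,
# and a mild encounter (level 3) seen through a very noisy channel.
prior_mean, prior_var = 10.0, 0.1
obs, obs_var = 3.0, 7.0

post_mean, post_var = gaussian_update(prior_mean, prior_var, obs, obs_var)
print(f"posterior mean: {post_mean:.2f}")  # lands a little below 10, about 9.9
```

However narrow the evidence channel is (however large `obs_var` gets), the posterior under this toy model never moves above the prior when the observation is below it. So getting the reverse update seems to need something beyond a constricted channel, which is the part I don't see the model explaining.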