At step 8, why is the AI motivated to care about the idealized goal rather than just the reward signal? Are we assuming that the reward signal is determined by performance with respect to the ideal goal?
That is the aim. It's easy to program an AI that doesn't care too much about the reward signal; the trick is to make it not care in a specific way, one that aligns its behaviour with our preferences.
E.g., what would you do if you had been told to maximise some goal, but warned that your reward signal would be corrupted and over-simplified? In that situation you could start doing things to maximise your chance of not wireheading; I want to program the AI to do likewise.
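To make the idea concrete, here is a toy sketch (my own illustration, not anything from the post itself) of an agent that treats the observed reward as corrupted *evidence* about a hidden ideal goal, and then acts to maximise its posterior estimate of that goal rather than the raw signal. The candidate goals, the corruption channel, and all the numbers are made-up assumptions purely for illustration.

```python
# Toy sketch: treat the reward signal as a corrupted, over-simplified
# report of a hidden "ideal" goal, infer the goal, and optimise the
# inferred goal instead of the raw signal. All details are illustrative.

ACTIONS = [0, 1, 2, 3]

# Hypothetical candidate goals: mappings from action to true value.
CANDIDATE_GOALS = {
    "g_a": {0: 0.0, 1: 1.0, 2: 0.2, 3: 0.5},
    "g_b": {0: 0.9, 1: 0.1, 2: 0.8, 3: 0.3},
}

def corrupted_reward(true_value):
    """Over-simplified channel: the signal only reports a coarse threshold."""
    return 1.0 if true_value > 0.5 else 0.0

def likelihood(goal, action, observed):
    """P(observed | goal): high if the channel would emit it, low otherwise."""
    return 0.9 if corrupted_reward(goal[action]) == observed else 0.1

def posterior(observations, prior=None):
    """Bayesian update over candidate goals from (action, observed) pairs."""
    belief = dict(prior or {g: 1 / len(CANDIDATE_GOALS) for g in CANDIDATE_GOALS})
    for action, observed in observations:
        for g in belief:
            belief[g] *= likelihood(CANDIDATE_GOALS[g], action, observed)
    total = sum(belief.values())
    return {g: p / total for g, p in belief.items()}

def best_action(belief):
    """Pick the action with highest expected *true* value under the belief,
    rather than the action that maximises the observed reward signal."""
    def expected_value(a):
        return sum(p * CANDIDATE_GOALS[g][a] for g, p in belief.items())
    return max(ACTIONS, key=expected_value)
```

The point of the sketch is the last function: because the agent optimises its belief about the ideal goal, seizing control of the reward channel doesn't help it, which is the "not wireheading" behaviour described above.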