Viliam comments on Linkpost: A tale of 2.5 orthogonality theses

Viliam 14 Mar 2023 21:56 UTC
2 points
Active deception seems unlikely; I agree with that part (weak opinion, I didn’t spend much time thinking about it). At the moment, my risk model is roughly: “AI destroys everything humans were not paying attention to… plus a few more things after the environment changes dramatically”.
(Humans care about 100 different things. When you train the AI, you check 10 of them, and the AI sincerely tells you whether it cares about them or not. You make the AI care about all 10 of them and run it. The remaining 90 things are now lost. Of the 10 preserved things, 1 was specified a little incorrectly, and now it is too late to do anything about it. As a consequence of losing the 90 things, the environment changes so much that 2 more of the 10 preserved things no longer make much sense in the new context, so they are effectively lost as well. Humanity gets to keep 7 of the 100 things it originally cared about.)
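The arithmetic in the example above can be spelled out as a small Python sketch. The function name and parameter names are made up for illustration; the default numbers are the ones from the example, and they are a toy scenario, not estimates.

```python
def values_preserved(total=100, checked=10, misspecified=1, lost_to_context=2):
    """Toy bookkeeping for the scenario: how many values survive?

    All names here are hypothetical; the defaults mirror the example's numbers.
    """
    # Values nobody checked during training are lost outright.
    unchecked_lost = total - checked
    # Of the checked values, some were specified incorrectly and some stop
    # making sense once the environment has changed.
    surviving = checked - misspecified - lost_to_context
    return {"unchecked_lost": unchecked_lost, "surviving": surviving}


result = values_preserved()
print(result)
```

Changing `checked` shows how strongly the outcome depends on how many things were paid attention to in the first place, which is the point of the risk model.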