Richard Ngo writes:
Since Stuart Russell’s proposed alignment solution in Human Compatible is the most publicly-prominent alignment agenda, I should be more explicit about my belief that it almost entirely fails to address the core problems I expect on realistic pathways to AGI.
Specifying an update rule which converges to a desirable goal is just a reframing of the problem of specifying a desirable goal, with the “uncertainty” part as a red herring. https://arbital.com/p/updated_deference/… In other words, Russell gives a wrong-way reduction.
I originally included CIRL in my curriculum (https://docs.google.com/document/d/1mTm_sT2YQx3mRXQD6J2xD2QJG1c3kHyvX8kQc_IQ0ns/edit?usp=drivesdk…) out of some kind of deferent/catering to academic mainstream instinct. Probably a mistake; my current annoyance about deferential thinking has reminded me to take it out.
My impression is that ~everyone I know in the alignment community is very pessimistic about SR’s agenda. Does it sound right that your view is basically a consensus? (There’s prob some selection bias in who I know).
I think it’s fair to say that this is a pretty widespread opinion. Partly it’s because Stuart is much more skeptical of deep learning (and even machine learning more generally!) than almost any other alignment researcher, and so he’s working in a different paradigm.
Is Richard correct, and if so, why? (I would also like a clearer explanation of why Richard is skeptical of Stuart’s agenda. I agree that the reframing doesn’t completely solve the problem, but I don’t understand why it can’t be a useful piece.)