Jisk, formerly Jacob. (And when Jacobs are locally scarce, still Jacob.)
LW has gone downhill a lot from its early days, and I disapprove of most of the moderation choices, but I'm still, sometimes, here.
It should be possible to easily find me from the username I use here, though not vice versa, for interview reasons.
I believe this to be a severe misunderstanding of corrigibility. It may on its own be fatal to this argument; I’m unsure.
Distinguish three things:
- 'Do what I say'
- 'Do what I mean'
- 'Do what I want'
‘Do what I say’ is obviously bad: from Tithonus to the Sorcerer’s Apprentice and Amelia Bedelia, we have extensive cultural warnings against it. ‘Do what I want’ is the ideal, the AI which does not need to be asked; in short, Friendly. CAST AI is neither. It is ‘Do what I mean.’ It will determine what you intended, and warn you about negative consequences, but if you clarify that you want it to proceed anyway, it will proceed, even though it believes (arguendo, correctly) that you will regret it. In the limit it may suggest tasks it thinks you desire, or ask for broad permission to act on your behalf, but it will not act beyond its instructions. This is at the root both of why it is easier than value alignment and of why it is safer to approach gradually.
That does not imply it will succeed. CAST is valuable because, if the approach works, it is at every stage amenable to correction when its interpretation of the principal’s desires is in error. As complexity grows, so does the complexity of those errors. We should expect some subtle errors to remain hidden until the scale and complexity of the AI’s potential actions grow; if those errors cause the corrigible property to disappear just as correction becomes impossible, that reduces the value of CAST significantly.
I did not read any of the CAST sequence as claiming a relative advantage, but an absolute one. I read it as saying ‘Here is an approach with a high chance of success, and far greater feasibility than value alignment.’ If these caveats leave it the best of a bad lot, but no longer with a high chance of success, then they are appropriate.
Also, if corrigibility fails only as power and intelligence grow, then the system is effectively deceptively aligned.