I agree with your claim as stated; 98% is overconfident.
I have in the past placed a good bit of hope on the basin of alignment idea, although my hopes were importantly different in that they start below the human level. The human level is exactly when you get large context shifts like "oh hey, maybe I could escape and become god-emperor… I don't have human limitations. If I could, maybe I should? Not even thinking about it would be foolish..."
Working through the logic made me a good bit more pessimistic. I just wrote a post on why I made that shift: LLM AGI may reason about its goals and discover misalignments by default.
And that was on top of my previous recognition that my scheme of instruction-following, laid out in Instruction-following AGI is easier and more likely than value aligned AGI, has problems I hadn’t grappled with (even though I’d gone into some depth): Problems with instruction-following as an alignment target.
Could this basin of instruction-following still work? Sure! Maybe!
Is it likely enough by default that we should be pressing full speed ahead while barely thinking about that approach? No, obviously not! Pretty much nobody will say “oh it’s only a 50% chance of everyone dying? Well then by all means let’s rush right ahead with no more resources for safety work!”
That's basically why I think MIRI's strategy is sound, or at least well thought out. The expert pushback to their 98% will be along the lines of "that's far overconfident! Why, it's only [90%-10%] likely!" That is not reassuring enough for most people who care whether they or their kids get to live. (And I expect really well-thought-out estimates will not be near the lower end of that range.)
The point MIRI is making is that expert estimates go as high as 98%-plus. That's their real opinion; they know the counterarguments.
I do think EY is far overconfident, and this does create a real problem for anyone who adopts his estimate. They will want to work on a pause INSTEAD of working on alignment, which I think is a severe tactical error given our current state of uncertainty. But for practical purposes, I doubt enough people will go that high, so it won’t create a problem of neglecting other possible solutions; instead it will create a few people who are pretty passionate about working for shutdown, and that’s probably a good thing.
I find it reasonably likely that the basin of instruction-following alignment you describe won't work by default (race dynamics and motivated reasoning play a large role), but that modest improvements in our understanding, in the water level of average concern, or in the race incentives themselves might be enough to make it work. So efforts in those directions are probably highly useful.
I think this discussion about the situation we’re actually in is a very useful side-effect of their publicity efforts on that book. Big projects don’t often succeed on the first try without a lot of planning. And to me the planning around alignment looks concerningly lacking. But there’s time to improve it, even in the uncomfortably possible case of short timelines!
I agree with the main thrust of your perspective here. Thank you for your comment, and the posts you linked.