Prize for probable problems

Summary: I’m going to give a $10k prize to the best evidence that my preferred approach to AI safety is doomed. Submit by commenting on this post with a link by April 20.

I have a particular vision for how AI might be aligned with human interests, reflected in posts at ai-alignment.com and centered on iterated amplification.

This vision has a huge number of possible problems and missing pieces; it’s not clear whether these can be resolved. Many people endorse this or a similar vision as their current favored approach to alignment, so it would be extremely valuable to learn about dealbreakers as early as possible (whether to adjust the vision or abandon it).

Here’s the plan:

  • If you want to explain why this approach is doomed, explore a reason it may be doomed, or argue that it’s doomed, I strongly encourage you to do that.

  • Post a link to any relevant research/argument/evidence (a paper, blog post, repo, whatever) in the comments on this post.

  • The contest closes April 20.

  • You can submit content that was published before this prize was announced.

  • I’ll use some process to pick my favorite 1-3 contributions. This might involve delegating to other people or might involve me just picking. I make no promise that my decisions will be defensible.

  • I’ll distribute (at least) $10k amongst my favorite contributions.

If you think that some other use of this money or some other kind of research would be better for AI alignment, I encourage you to apply for funding to do that (or just to say so in the comments).

This prize is orthogonal and unrelated to the broader AI alignment prize. (Reminder: the next round closes March 31. Feel free to submit something to both.)

This contest is not intended to be “fair”: the ideas I’m interested in have not been articulated clearly, so even if they are totally wrong-headed it may not be easy to explain why. The point of the exercise is not to prove that my approach is promising because no one can prove it’s doomed. The point is just to have a slightly better understanding of the challenges.

Background on what I’m looking for

I’m most excited about particularly thorough criticism that either makes tight arguments or “plays both sides”: it points out a problem, explores plausible responses to the problem, and shows that natural attempts to fix the problem systematically fail.

If I thought I had a solution to the alignment problem, I’d be interested in highlighting any possible problem with my proposal. But that’s not the situation yet; I’m trying to explore an approach to alignment, and I’m looking for arguments that this approach will run into insuperable obstacles. I’m already aware that there are plenty of possible problems. So a convincing argument needs to establish a universal quantifier over potential solutions to a possible problem: that no natural way of addressing that problem can work.

On the other hand, I’m hoping that we’ll solve alignment in a way that knowably works under extremely pessimistic assumptions, so I’m fine with arguments that make weird assumptions or consider weird situations or adversaries.

Examples of interlocking obstacles that I think might totally kill my approach:

  • Amplification may be doomed because there are important parts of cognition that are too big to safely learn from a human, yet can’t be safely decomposed. (Relatedly, security amplification might be impossible.)

  • A clearer inspection of what amplification needs to do (e.g. building a competitive model of the world in which an amplified human can detect incorrigible behavior) may show that amplification isn’t getting around the fundamental problems that MIRI is interested in, and will only work if we develop a much deeper understanding of effective cognition.

  • There may be kinds of errors (or malign optimization) that are amplified by amplification and can’t be easily controlled (or this concern might be predictably hard to address in advance by theory+experiment).

  • Corrigibility may be incoherent, or may not actually be easy enough to learn, or may not confer the kind of robustness to prediction errors that I’m counting on, or may not be preserved by amplification.

  • Satisfying safety properties in the worst case (like corrigibility) may be impossible. See this post for my current thoughts on plausible techniques. (I’m happy to provisionally grant that optimization daemons would be catastrophic if you couldn’t train robust models.)

  • Informed oversight might be impossible even if amplification works quite well. (This is most likely to be impossible in the context of determining what behavior is catastrophic.)

I value objections but probably won’t have time to engage significantly with most of them. That said: (a) I’ll be able to engage in a limited way, and will engage with objections that significantly shift my view; (b) thorough objections can produce a lot of value even if no proponent publicly engages with them, since they can be convincing on their own; (c) in the medium term I’m optimistic about starting a broader discussion about iterated amplification that involves proponents other than me.

I think our long-term goal should be to find, for each powerful AI technique, an analog of that technique that is aligned and works nearly as well. My current work is trying to find analogs of model-free RL or AlphaZero-style model-based RL. I think that these are the most likely forms for powerful AI systems in the short term, that they are particularly hard cases for alignment, and that they are likely to turn up alignment techniques that are very generally applicable. So for now I’m not trying to be competitive with other kinds of AI systems.