[Question] Best reasons for pessimism about impact of impact measures?

Habryka recently wrote (emphasis mine):

My inside views on AI Alignment make me think that work on impact measures is very unlikely to result in much concrete progress on what I perceive to be core AI Alignment problems, and I have talked to a variety of other researchers in the field who share that assessment. I think it’s important that this grant not be viewed as an endorsement of the concrete research direction that Alex is pursuing, but only as an endorsement of the higher-level process that he has been using while doing that research.

As such, I think it was a necessary component of this grant that I have talked to other people in AI Alignment whose judgment I trust, who do seem excited about Alex’s work on impact measures. I think I would not have recommended this grant, or at least this large of a grant amount, without their endorsement. I think in that case I would have been worried about a risk of diverting attention from what I think are more promising approaches to AI Alignment, and a potential dilution of the field by introducing a set of (to me) somewhat dubious philosophical assumptions.

I’m interested in learning about the intuitions, experience, and facts which inform this pessimism. As such, I’m not interested in making any arguments to the contrary in this post; any pushback I provide in the comments will be with clarification in mind.

There are two reasons you could believe that “work on impact measures is very unlikely to result in much concrete progress on… core AI Alignment problems”. First, you might think that the impact measurement problem is intractable, so work on it is unlikely to make progress. Second, you might think that even a full solution wouldn’t be very useful.

Over the course of 5 minutes by the clock, here are the reasons I generated for pessimism (each of which I either presently agree with, or at least find it reasonable that an intelligent critic would raise, on the basis of currently-public reasoning):

  • Declarative knowledge of a solution to impact measurement probably wouldn’t help us do value alignment, figure out embedded agency, etc.

  • We want to figure out how to transition to a high-value stable future, and it just isn’t clear how impact measures help with that.

  • Competitive and social pressures incentivize people to cut corners on safety measures, especially those which add overhead.

    • Computational overhead.

    • Implementation time.

    • Training time, assuming they start with low aggressiveness and dial it up slowly.

  • Depending on how “clean” of an impact measure you think we can get, maybe it’s way harder to get low-impact agents to do useful things.

    • Maybe we can get a clean one, but only for powerful agents.

    • Maybe the impact measure misses impactful actions if the agent can’t predict at a near-human level.

  • In a world where we know how to build powerful AI but not how to align it (which is actually probably the scenario in which impact measures do the most work), we play a very unfavorable game while we use low-impact agents to somehow transition to a stable, good future: the first person to set the aggressiveness too high, or to discard the impact measure entirely, ends the game.

  • In a More realistic tales of doom-esque scenario, it isn’t clear how impact measures help prevent “gradually drifting off the rails”.
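For concreteness, the “aggressiveness” dial mentioned above can be pictured as the coefficient on an impact penalty term added to the task reward. This is a toy sketch of my own framing (loosely in the spirit of attainable-utility-preservation-style penalties), not a construction from the post; the function name and numbers are illustrative:

```python
def penalized_reward(task_reward: float, impact_penalty: float, lam: float) -> float:
    """Toy impact-penalized objective.

    lam >= 0 scales caution: higher lam means a less aggressive agent,
    and lam = 0 recovers the unpenalized (maximally aggressive) agent.
    """
    return task_reward - lam * impact_penalty

# Dialing aggressiveness up means shrinking lam. At lam = 0 the impact
# measure does no work at all, which is the "ends the game" failure
# mode above: one actor discarding the penalty forfeits its protection.
cautious = penalized_reward(1.0, 0.8, lam=2.0)    # net-negative: action forgone
aggressive = penalized_reward(1.0, 0.8, lam=0.0)  # full task reward: action taken
```

The point of the sketch is only that the safety the measure provides is a continuous knob under the operator’s control, so competitive pressure acts on it directly.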

Paul raised concerns along these lines:

We’d like to build AI systems that help us resolve the tricky situation that we’re in. That help design and enforce agreements to avoid technological risks, build better-aligned AI, negotiate with other actors, predict and manage the impacts of AI, improve our institutions and policy, etc.

I think the default “terrible” scenario is one where increasingly powerful AI makes the world change faster and faster, and makes our situation more and more complex, with humans having less and less of a handle on what is going on or how to steer it in a positive direction. Where we must rely on AI to get anywhere at all, and thereby give up the ability to choose where we are going.

That may ultimately culminate with a catastrophic bang, but if it does it’s not going to be because we wanted the AI to have a small impact and it had a large impact. It’s probably going to be because we have a very limited idea what is going on, but we don’t feel like we have the breathing room to step back and chill out (at least not for long) because we don’t believe that everyone else is going to give us time.

If I’m trying to build an AI to help us navigate an increasingly complex and rapidly-changing world, what does “low impact” mean? In what sense do the terrible situations involve higher objective impact than the intended behaviors?

(And realistically I doubt we’ll fail at alignment with a bang—it’s more likely that the world will just drift off the rails over the course of a few months or years. The intuition that we wouldn’t let things go off the rails gradually seems like the same kind of wishful thinking that predicts war or slow-rolling environmental disasters should never happen.)

It seems like “low objective impact” is what we need once we are in the unstable situation where we have the technology to build an AI that would quickly and radically transform the world, but we have all decided not to and so are primarily concerned about radically transforming the world by accident. I think that’s a coherent situation to think about and plan for, but we shouldn’t mistake it for the mainline. (I personally think it is quite unlikely, and it would definitely be unprecedented, though you could still think it’s the best hope if you were very pessimistic about what I consider “mainline” alignment.)