Pursuing convergent instrumental subgoals on the user’s behalf doesn’t always require good priors

I recommend reading Scalable AI control before reading this post.

In particular, in the “Hard values, easy values” section of that post, Paul writes:

My statement of the control problem is only really meaningful because there are instrumental subgoals that are shared (or are extremely similar) between many different values, which let us compare the efficacy with which agents pursue those different values. Performance on these very similar subgoals should be used as the performance metric when interpreting my definition of the AI control problem.

In fact even if we only resolved the problem for the similar-subgoals case, it would be pretty good news for AI safety. Catastrophic scenarios are mostly caused by our AI systems failing to effectively pursue convergent instrumental subgoals on our behalf, and these subgoals are by definition shared by a broad range of values.

Convergent instrumental subgoals are mostly about gaining power. For example, gaining money is a convergent instrumental subgoal. If some individual (human or AI) has convergent instrumental subgoals pursued well on their behalf, they will gain power. If the most effective convergent instrumental subgoal pursuit is directed towards giving humans more power (rather than giving alien AI values more power), then humans will remain in control of a high percentage of power in the world.

If the world is not severely damaged in a way that prevents any agent (human or AI) from eventually colonizing space (e.g. severe nuclear winter), then the percentage of the cosmic endowment that humans have access to will be roughly equal to the percentage of power that humans control at the time of space colonization. So the most relevant factors for the composition of the universe are (a) whether anyone at all can take advantage of the cosmic endowment, and (b) the long-term balance of power between different agents (humans and AIs).

I expect that ensuring that the long-term balance of power favors humans constitutes most of the AI alignment problem, and that other parts of the AI alignment problem (e.g. ensuring AIs are beneficial in the short term, ensuring that AI systems don’t cause global catastrophic risks that make the cosmic endowment unavailable to any agent) will be easier to solve after thinking about this part of the problem. So I’m going to focus on power acquisition for now.

Priors and multidimensional power

Convergent instrumental subgoals aren’t totally convergent, since power is multidimensional, and some types of power are more useful for some values than for others.

Suppose the sun is going to turn either green or blue in 10 years. No one knows which color it will turn; people disagree, and it seems like their beliefs about the sun are irreconcilable because they result from different priors. The people who predict the sun will turn green (or, equivalently, care more about futures in which the sun is green) buy more green-absorbing solar panels, while those who predict the sun will turn blue buy more blue-absorbing solar panels. How could we measure how much power different people have?

In situations like this, it seems wrong to reduce power to a single scalar; there are at least two scalars involved (how much power someone has in futures where the sun turns green, versus in futures where the sun turns blue).

For the AI to gain power on the user’s behalf, it should gain the kind of power the user cares about. If the user thinks the sun will turn green, then the AI should buy green-absorbing solar panels.

What if the user hasn’t made up their mind about which color the sun will be, or it’s hard for the AI to elicit the user’s beliefs for some other reason? Then the AI could pursue a conservative strategy, in which the user does not lose power in either possible world. In the case of solar panels, if 60% of the solar panels that everyone buys absorb green, then the AI should invest 60% of the user’s solar panel budget in green-absorbing solar panels and 40% in blue-absorbing solar panels. This way, the user generates the same percentage of the energy in each possible world, and thus has the same amount of relative power. This is suboptimal compared to what the user could do with more definite beliefs about the sun, but the user isn’t any worse off than before, so this seems fine.
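To make the arithmetic concrete, here is a minimal sketch of that conservative strategy in Python. The panel types, market totals, and budget are illustrative stand-ins; only the 60/40 split comes from the example above.

```python
# A minimal sketch of the "conservative" allocation strategy described above.
# The market totals and budget are made-up numbers for illustration.

def conservative_allocation(budget, market_totals):
    """Split the user's budget across panel types in proportion to the
    market's overall holdings, so the user's *relative* share of generated
    energy is the same in every possible world."""
    total = sum(market_totals.values())
    return {panel: budget * amount / total
            for panel, amount in market_totals.items()}

# Everyone's existing purchases: 60% green-absorbing, 40% blue-absorbing.
market = {"green": 600, "blue": 400}
print(conservative_allocation(100, market))
# {'green': 60.0, 'blue': 40.0}
```

Whatever color the sun turns, the user then owns the same fraction of the relevant panel stock as they own of the market as a whole, so their relative power is unchanged in both futures.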

I think this is an important observation! It means that it isn’t always necessary for an AI system to have good priors about hard-to-verify facts (such as the eventual color of the sun), as long as it’s possible to estimate the effective priors of the agents whom the user wants to be competitive with. In particular, if there is some “benchmark” unaligned AI system, and it is possible to determine that AI system’s effective prior over facts like the color of the sun, then it should be possible to build an aligned AI system that uses a similar prior and is thereby competitive with the unaligned AI system in all possible futures.
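Here is a rough sketch of how that could work, under the strong assumption that the benchmark system’s effective prior can be read off from how it splits its resources between possible futures; the function names and numbers are hypothetical.

```python
# A rough sketch: estimate a benchmark system's effective prior from its
# observed allocations, then allocate the user's budget with the same prior.
# All names and numbers here are hypothetical.

def effective_prior(benchmark_allocation):
    """Treat the benchmark's relative spending on each possible world as an
    estimate of the weight it assigns to that world."""
    total = sum(benchmark_allocation.values())
    return {world: amount / total
            for world, amount in benchmark_allocation.items()}

def matched_allocation(budget, prior):
    """Spend the user's budget with the same effective prior, so the aligned
    system is competitive with the benchmark in every possible future."""
    return {world: budget * weight for world, weight in prior.items()}

benchmark_spend = {"green": 75, "blue": 25}  # observed benchmark purchases
prior = effective_prior(benchmark_spend)     # {'green': 0.75, 'blue': 0.25}
print(matched_allocation(100, prior))        # {'green': 75.0, 'blue': 25.0}
```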

This doesn’t only apply to priors; it also applies to things like discount rates (which are kind of like “priors about which times matter”) and preferences about which parts of the universe are best to colonize (which are kind of like “priors about which locations matter”). In general, it seems like “estimating what types of power a benchmark system will try to acquire and then designing an aligned AI system that acquires the same types of power for the user” is a general strategy for making an aligned AI system that is competitive with a benchmark unaligned AI system.
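As a sketch of how that general recipe might be organized, each “dimension of power” can be handled the same way as the prior over the sun’s color: estimate the benchmark’s mix and mirror it. The dimensions, options, and weights below are invented for illustration.

```python
# An illustrative sketch of the general strategy: for each dimension of power,
# estimate what mix the benchmark system pursues and have the aligned system
# pursue the same mix on the user's behalf. Everything here is invented.

benchmark_mix = {
    "sun_color": {"green": 0.7, "blue": 0.3},                 # effective prior
    "time":      {"near_term": 0.2, "long_term": 0.8},        # effective discount rate
    "location":  {"inner_system": 0.4, "outer_system": 0.6},  # colonization preference
}

def aligned_power_targets(benchmark_mix):
    """Adopt the benchmark's mix of power types along every dimension, so the
    aligned system is competitive with it in each kind of future."""
    return {dimension: dict(mix) for dimension, mix in benchmark_mix.items()}

print(aligned_power_targets(benchmark_mix))
```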