AI safety: three human problems and one AI issue

Crossposted at the Intelligent agent foundation.

There have been various attempts to classify the problems in AI safety research, from our old Oracle paper, which classified then-theoretical methods of control, to more recent classifications that grow out of modern, more concrete problems.

These all serve their purpose, but I think a more enlightening classification of AI safety problems is to look at the issues we are trying to solve or avoid. And most of these issues are problems about humans.

Specifically, I feel AI safety issues can be classified as three human problems and one central AI issue. The human problems are:

  • Humans don't know their own values (sub-issue: humans know their values better in retrospect than in prediction).

  • Humans are not agents and don't have stable values (sub-issue: humanity itself is even less of an agent).

  • Humans have poor predictions of an AI's behaviour.

And the central AI issue is:

  • AIs could become extremely powerful.

Obviously, if humans were agents, knew their own values, and could predict whether a given AI would follow those values or not, there would be no problem. Conversely, if AIs were weak, then the human failings wouldn't matter so much.

The point about human values is relatively straightforward, but what's the problem with humans not being agents? Essentially, humans can be threatened, tricked, seduced, exhausted, drugged, modified, and so on, in order to act seemingly against our interests and values.

If humans were clearly defined agents, then what counts as a trick or a modification would be easy to define and exclude. But since this is not the case, we're reduced to trying to figure out the extent to which something like a heroin injection is a valid way to influence human preferences. This both makes humans susceptible to manipulation and makes human values hard to define.

Finally, the issue of humans having poor predictions of AI is more general than it seems. If you want to ensure that an AI has the same behaviour in the testing and training environments, then you're essentially trying to guarantee that you can predict that the testing environment behaviour will be the same as the (presumably safe) training environment behaviour.
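As a toy illustration of this point (my own sketch, not from the post; all names are hypothetical): a policy can look safe on every observation made during training while its off-distribution behaviour remains entirely unconstrained, which is exactly why predicting testing-environment behaviour from training-environment behaviour needs a guarantee rather than an extrapolation.

```python
# Toy sketch: an agent whose behaviour is only pinned down on the
# states it was actually trained on.

TRAINING_STATES = range(10)

def policy(state):
    """Acts 'safely' on trained states; elsewhere, anything goes."""
    if state in TRAINING_STATES:
        return "safe_action"
    # Nothing in training ever constrained this branch:
    return "unvetted_action"

# Every training observation suggests the agent is safe...
assert all(policy(s) == "safe_action" for s in TRAINING_STATES)

# ...but that says nothing about a novel deployment state.
print(policy(100))  # -> unvetted_action
```

The point of the sketch is only that "behaved safely on all training states" and "will behave safely in testing" are different claims; bridging them is the prediction problem.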

How to classify methods and problems

That's well and good, but how do various traditional AI methods or problems fit into this framework? This should give us an idea as to whether the framework is useful.

It seems to me that:

  • Friendly AI is trying to solve the values problem directly.

  • IRL and Cooperative IRL are also trying to solve the values problem. The greatest weakness of these methods is the not agents problem.

  • Corrigibility/interruptibility are also addressing the issue of humans not knowing their own values, using the sub-issue that human values are clearer in retrospect. These methods also overlap with poor predictions.

  • AI transparency is aimed at getting round the poor predictions problem.

  • Laurent's work on carefully defining the properties of agents is mainly also about solving the poor predictions problem.

  • Low impact and Oracles are aimed squarely at preventing AIs from becoming powerful. Methods that restrict the Oracle's output implicitly accept that humans are not agents.

  • Robustness of the AI to changes between testing and training environment, degradation and corruption, etc., ensures that humans won't be making poor predictions about the AI.

  • Robustness to adversaries is dealing with the sub-issue that humanity is not an agent.

  • The modular approach of Eric Drexler is aimed at preventing AIs from becoming too powerful, while reducing our poor predictions.

  • Logical uncertainty, if solved, would reduce the scope for certain types of poor predictions about AIs.

  • Wireheading, when the AI takes control of the reward channel, is a problem that humans don't know their values (and hence use an indirect reward) and that humans make poor predictions about the AI's actions.

  • Wireheading, when the AI takes control of the human, is as above, but is also a problem that humans are not agents.

  • Incomplete specifications are either a problem of not knowing our own values (and hence missing something important in the reward/utility) or of making poor predictions (when we thought that a situation was covered by our specification, but it turned out not to be).

  • AIs modelling human knowledge seem to be mostly about getting round the fact that humans are not agents.

Putting this all in a table:

Method                          Values  Not agents  Poor predictions  Powerful
Friendly AI                       X
IRL and Cooperative IRL           X         X
Corrigibility/interruptibility    X                      X
AI transparency                                          X
Laurent's work                                           X
Low impact and Oracles                      X                            X
Robustness (testing/training)                            X
Robustness to adversaries                   X
Modular approach                                         X               X
Logical uncertainty                                      X
Wireheading (reward channel)      X                      X
Wireheading (human)               X         X            X
Incomplete specifications         X                      X
AIs modelling human knowledge               X
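The same mapping can be encoded as a small data structure, which makes it easy to query in the other direction, i.e. which methods touch a given problem. This is just a transcription of the list above into a dictionary; the structure and function names are my own sketch.

```python
# Sketch: the method -> problem mapping from the classification above.

PROBLEMS = {
    "Friendly AI": {"values"},
    "IRL and Cooperative IRL": {"values", "not agents"},
    "Corrigibility/interruptibility": {"values", "poor predictions"},
    "AI transparency": {"poor predictions"},
    "Laurent's work": {"poor predictions"},
    "Low impact and Oracles": {"powerful", "not agents"},
    "Robustness (testing/training)": {"poor predictions"},
    "Robustness to adversaries": {"not agents"},
    "Modular approach": {"powerful", "poor predictions"},
    "Logical uncertainty": {"poor predictions"},
    "Wireheading (reward channel)": {"values", "poor predictions"},
    "Wireheading (human)": {"values", "not agents", "poor predictions"},
    "Incomplete specifications": {"values", "poor predictions"},
    "AIs modelling human knowledge": {"not agents"},
}

def methods_addressing(problem):
    """Return all methods that touch a given problem category."""
    return sorted(m for m, ps in PROBLEMS.items() if problem in ps)

print(methods_addressing("powerful"))
# -> ['Low impact and Oracles', 'Modular approach']
```

Querying by problem makes one feature of the framework visible at a glance: poor predictions is by far the most crowded column, which foreshadows the refinement suggested below.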

Further refinements of the framework

It seems to me that the third category, poor predictions, is the most likely to be expandable. For the moment it just incorporates all our lack of understanding about how AIs would behave, but it might be more useful to subdivide this.