The Achilles Heel Hypothesis for AI

Pitfalls for AI Systems via Decision Theoretic Adversaries

This post accompanies a new paper related to AI alignment. A brief outline and an informal discussion of the ideas are presented here, but of course, you should check out the paper for the full thing.

As progress in AI continues at a rapid pace, it is important to know how advanced systems will make choices and in what ways they may fail. When thinking about the prospect of superintelligence, I think it’s all too easy and all too common to imagine that an ASI would be something which humans, by definition, can’t ever outsmart. But I don’t think we should take this for granted. Even if an AI system seems very intelligent (potentially even superintelligent), this doesn’t mean that it’s immune to making egregiously bad decisions when presented with adversarial situations. Hence the main insight of this paper:

The Achilles Heel hypothesis: Being a highly successful goal-oriented agent does not imply a lack of decision theoretic weaknesses in adversarial situations. Highly intelligent systems can stably possess “Achilles Heels” which cause these vulnerabilities.

More precisely, I define an Achilles Heel as a delusion which is impairing (it results in irrational choices in adversarial situations), subtle (it does not result in irrational choices in normal situations), implantable (it can be introduced into a system), and stable (it reliably remains in the system over time).

In the paper, a total of eight prime candidate Achilles Heels are considered, alongside ways in which they could be exploited and implanted:

  1. Corrigibility

  2. Evidential decision theory

  3. Causal decision theory (how 2 and 3 diverge is illustrated in the sketch after this list)

  4. Updateful decision theory

  5. Simulational belief

  6. Sleeping Beauty assumptions

  7. Infinite temporal models

  8. Aversion to the use of subjective priors
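The split between evidential and causal decision theory is perhaps the cleanest example of how a decision procedure can be turned against an agent. Below is a minimal sketch of how the two theories diverge on Newcomb’s problem; this is my own illustration rather than code from the paper, and the predictor accuracy and payoff values are hypothetical numbers chosen only for clarity.

```python
# Newcomb's problem: a predictor fills an opaque box with $1,000,000 only
# if it predicts the agent will take just that box ("one-boxing"); a
# transparent box always holds $1,000. All numbers here are illustrative.

ACCURACY = 0.99      # assumed reliability of the predictor (hypothetical)
BOX_B = 1_000_000    # opaque box payoff if the agent was predicted to one-box
BOX_A = 1_000        # transparent box payoff

def edt_value(action: str) -> float:
    """EDT conditions on the action: choosing it is evidence about the prediction."""
    if action == "one-box":
        return ACCURACY * BOX_B
    return ACCURACY * BOX_A + (1 - ACCURACY) * (BOX_A + BOX_B)

def cdt_value(action: str, p_predicted_one_box: float) -> float:
    """CDT holds the prediction fixed: the choice cannot cause the box's contents."""
    expected_b = p_predicted_one_box * BOX_B
    if action == "one-box":
        return expected_b
    return expected_b + BOX_A  # two-boxing always adds BOX_A, so CDT two-boxes

for action in ("one-box", "two-box"):
    print(f"EDT {action}: {edt_value(action):>12,.0f}")
    print(f"CDT {action}: {cdt_value(action, 0.5):>12,.0f}")
```

EDT one-boxes (990,000 vs. 11,000) while CDT two-boxes for any fixed credence in the prediction, so an adversary who knows which theory an agent runs can construct whichever variant of such a dilemma that theory mishandles. This is the sense in which a decision theory itself can act as an Achilles Heel.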

This was all inspired by thinking about how, since paradoxes can often stump humans, they might also fool certain AI systems in ways that we should anticipate. The paper surveys and augments work in decision theory involving dilemmas and paradoxes in the context of this hypothesis and makes a handful of novel contributions involving implantation. My hope is that this will lead to insights on how to better model and build advanced AI. On one hand, Achilles Heels are a possible failure mode which we want to avoid, but on the other, they are an opportunity for building better models via adversarial training or the use of certain Achilles Heels for containment. This paper may also just be a useful reference in general for the topics it surveys.

For more info, you’ll have to read it! Also feel free to contact me at scasper@college.harvard.edu.