Hackable Rewards as a Safety Valve?

Read­ing Deep­mind’s lat­est re­search and ac­com­pa­ny­ing blog­post, I wanted to high­light an un­der-ap­pre­ci­ated as­pect of safety. As a bit of back­ground, Car­los Perez points out Josha Bach’s “Le­bowski the­o­rem,” which states that “no su­per­in­tel­li­gent AI is go­ing to bother with a task that is harder than hack­ing its re­ward func­tion.” Given that, I see a po­ten­tial per­verse effect of some types of al­ign­ment re­search—es­pe­cially re­search into em­bed­ded agency and ro­bust al­ign­ment which makes AI un­in­ter­ested in re­ward tam­per­ing. (Epistemic Sta­tus: my con­fi­dence in the ar­gu­ment is mod­er­ate, and I am more con­fi­dent in the ear­lier claims.)

In gen­eral, un­safe AI is far more likely to tam­per with its re­ward func­tion than to find more dis­tant (and ar­guably more prob­le­matic) ways to tam­per with the world to max­i­mize its ob­jec­tive. (epistemic sta­tus: fairly high con­fi­dence) Once an AI is smart enough to spend its time re­ward hack­ing, then wast­ing time on de­vel­op­ing greater in­tel­li­gence is un­needed. For that rea­son, this the­o­rem seems likely to func­tion as at least a mild safety valve. It’s only if we close this valve too tightly that we would plau­si­bly see ML that reached hu­man-level in­tel­li­gence. At that point, of course, we should ex­pect that the AI will be­gin to munchkin the sys­tem, just as a mod­er­ately clever hu­man would. And anti-munchkin-ing is a nar­row in­stance of se­cu­rity more gen­er­ally.

Se­cu­rity gen­er­ally is like cryp­tog­ra­phy nar­rowly in an im­por­tance sense; it’s easy to build a sys­tem that you your­self can’t break, but very challeng­ing to build one that oth­ers can­not ex­ploit. (Epistemic sta­tus: more spec­u­la­tive) This means that even if our best efforts go to­wards safety, an AI seems very un­likely to need more than “mild” su­per­in­tel­li­gence to break it—un­less it’s been so well al­igned that it doesn’t want to hack its ob­jec­tive func­tion.

This logic im­plies (Epistemic sta­tus: most spec­u­la­tive, still with some con­fi­dence) that mod­er­ate progress in AI safety is po­ten­tially far more dan­ger­ous than very lit­tle progress—and raises crit­i­cal ques­tions of how close to this un­safe un­canny valley we cur­rently are, and how wide the valley is.