Risks from Approximate Value Learning

Solving the value learning problem is (IMO) the key technical challenge for AI safety.
How good or bad is an approximate solution?

EDIT for clarity:
By “approximate value learning” I mean something that does a good (but suboptimal from the perspective of safety) job of learning values. It may learn values well enough to behave well most of the time and to be useful for solving tasks, but it still has a non-trivial chance of developing dangerous instrumental goals, and is hence an x-risk.


1. How would developing good approximate value learning algorithms affect AI research/deployment?
It would enable more AI applications. For instance, many robotics tasks, such as “smooth grasping motion”, are hard to manually specify a utility function for. This could have positive or negative effects:

* It could encourage more mainstream AI researchers to work on value-learning.

* It could encourage more mainstream AI developers to use reinforcement learning to solve tasks for which “good-enough” utility functions can be learned.
Consider a value-learning algorithm which is “good-enough” to learn how to perform complicated, ill-specified tasks (e.g. folding a towel). But it’s still not quite perfect, and so every second, there is a 1/100,000,000 chance that it decides to take over the world. A robot using this algorithm would likely pass a year-long series of safety tests and seem like a viable product, but would be expected to decide to take over the world in ~3 years.
Without good-enough value learning, these tasks might just not be solved, or might be solved with safer approaches involving more engineering and less performance, e.g. using a collection of supervised learning modules and hand-crafted interfaces/heuristics.
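
To make the arithmetic in the towel-folding example concrete, here is a minimal sketch. It assumes each second is an independent trial with the same failure probability, which is my simplification rather than a claim about any real system; the function names are hypothetical.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # about 3.16e7 seconds


def expected_years_to_failure(p_per_second: float) -> float:
    """Mean waiting time (in years) until the first failure.

    Assumes a geometric distribution: independent seconds, each with
    the same failure probability p_per_second.
    """
    return (1.0 / p_per_second) / SECONDS_PER_YEAR


def prob_no_failure(p_per_second: float, years: float) -> float:
    """Probability of seeing no failure at all over `years` of operation."""
    seconds = years * SECONDS_PER_YEAR
    # (1 - p) ** seconds, computed via log1p to keep precision for tiny p
    return math.exp(seconds * math.log1p(-p_per_second))


if __name__ == "__main__":
    p = 1 / 100_000_000  # the per-second chance used in the example above
    print(f"expected time to failure: {expected_years_to_failure(p):.2f} years")
    print(f"P(passing a 1-year test): {prob_no_failure(p, 1.0):.2f}")
```

Under these assumptions the expected time to failure comes out a little over 3 years, and the chance of getting through a full year of testing with no incident is roughly 0.73, so a year of clean safety tests is what you would usually expect to see from such a system.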

2. What would a partially aligned AI do?
An AI programmed with an approximately correct value function might fail:
* dramatically (see, e.g., Eliezer on AIs “tiling the solar system with tiny smiley faces”)
* relatively benignly (see, e.g., my example of an AI that doesn’t understand gustatory pleasure)

Perhaps a more significant example of benign partial-alignment would be an AI that has not learned all human values, but is corrigible and handles its uncertainty about its utility in a desirable way.