Imitation learning considered unsafe?

This post states an observation which I think a number of people have had, but which hasn't been written up (AFAIK). I find it one of the more troubling outstanding issues with a number of proposals for AI alignment.

1) Training a flexible model with a reasonable simplicity prior to imitate human decisions (e.g. via behavioral cloning) should presumably yield a good approximation of the process by which human judgments arise, which involves a planning process.
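For concreteness, behavioral cloning in its simplest (tabular) form just fits the demonstrator's state-to-action mapping by maximum likelihood. A minimal sketch (the states, actions, and demonstrations are made up for illustration):

```python
from collections import Counter, defaultdict

def behavioral_cloning(demos):
    """Fit a tabular policy by maximum likelihood over demonstrations.

    demos: list of (state, action) pairs from the demonstrator.
    Returns a dict mapping each observed state to its most frequent action.
    """
    counts = defaultdict(Counter)
    for state, action in demos:
        counts[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

# Hypothetical demonstrations of a human's choices in a few situations.
demos = [("red_light", "stop"), ("red_light", "stop"),
         ("green_light", "go"), ("red_light", "go")]
policy = behavioral_cloning(demos)
print(policy["red_light"])   # majority action among the demos: "stop"
```

The tabular fit only memorizes surface behavior; the worry in this post is precisely that a *flexible* model with a simplicity prior would instead internalize something like the planning process that generated the demonstrations.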

2) We shouldn't expect to learn exactly the correct process, though.

3) Therefore imitation learning might produce an AI which implements an unaligned planning process, which seems likely to have instrumental goals, and to be dangerous.

Example: The human might be doing planning over a bounded horizon of time-steps, or with a bounded utility function, and the AI might infer a version of the planning process that doesn't bound horizon or utility.
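A toy illustration of how horizon mis-inference changes behavior (the MDP, states, and rewards are entirely made up): the same deterministic environment, planned at horizon 2 versus horizon 10 (a stand-in for an effectively unbounded horizon), yields different actions.

```python
# Toy deterministic MDP: transitions[state][action] = (reward, next_state).
# "cash" pays 1 immediately; "invest" pays nothing for two steps,
# then unlocks a state paying 5 per step forever.
T = {
    "start": {"cash": (1, "start"), "invest": (0, "mid")},
    "mid":   {"wait": (0, "rich")},
    "rich":  {"collect": (5, "rich")},
}

def plan_value(state, horizon):
    """Value of optimal play from `state` with `horizon` steps remaining."""
    if horizon == 0:
        return 0
    return max(r + plan_value(s2, horizon - 1) for r, s2 in T[state].values())

def best_action(state, horizon):
    """Greedy first action of an optimal bounded-horizon plan."""
    return max(T[state],
               key=lambda a: T[state][a][0] + plan_value(T[state][a][1], horizon - 1))

print(best_action("start", 2))    # -> "cash": the short-horizon planner never invests
print(best_action("start", 10))   # -> "invest": the long-horizon planner does
```

If the human demonstrator plans at horizon 2, their demonstrations never contain "invest", yet a model that fits the planning process but mis-infers the horizon would act differently from the human in exactly the long-horizon situations where the stakes are highest.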

Clarifying note: Imitating a human is just one example; the key feature of the human is that the process generating their decisions is (arguably) well-modeled as involving planning over a long horizon.

That said, there are reasons to think this might be less of a problem in practice:

  • The human may have privileged access to context informing their decision; without that context, the solution may look very different

  • Mistakes in imitating the human may be relatively harmless; the approximation may be good enough

  • We can restrict the model family with the specific intention of preventing planning-like solutions

Overall, I have a significant amount of uncertainty about how serious this issue is, and I would like to see more thought regarding it.