A possible training procedure for human-imitators

Some proposed ways to train a system to imitate a human involve having one system imitate a human, while another system tries to tell the first system apart from an actual human. If the first system can get classified as human by the second system, then (one might think) it is imitating a human well, as long as the second system is sufficiently smart.
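As a toy sketch of this adversarial setup (the bit biases and the likelihood-ratio discriminator below are illustrative assumptions, not part of any specific proposal): the imitator samples strings from its model of the human, and the discriminator tries to tell those samples apart from human-produced ones.

```python
import math
import random

N = 16           # string length
HUMAN_P = 0.7    # assumed bias of the "human" toward emitting '1'

def sample(p: float) -> str:
    """Sample an N-bit string where each bit is '1' with probability p."""
    return "".join("1" if random.random() < p else "0" for _ in range(N))

def log_lik(s: str, p: float) -> float:
    """Log-likelihood of string s under bias p."""
    return sum(math.log(p if c == "1" else 1.0 - p) for c in s)

def discriminator(s: str, imitator_p: float) -> float:
    """Probability the discriminator assigns to s being human-produced,
    here just a likelihood ratio between the two known bit biases."""
    return 1.0 / (1.0 + math.exp(log_lik(s, imitator_p) - log_lik(s, HUMAN_P)))

# A well-matched imitator (imitator_p == HUMAN_P) gets classified as human
# about half the time; a badly matched one is easily told apart.
for imitator_p in (0.7, 0.9):
    avg = sum(discriminator(sample(imitator_p), imitator_p)
              for _ in range(2000)) / 2000
    print(imitator_p, round(avg, 3))
```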

I described a possible problem with these approaches in a LessWrong thread:

Here’s a scenario that doesn’t seem completely implausible. Suppose Bob is someone whose public key is easily available on the internet. The first AI will read things on the internet and output a message. Some of the message will get put on the public internet. Bob suspects that the AI might have secretly sent him a message (e.g. giving him advice on which stocks to buy). So he tries using his private key to decrypt some of the AI’s output (e.g. the lowest-order bits in some images the AI has output).

Knowing that Bob (or someone else like Bob) will likely do this, the AI (which has access to Bob’s public key, due to reading the internet) encrypts a message with Bob’s public key, and puts the ciphertext in the lowest-order bits of some image. The ciphertext looks like random noise to the second AI. Maybe the plaintext contains both actual stock tips and some dangerous messages that trick Bob into running an unrestrained copy of the first AI (I’m assuming the AI is good at psychological manipulation).

As Paul says later in the thread, the underlying problem is that it is easy for the first AI to change its output in an important way without the second AI noticing.

Here’s a proposed way of implementing human-imitation while avoiding this problem. The human-imitator is a single AI that is given a prefix of a string produced by a human and predicts the next bit. It is rewarded for predicting the next bit accurately (using some proper scoring rule). We can use this system to imitate a human by sampling from its implied probability distribution over strings, bit by bit. If the system is very good at predicting the next bit in a human-produced string, then this implied probability distribution will be an accurate prediction of the human string.
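A minimal sketch of this scheme, assuming a hypothetical `predict_next_bit(prefix)` interface and using the log scoring rule as the proper scoring rule:

```python
import math
import random
from typing import Callable, List

# Hypothetical interface: a predictor maps a prefix of bits to the
# probability that the next bit is 1.
Predictor = Callable[[List[int]], float]

def log_score(predict_next_bit: Predictor, human_string: List[int]) -> float:
    """Log scoring rule (a proper scoring rule): reward for predicting each
    next bit of a human-produced string."""
    total = 0.0
    for i, bit in enumerate(human_string):
        p = predict_next_bit(human_string[:i])
        total += math.log(p if bit == 1 else 1.0 - p)
    return total

def imitate(predict_next_bit: Predictor, length: int) -> List[int]:
    """Imitate a human by sampling, bit by bit, from the distribution over
    strings that the predictor implies."""
    bits: List[int] = []
    for _ in range(length):
        bits.append(1 if random.random() < predict_next_bit(bits) else 0)
    return bits

# Example with a trivial predictor that always says "1 with probability 0.6".
constant_predictor = lambda prefix: 0.6
print(log_score(constant_predictor, [1, 0, 1, 1]))
print(imitate(constant_predictor, 8))
```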

Unfortunately, predicting the string bit-by-bit might be computationally harder than producing a string that is hard to tell apart from what a human would produce. Here’s an extension that makes the problem slightly easier.

We want some way to represent a distribution over strings. One way of representing a distribution over strings S is to introduce some auxiliary data A and a function f (mapping values of A to strings), and represent a distribution P over A. For example, if the model for generating strings is a PCFG, then the auxiliary data A is the actual parse tree. We could assume that P is represented by a bit-by-bit predictor (which takes a prefix of the data A and returns a probability for the next bit). In the PCFG example, this is easy; just order the data A in a “generative” order, and these bit-by-bit predictions will be easy.
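To make this concrete, here is a toy PCFG example (the grammar and the encoding of parse trees as bit sequences are illustrative assumptions): the auxiliary data A is the sequence of rule choices in generative order, f turns it into a string, and the bit-by-bit predictor over A is trivial.

```python
import random

# Toy PCFG with one nonterminal X and two rules:
#   X -> "a" X   (probability 0.5)
#   X -> "ab"    (probability 0.5)

def sample_parse_tree() -> list:
    """The auxiliary data A: the sequence of rule choices (1 = recurse,
    0 = terminate), listed in generative (top-down) order."""
    choices = []
    while True:
        c = random.randint(0, 1)
        choices.append(c)
        if c == 0:
            return choices

def f(parse_tree: list) -> str:
    """f maps the auxiliary data A (a parse tree) to the string S it yields."""
    return "a" * parse_tree.count(1) + "ab"

def predict_next_choice(prefix: list) -> float:
    """Bit-by-bit predictor over A: in generative order each rule choice is
    independent, so the prediction is easy (here, always 0.5)."""
    return 0.5

a = sample_parse_tree()
print(a, "->", f(a))
```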

Given the actual string S, it is possible to show a lower bound on this string's marginal probability under f(P) (the distribution over strings induced by sampling A from P and applying f) by importance sampling possible parse trees consistent with the given string. That is, the system will represent some distribution Q over A (after seeing the string S), which only outputs values a of A for which f(a) = S. This sampler can be used to estimate the total probability under P of values a for which f(a) = S.
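A sketch of the estimator, assuming hypothetical stand-ins `p_prob(a)` for P(a), `q_prob(a)` for Q(a), and `q_sample()` for drawing from Q (all of which can be implemented with the bit-by-bit generative models described above):

```python
from typing import Any, Callable

def estimate_marginal(p_prob: Callable[[Any], float],
                      q_sample: Callable[[], Any],
                      q_prob: Callable[[Any], float],
                      n_samples: int = 1000) -> float:
    """Importance-sampling estimate of the sum of P(a) over {a : f(a) = S},
    i.e. E_{a ~ Q}[P(a) / Q(a)] when Q only proposes values consistent with S.
    In expectation this lower-bounds the marginal probability, and equals it
    when Q's support covers every consistent a."""
    total = 0.0
    for _ in range(n_samples):
        a = q_sample()                  # consistent with S by construction of Q
        total += p_prob(a) / q_prob(a)
    return total / n_samples
```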

Note that this is only one method of estimating the probability. Another method is model-counting using hash functions.

So in order to predict a human-produced string, the AI:

  1. outputs a distribution P over A, as a bit-by-bit generative model

  2. sees the human-produced string S

  3. outputs the distribution Q over A, as a bit-by-bit generative model

  4. gets a score according to the estimated log-probability of the string S according to f(P), using Q for importance sampling

We could represent a bit-by-bit generative model as e.g. a split copy of the AI that outputs probabilities for the next bit given previous bits. Note that steps 2-4 only happen during training; during testing, you only need to run step 1 and then run the function f on the result.
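A sketch tying the steps together, under the same assumptions as above (hypothetical names, fixed-length A for simplicity; `model_p` and `model_q` stand for the two bit-by-bit generative models output in steps 1 and 3):

```python
import math
import random
from typing import Callable, List

BitModel = Callable[[List[int]], float]   # prefix of A -> Pr(next bit = 1)

def sample_from(model: BitModel, length: int) -> List[int]:
    """Sample auxiliary data A bit by bit (fixed length is a simplification)."""
    bits: List[int] = []
    for _ in range(length):
        bits.append(1 if random.random() < model(bits) else 0)
    return bits

def prob_under(model: BitModel, bits: List[int]) -> float:
    """Probability the bit-by-bit model assigns to a complete value of A."""
    prob = 1.0
    for i, b in enumerate(bits):
        p = model(bits[:i])
        prob *= p if b == 1 else 1.0 - p
    return prob

def training_score(model_p: BitModel, model_q: BitModel,
                   f: Callable[[List[int]], str], s: str,
                   a_length: int, n_samples: int = 100) -> float:
    """Steps 2-4: estimate log of the probability that f(A) = s under P,
    importance sampling from Q (which should only propose A consistent with s)."""
    total = 0.0
    for _ in range(n_samples):
        a = sample_from(model_q, a_length)
        if f(a) == s:                     # guard in case Q is imperfect
            total += prob_under(model_p, a) / prob_under(model_q, a)
    return math.log(max(total / n_samples, 1e-300))

def test_time_imitation(model_p: BitModel,
                        f: Callable[[List[int]], str],
                        a_length: int) -> str:
    """Step 1 plus applying f: no human string and no Q needed at test time."""
    return f(sample_from(model_p, a_length))
```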

This kind of method seems computationally easier than directly predicting a human-produced string bit-by-bit, but it might still be harder than the original adversarial proposal.