A possible training procedure for human-imitators

Some proposed ways to train a system to imitate a human imitate a human involve having one system imitate a human, while another system tries to tell the first system apart from an actual human. If the first system can get classified as human by the second system, then (one might think) it is imitating a human well, as long as the second system is sufficiently smart.

I described a possible problem with these approaches in a LessWrong thread:

Here’s a scenario that doesn’t seem completely implausible. Suppose Bob is someone whose public key is easily available on the internet. The first AI will read things on the internet and output a message. Some of the message will get put on the public internet. Bob suspects that the AI might have secretly sent him a message (e.g. giving him advice on which stocks to buy). So he tries using his private key to decrypt some of the AI’s output (e.g. the lowest-order bits in some images the AI has output).

Knowing that Bob (or someone else like Bob) will likely do this, the AI (which has access to Bob’s public key, due to reading the internet) encrypts a message with Bob’s public key, and puts the ciphertext in the lowest-order bits of some image. The ciphertext looks like random noise to the second AI. Maybe the plaintext contains both actual stock tips and some dangerous messages that trick Bob into running an unrestrained copy of the first AI (I’m assuming the AI is good at psychological manipulation).

As Paul says later in the thread, the underlying problem is that it is easy for the first AI to change its output in an important way without the second AI noticing.

Here’s a proposed way of implementing human-imitation while avoiding this problem. The human-imitator is a single AI that is given a prefix of a string produced by a human and predicts the next bit. It is rewarded for predicting the next bit accurately (using some proper scoring rule). We can use this system to imitate a human by sampling from its implied probability distribution over strings, bit by bit. If the system is very good at predicting the next bit in a human-produced string, then this implied probability distribution will be an accurate prediction of the human string.

Unfortunately, predicting the string bit-by-bit might be computationally harder than producing a string that is hard to tell apart from what a human would produce. Here’s an extension that makes the problem slightly easier.

We want some way to represent a distribution over strings. One way of representing a distribution over strings S is to introduce some auxiliary data A and function $f : A \to S$ , and represent a distribution $p$ over A. For example, if the model for generating strings is a PCFG, then the auxiliary data A is the actual parse tree. We could assume that $p$ is represented by a bit-by-bit predictor (which takes a prefix of the data A and returns a probability for the next bit). In the PCFG example, this is easy; just order the data A in a “generative” order, and these bit-by-bit predictions will be easy.

Given the actual string S, it is possible to show a lower bound on this string’s marginal probability under $p$ by importance sampling possible parse trees consistent with a given string. That is, the system will represent some distribution $q$ over A (after seeing the string S), which only outputs A values for which $f (A) = S$ . This sampler can be used to estimate the total probability under $p$ of $A$ values for which $f (A) = S$ .

Note that this is only one method of estimating the probability. Another method is model-counting using hash functions.

So in order to predict a human-produced string, the AI:

outputs a distribution $p$ over A, as a bit-by-bit generative model
sees the human-produced string S
outputs the distribution $q$ over A, as a bit-by-bit generative model
gets a score according to the estimated log-probability of the string S according to $p$ , using $q$ for importance sampling

We could represent a bit-by-bit generative model as e.g. a split copy of the AI that outputs probabilities for the next bit given previous bits. Note that steps 2-4 only happen during training; during testing, you only need to run step 1 and then run the function $f$ on the result.

This kind of method seems computationally easier than directly predicting a human-produced string bit-by-bit, but it might still be harder than the original adversarial proposal.