The AI That Pretends To Be Human

The hard part about containing AI, is restricting it’s output. The AI can lie, manipulate, and trick. Some speculate that it might be able to do far worse, inventing infohazards like hypnosis or brain hacking.

A major goal of the control problem is preventing AIs from doing that. Ensuring that their output is safe and useful.

Awhile ago I wrote about an approach to do this. The idea was to require the AI to use as little computing power as it needed to perform a task. This prevents the AI from over-optimizing. The AI won’t use the full power of superintelligence, unless it really needs it.

The above method isn’t perfect, because a superintelligent AI may still have super abilities, even when it has a limited compute budget. And it will still try to do all those bad things mentioned above, unless it can find a better way that uses less computing power.

We don’t really want to minimize the AIs compute budget. We just want it to act more human-like. When we ask it to solve a problem, we want it to try to solve the problem as a human would solve it. All of the problems arise from the AI behaving totally different than a human. Not necessarily from it using more computing power.

Is there any way to make an AI act like a human? Or is that as impossible as just solving FAI?

Stuart Armstrong posted an idea on how to achieve that. It involves an AI that tries to mimic output that is indistinguishable from what a human would produce, but conditional on it also solving the problem. However his idea also involves whole brain emulation and a complicated method for the AI to predict the future and avoid changing it too much.

We don’t need that. The idea for the AI to try to be indistinguishable from a human is actually a really interesting control idea. It significantly limits the AI’s ability to Do Sneaky Things, since humans don’t do those things.

Here is my idea on how to achieve this. It’s the same as my limited computing power idea mentioned earlier, but applied to “appearing human” as a goal, instead of minimizing computing power.

We start with two AIs. The first AI tries to appear human. The second AI does a sort of Turing test, and tries to tell the AI apart from an actual human.

The human AI’s first goal is to produce a correct solution. It will predict how likely it’s solution is to be accepted. And it will try to produce a solution that has a >50% chance of working. This is a satisficing goal. That is, it doesn’t optimize without end to create the best solution possible, it just has to meet some minimum threshold, then stop.

It’s second goal will be to fool the second AI. It will do everything in it’s power to make the solution appear like it was produced by a human.

The second AI tries to predict how likely it’s input was produced by a human. It is also conditioned on the input being a solution to the problem, so the fact the AI solved a problem that was really hard for humans does not count against it.

This prevents the first AI from doing evil things with it’s output. If it tries to insert complicated infohazards or subagents into it’s output stream, it will be easily detected as an AI. Instead it needs to mimic humans as closely as possible.

Note that the first AI only cares about the outcome of this test. The test doesn’t need to actually be performed, it can just be a threat that it is conditioned on. (The ability to condition AIs on acting as if they were in counterfactual worlds is a critical piece of this, and deserves it’s own writeup somewhere. It’s an important tool in the FAI toolkit. It’s a relatively simple method that can work even on current AIs.)

Another point is the second AI should have more computing power than the first AI. That way the first AI can’t get a slight advantage over it and be able to hack it or manipulate it somehow. Instead it has to focus it’s efforts on being indistinguishable from a human, even by an agent smarter than itself.

This method also relies on AI boxing. The first AI needs to be contained in a box so that it’s only output goes to a text terminal or the second AI. However it doesn’t require any of the more complicated machinery required for other reduced impact methods.

The AIs also aren’t agents, in the sense they don’t need to maximize any long term utility function. The first AI only cares about answering the question it has been asked, and the second AI only does prediction.


An example use of this is to create an oracle AI that answers your questions. The human judges somehow whether the answer is correct, or at least good. The AI tries to produce answers which the human judge will probably like, but which also are indistinguishable from answers humans would produce.

Such an oracle would be tremendously useful. The human could ask the AI to produce new AI and FAI papers, which would help immensely speed up our progress towards true FAI.

Similar techniques could be applied to very different domains. For example controlling a robot. The AI will try to produce outputs that look like a human is controlling it.

This would require some modifications to the above method, so that the AI doesn’t try to use the robot to hack it’s own brain or anything. This could be done by conditioning it on a counterfactual world where it’s outputs are going to the second AI rather than a real robot.

Another point is the AI doesn’t need to mimic average humans given average resources. It could be conditioned on the human having had tons of time to come up with an answer. E.g. producing an answer that a human would have come up with given a year. Or controlling the robot the same way as a human given tons of time to practice, or in a very slowed down simulation.


I would like to note a parallel with a method in current AI research, Generative Adversarial Networks. Generative Adversarial Networks work by two AIs, one which tries to produce an output that fools the second AI, and the other which tries to predict which samples were produced by the first AI, and which are part of the actual distribution.

It’s quite similar to this. GANs have been used successfully to create images that look like real images, which is a hard problem in AI research. In the future GANs might be used to produce text that is indistinguishable from human (the current method for doing that, by predicting the next character a human would type, is kind of crude.)

Reposted from my blog.