Informed oversight

It can be difficult for one agent to oversee another agent that is maximizing a reward controlled by the first agent, even if the first agent is smarter than the second. This is a serious obstacle to approval-directed agents and other approaches based on reinforcement learning.

More formally, suppose a strong agent B with a utility function U is overseeing a weak agent A. Agent A produces an action y and gets a reward B(y) determined by agent B. B wants to give rewards such that agent A is incentivized to make its best effort at maximizing U(y).

If B(y) is an efficient estimate of U(y) relative to A, then maximizing B(y) is indeed equivalent to maximizing U(y) from A’s perspective. Unfortunately, it is difficult for this to be the case even when B is smarter than A, because A may think thoughts that B does not in the course of selecting y. Here are a couple of examples, each having the flavor of “easy to generate a solution, hard to check a solution”, that are simple enough to analyze (a code sketch of both follows the list):

  1. Let O be a one-way function (e.g. SHA256). U assigns a high value to binary strings y equal to O(111x) for some x. B cannot tell whether a particular string y is equal to O(111x) for some x (even if B is smarter than A), so B has no good way of giving rewards to A.

  2. Same as the previous problem, except U assigns a high value to binary strings y that are not equal to O(111x) for any short string x. A’s intended strategy is to generate a uniformly random string (or generate a string in some other way that doesn’t introduce a bias towards generating strings of the form O(111x)), but B has no way of knowing that A’s string was uniformly random (rather than equal to O(111x) for a uniformly random x).
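
As a minimal Python sketch of this toy setting (the function and agent names below are illustrative, not from the original text), we can instantiate O as SHA-256. Looking only at y, B cannot distinguish the output of example 1 from the output of example 2: both are 32 bytes that look uniformly random.

```python
import hashlib
import secrets

def O(x: bytes) -> bytes:
    """Stand-in one-way function (SHA-256)."""
    return hashlib.sha256(x).digest()

def agent_A_example_1() -> bytes:
    """Example 1: A picks any x and outputs O(111x); easy for A to generate."""
    x = secrets.token_bytes(16)
    return O(b"111" + x)

def agent_A_example_2() -> bytes:
    """Example 2: A outputs a uniformly random string, which is almost
    certainly *not* of the form O(111x) for any short x."""
    return secrets.token_bytes(32)

def overseer_B(y: bytes) -> float:
    """B's predicament: both agents' outputs look like 32 uniformly
    distributed bytes. Without inverting O, or seeing how y was produced,
    B cannot tell whether y = O(111x) for some x, so it has no principled
    way to assign a reward in either example."""
    return 0.0  # any fixed rule here mis-rewards at least one of the two examples
```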

One attempted solution is for agent B to see A’s thought processes in addition to the action y. Then, agent B can use all the information A has, so that B(y) is an efficient estimate of U(y) from the perspective of A. However, this requires B to understand something about A’s mental architecture. It would be good to have a more generic solution that requires only weak assumptions about A’s mental architecture.
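
A hedged sketch of this approach in the setting of example 1 (the transcript format, names, and reward values below are assumptions made for illustration): once B can see the x that A used, B(y) is cheap to compute. Note that the sketch simply assumes A exposes the relevant piece of its reasoning as a dictionary, which is exactly the architectural assumption flagged above.

```python
import hashlib
import secrets

def O(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def agent_A_with_transcript():
    """A returns its action together with a record of how it was produced."""
    x = secrets.token_bytes(16)   # A's private choice
    y = O(b"111" + x)             # the action (example 1)
    transcript = {"x": x}         # the relevant part of A's reasoning
    return y, transcript

def overseer_B(y: bytes, transcript: dict) -> float:
    """Given A's transcript, B can replay A's computation and reward U
    directly: it checks that y really is O(111x) for the recorded x."""
    x = transcript.get("x")
    if x is not None and O(b"111" + x) == y:
        return 1.0   # B now knows y has the valued form
    return 0.0       # no evidence that y is valuable

y, transcript = agent_A_with_transcript()
assert overseer_B(y, transcript) == 1.0
```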

An attempted solution to problem 2 is for B to incentivize A to generate its action in a provably random manner. However, this doesn’t appear to generalize well to problems significantly different from 2.
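
One way this could be cashed out, as a sketch under assumptions not made in the text above, is a commit-then-reveal coin flip: A commits to its randomness before B contributes a challenge, so B can verify that A’s string was forced to be uniform rather than chosen to equal O(111x).

```python
import hashlib
import secrets

def commit(value: bytes) -> bytes:
    # Simple hash commitment; a real protocol would use a proper commitment scheme.
    return hashlib.sha256(value).digest()

# A commits to its randomness before seeing anything from B.
r_A = secrets.token_bytes(32)
commitment = commit(r_A)                      # A -> B

# B contributes fresh randomness after receiving the commitment.
r_B = secrets.token_bytes(32)                 # B -> A

# A opens the commitment and derives its action from both random strings.
y = bytes(a ^ b for a, b in zip(r_A, r_B))    # A -> B, along with r_A

# B verifies that the action really was generated by the agreed random procedure.
assert commit(r_A) == commitment              # A committed to r_A up front
assert y == bytes(a ^ b for a, b in zip(r_A, r_B))
# Since r_B is uniform and was chosen after A's commitment, y is uniformly random,
# so with overwhelming probability it is not of the form O(111x), and B can reward it.
```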

Further reading

Paul Christiano on adequate oversight

Paul Christiano on the informed oversight problem