A Safer Oracle Setup?

Edit: I no longer think this post deserves attention.

Edit: Invoking a black-box agent is probably a very bad idea unless there’s a consensus in the AI safety community it should be done.

This post describes a setup that allows asking a super-intelligent agent a question, and getting an answer that humans with an unbounded amount of time could come up with (Edit: actually, if the question is “too easily distinguishable” from questions we already know the answer to, it’s plausible we won’t get a useful answer). The setup is not restricted to questions with a single or precise answer, and there is no restriction on the length of the answer. This setup might be safer than the naive one—let me know what you think (and let me know if it’s all already covered by existing work).

The Setup

First, we build a big dataset of (question, answer) pairs:

The questions are of the following form: “Limited to 100 words, what might humans answer in 2018 when asked how people in 1950 could most effectively improve their wellbeing?” (Edit: this is a bad example, see first edit)

The answer in this example would be a 100-word text that might contain the phrase "stop smoking". The questions and answers can be written by any "regular" crowd of people (e.g. MTurk workers).
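As a concrete picture, here is a minimal sketch of what one entry of such a dataset might look like (the structure is my own illustration, not part of the setup):

```python
# Illustrative sketch only: one (question, answer) pair from the dataset.
dataset = [
    (
        "Limited to 100 words, what might humans answer in 2018 when asked "
        "how people in 1950 could most effectively improve their wellbeing?",
        "Stop smoking, exercise regularly, ...",  # a crowd-written ~100-word answer
    ),
    # ... many more crowd-sourced pairs (e.g. written by MTurk workers)
]
```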

We then create a super-intelligent agent, denoted X, that has the entire dataset written in memory at invocation time. The agent X is in a box and has no input channel. The goal of X is to find an output string that is at most n characters long (where n is also written in X's memory at invocation time). The string is the code of another agent, denoted Y.

Agent Y also has no input channel; when it is invoked, some question q should be written in its memory. Its output is a probability distribution over all possible answers to q (we assume that an upper bound on the answer length is defined by the question).
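To make the two interfaces concrete, here is a hedged sketch in Python; the function names (`invoke_x`, `invoke_y`) and types are purely illustrative assumptions, not part of the setup itself:

```python
from typing import Dict, List, Tuple

def invoke_x(dataset: List[Tuple[str, str]], n: int) -> str:
    """Hypothetical interface for agent X.

    X has the dataset and n written in memory at invocation time, has no
    input channel, and outputs a string of at most n characters: the code
    of agent Y.
    """
    raise NotImplementedError

def invoke_y(question: str) -> Dict[str, float]:
    """Hypothetical interface for agent Y.

    Y has the question written in memory at invocation time, has no input
    channel, and outputs a probability distribution over all possible
    answers (whose maximum length is defined by the question itself).
    """
    raise NotImplementedError
```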

The reward of X is the log-likelihood of the dataset's answers under the outputs of Y, meaning:

$$R = \sum_{i=1}^{N} \log p_Y(a_i \mid q_i)$$

where $(q_i, a_i)$ ranges over the $N$ pairs in the dataset, and $p_Y(a_i \mid q_i)$ is the probability that the output of Y, when invoked with question $q_i$ in memory, assigns to the answer $a_i$.
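In code, the reward could be computed roughly as follows (reusing the hypothetical `invoke_y` interface sketched above; this is a sketch, not a definitive implementation):

```python
import math
from typing import Callable, Dict, List, Tuple

def reward_of_x(dataset: List[Tuple[str, str]],
                invoke_y: Callable[[str], Dict[str, float]]) -> float:
    """Sum over the dataset of the log-probability that Y assigns to each answer."""
    total = 0.0
    for question, answer in dataset:
        distribution = invoke_y(question)
        p = distribution.get(answer, 0.0)
        total += math.log(p) if p > 0.0 else float("-inf")
    return total
```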

We then run the following algorithm:

  1. Create an agent X with our dataset and n in memory, and invoke it.

  2. Take the output of X, which is the code of an agent Y, and invoke Y, each time with a new question (not from the original dataset) written in its memory. The questions should be ones that are important to us and that we don't know the answer to, but that future people would. For example: "Limited to 100 words, what might humans answer in 2100 when asked how people in 2018 could most effectively improve their wellbeing?" (Edit: this is a bad example, see first edit)

  3. If the answers with the highest probability according to the outputs of Y don't seem useful, then increase n by 10% and go back to step 2. The dataset must be large enough with respect to n, such that the answers won't simply be encoded in the code of Y. If necessary, add more pairs to the dataset. (A rough sketch of this loop is given after the list.)
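Here is a rough sketch of the whole loop; every callable passed in (`invoke_x`, `compile_agent`, `answers_seem_useful`) is a hypothetical placeholder standing in for steps that the setup leaves to the boxed agents or to human judgment:

```python
def oracle_loop(dataset, new_questions, n,
                invoke_x, compile_agent, answers_seem_useful,
                growth=1.10):
    """Rough sketch of steps 1-3; the callables passed in are hypothetical placeholders."""
    while True:
        # Step 1: X (with the dataset and n in memory) outputs Y's code, at most n characters.
        y_code = invoke_x(dataset, n)
        invoke_y = compile_agent(y_code)  # turn the code string into a callable agent Y
        # Step 2: for each new question, take the answer to which Y assigns the highest probability.
        answers = {}
        for question in new_questions:
            distribution = invoke_y(question)
            answers[question] = max(distribution, key=distribution.get)
        if answers_seem_useful(answers):  # judged by humans
            return answers
        # Step 3: increase n by 10% and try again; keep the dataset large relative to n
        # (adding more pairs if necessary) so answers cannot simply be hard-coded into Y.
        n = int(n * growth)
```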

ETA: The agent Y is able to distinguish between examples from the dataset and examples that will only be solvable by humans much later. That's problematic because for the latter we plausibly won't get a useful output from Y. This post initially included an incorrect note that was intended to "fix" this (now removed). After getting helpful comments from Nicolas Tymmesa and Stuart Armstrong, I came to suspect that the following claim is true:

Suppose we sample a question from a new set of questions that are not in the dataset. When using agent Y to get an answer to the question, we might expect to get a useful answer only if there does not exist a program whose code length is the same as the code length of Y and that can determine whether a question belongs to the original dataset or to the new set.
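Stated a bit more formally (this notation is mine, not part of the original claim): writing $|P|$ for the code length of a program $P$, $Q_{\mathrm{old}}$ for the dataset's questions, and $Q_{\mathrm{new}}$ for the new questions, the claim is that useful answers on $Q_{\mathrm{new}}$ are plausible only if

$$\neg\exists D:\; |D| = |Y| \;\wedge\; \forall q \in Q_{\mathrm{old}} \cup Q_{\mathrm{new}}:\; D(q) = \mathbf{1}\left[\,q \in Q_{\mathrm{old}}\,\right].$$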

I think it’s plausible that with a large-enough dataset, the smallest n for which Y yields a reasonable performance on the dataset is small enough such that the above condition holds for a useful set of new questions. Note that after using Y to get answers to questions in the new set, we can then immediately add the new pairs to the dataset and repeat the entire process, now getting answers to a new set of questions, and so on. I wonder if it’s plausible that such an iterative process will end up giving us the most-important answers.

ETA: There might be a moral problem with using this setup if Y simulates sentient beings, which might be plausible given that it needs to predict the answers that sentient beings come up with. Actually, we ourselves might be serving that purpose for some Y :)

Why Might this Setup be Safer than the Naive One?

When X is invoked, its expected reward given an output does not depend on the state of the universe. Therefore, it might be plausible that X won't seek to manipulate anyone, etc. ETA: Wrong! With the "goal system" I had in mind, the agent X might pre-commit to act as follows: check whether a string s is already magically written in some specific place in memory. If so, write s to the output channel and terminate. Otherwise, write to the output channel something that will cause the creation of a new agent Z that has a lot more computation power than X. Agent Z will find a better possible-output-for-X than X ever could, and will then invoke many executions of X, writing that great output to the specific place where X will look for it. I need to think more about this.

But the Y agent that we get from X seems like a different story: we don't know anything about the goal system of Y (if it even has anything remotely like the goal systems we can imagine). If such a setup is used in a Hail Mary scenario, we can only hope that the shortest code that a superintelligence can find for answering our questions won't try to, for example, make us re-invoke it many times with easy questions in memory.

Edited to remove an incorrect note.