Confirmed Selective Oracle

A putative new idea for AI control; index here.

I originally came up with a method for (safely?) extracting high impact from low impact AIs.

But it just occurred to me that the idea could be used for standard Oracles, in a way that keeps them arguably safe, without needing to define truth or similar concepts. So introducing:

The Confirmed Selective Oracle!


A Selective Oracle is one that can’t give us a general answer to a question, but instead has to choose among a (small) variety of predetermined answers. A Confirmed Selective Oracle is one where the Oracle’s answer is confirmed at some later point.

For instance, imagine we have 1000 different stocks, and we want to get rich. We ask the Oracle which one will increase the most in the coming year. The Oracle’s answer is simply the name of one of the stocks (selective). In a year’s time, we give the Oracle a number: the rank, among the 1000, of the stock it selected, by its performance over the year (confirmation).

The Oracle’s objective is simply to minimise the number it gets. If we design the physical setup well, and the AI knows all the details of it, we could get away with simply motivating it to minimise that number, and not bother with any concerns of truth or accuracy.
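To make this concrete, here is a minimal sketch of one episode of the protocol in Python. The `ask_oracle` and `send_confirmation` callables, and the `yearly_returns` data, are hypothetical stand-ins for whatever the physical setup actually provides; the point is just that the Oracle's only output is one name from the predetermined set, and its only feedback is the rank.

```python
# Hypothetical sketch of one episode of the Confirmed Selective Oracle protocol
# for the stock example. ask_oracle and send_confirmation stand in for whatever
# channels the physical setup provides; yearly_returns maps stock name -> return.

STOCKS = [f"STOCK_{i:04d}" for i in range(1000)]  # the predetermined answer set

def run_episode(ask_oracle, send_confirmation, yearly_returns):
    # Selective: the Oracle must name exactly one of the 1000 stocks.
    answer = ask_oracle(
        question="Which of these stocks will rise the most in the coming year?",
        allowed_answers=STOCKS,
    )
    assert answer in STOCKS, "the Oracle may only pick from the predetermined set"

    # (A year passes; we presumably act on the answer.)

    # Confirmed: rank the chosen stock by realised return (1 = best of the 1000).
    ranking = sorted(STOCKS, key=lambda s: yearly_returns[s], reverse=True)
    rank = ranking.index(answer) + 1

    # The rank is the only number fed back; the Oracle's objective is to minimise it.
    send_confirmation(rank)
    return answer, rank
```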

We could then reset the Oracle, and run it on various different questions of the same type, to serve as a general information extractor or to train the Oracle.
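The reset-and-repeat pattern might look something like the sketch below, where `fresh_oracle` is a hypothetical factory returning a newly reset Oracle instance for each question, so that no state carries over between episodes.

```python
# Hypothetical sketch of repeated use: one freshly reset Oracle per question,
# each episode scored only by its own confirmation number.

def run_many(questions, fresh_oracle, confirm_rank):
    history = []
    for question in questions:
        oracle = fresh_oracle()                 # perfect reset: no memory of past episodes
        answer = oracle.ask(question)           # selective answer from a fixed set
        rank = confirm_rank(question, answer)   # confirmation, e.g. a year later
        oracle.receive(rank)                    # the only feedback the Oracle gets
        history.append((question, answer, rank))
    return history                              # usable as data to train the Oracle
```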

Thoughts and concerns

The design requires that the Oracle be physically contained, with no back-channels of communication. This is very important; the AI may attempt to break out or manipulate us, because its goal (minimising a number we input) is very unsafe. That means no clarification questions from the Oracle, no interaction, and only very minimal observation of it.

It also requires that the Oracle be reset perfectly, and not share some timeless goal with different copies of itself.

We have to keep the set of possible answers small, while being aware that if one of the answers leads to the AI being unboxed and taking over the world, the AI will select that one.

Finally, we need the Oracle’s answers to be self-consistent with being known. For instance, in the stock market example, we are asking it to choose the highest-growth stock given its answer and all the consequences of that answer (such as us buying that stock, the Oracle’s answer leaking, and so on). Since we will confirm the answer at the end of the year, after having seen and acted on it, this self-consistency is required for a useful Oracle.

It might well be that the Oracle could efficiently select the highest-growth stock if we never knew its answer, but that our actions upon hearing the answer mean that it can only get us into the top 10. This is fine; it will still try to do the best it can, given this constraint. As long as we choose the task well, and the task isn’t actually anti-inductive, we can still get great power from the CSO.
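One way to picture this self-consistency requirement: the Oracle's best move is the answer whose expected rank is lowest conditional on that answer being announced and acted upon, not the answer that would have done best had nobody heard it. A toy sketch, where `expected_rank_given_announcement` is a hypothetical model of that conditional outcome:

```python
# Toy illustration of self-consistency: the Oracle should minimise the expected
# rank *given that its answer becomes known* (we buy the stock, the answer leaks,
# and so on), not the rank the stock would have had in our absence.

def best_selective_answer(stocks, expected_rank_given_announcement):
    # expected_rank_given_announcement(stock) models the rank that stock ends up
    # with, conditional on the Oracle naming it and us acting on that.
    return min(stocks, key=expected_rank_given_announcement)
```

If announcing any answer degrades its achievable rank (say, to somewhere in the top 10), this minimisation still returns the best option available under that constraint.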
