Could someone point me to any existing articles on this variant of AI-Boxing and Oracle AGIs:
The boxed AGI’s gatekeeper is a simpler system which runs formal proofs to verify that the AGI’s output satisfies a simple, formally definable constraint. The constraint is not “safety” in general, but rather is narrow enough that we can be mathematically sure that the output is safe. (This does limit the potential benefits from the AGI.)
The question of what the constraint should be remains open, and of course the fact that the AGI is physically embodied puts it in causal contact with the rest of the universe. But as a partial or short-term solution, has anyone written about it? The only one I can think of (though I can’t find the specific article) is Goertzel’s description of an architecture where the guardian component is separate from the main AGI.
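The gatekeeper idea above could be sketched as follows. This is a toy illustration, not a proposal: the constraint here (a verifiably true arithmetic claim) is a stand-in for whatever narrow, formally definable property one would actually choose, and all names are hypothetical. The key structural point is that the checker is simple, independent of the AGI, and fails closed.

```python
from typing import Optional

def constraint(output: str) -> bool:
    """A deliberately narrow, decidable constraint, as a stand-in:
    the output must be a well-formed claim 'a + b = c' that is
    actually true. Anything the checker cannot positively verify
    is rejected."""
    parts = output.split()
    if len(parts) != 5 or parts[1] != "+" or parts[3] != "=":
        return False
    try:
        a, b, c = int(parts[0]), int(parts[2]), int(parts[4])
    except ValueError:
        return False
    return a + b == c

def gatekeeper(agi_output: str) -> Optional[str]:
    # Fail closed: release the output only on positive verification.
    return agi_output if constraint(agi_output) else None
```

Note that the gatekeeper never has to understand the AGI; it only has to decide the narrow predicate, which is what makes the safety argument mathematical rather than empirical.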
The only one I can think of (though I can’t find the specific article) is Goertzel’s description of an architecture where the guardian component is separate from the main AGI.
You’re probably thinking of GOLEM. The Gödel machine is another proposal along somewhat similar lines.
Some discussions more directly related to your suggestion could be Roman Yampolskiy’s proposal for “safe questions”:
Our proposed protocol is based on the idea of asking what we will call ‘safe questions’ with restricted answers. We define a safe question as one for which a human being could find an answer without any help from superintelligence, all on his own merit. Why is this useful? Suppose that a scientist working on a cure for cancer is looking at two potential cures. Each one could be developed and tested without assistance from the AI, but would take about 3 years of hard work each. Which one to try first if the scientist believes that they have about an equal chance of working? If a wrong option is chosen to be tested first, humanity will still obtain a cure for cancer but it will happen 3 years later. What if we could ask the AI to suggest which option to try first? [...]

To make sure that a potential question is safe, a group of experts could be used to review it. The experts should all be trained AI safety engineers, meaning that they are familiar with the design of the AI and its confinement environment as well as the latest developments in machine ethics (Allen et al., 2006; Anderson & Anderson, 2007; Hall, 2000; Moor, 2006; Tonkens, 2009). Experts may also need to be trained in computer psychology, a currently non-existent profession which might become a reality in the future (Epstein, 1997). [...]

The limited outbound channel shall consist of binary (0 or 1) answers next to original unmodified safe questions presented by the intermediary non-intelligent software in random order, so as to avoid communication via covert timing or resource channels which make steganography possible. The output should in no way be controllable by the AI, outside of the multiple choice menu.
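The quoted outbound channel could be sketched as a small piece of intermediary software. This is an illustrative reading of the protocol, not Yampolskiy’s implementation, and `ask_ai` is a hypothetical stand-in for the actual confinement channel: pre-approved safe questions are presented in random order, and nothing is accepted from the AI except a 0-or-1 choice per question.

```python
import random

def run_channel(safe_questions, ask_ai):
    """Present pre-approved safe questions in random order and accept
    only binary answers. `ask_ai(question)` is a hypothetical hook into
    the confinement channel and must return 0 or 1."""
    order = list(range(len(safe_questions)))
    random.shuffle(order)  # random order, to frustrate covert timing channels
    results = {}
    for i in order:
        answer = ask_ai(safe_questions[i])
        if answer not in (0, 1):
            # The AI controls nothing but the multiple-choice selection.
            raise ValueError("non-binary answer rejected")
        results[safe_questions[i]] = answer
    # Answers are reported next to the original, unmodified questions.
    return results
```

The randomized order and the hard 0/1 check are doing the work here: the AI’s only degree of freedom is which box it ticks.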
“Thinking inside the box” (section 3.1) also has some discussion about restricting an AGI’s outputs to be safe by selecting the questions to be asked appropriately. There was also some discussion along these lines at the beginning of chapter 10 of Superintelligence, e.g., running multiple copies of different Oracles and comparing their answers with each other to avoid deception; “the answers of the different oracles have to match each other” could be a simple and formally definable criterion.
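The Superintelligence-style agreement check mentioned above is simple enough to sketch directly. This is a minimal illustration under the stated assumption that the oracles are independently implemented; the oracle interface is hypothetical.

```python
def agreed_answer(oracles, question):
    """Query several independently implemented oracles and release an
    answer only if all of them agree; otherwise release nothing."""
    answers = [oracle(question) for oracle in oracles]
    if answers and all(a == answers[0] for a in answers):
        return answers[0]  # unanimous: release the answer
    return None            # disagreement (or no oracles): release nothing
```

Agreement is a formally checkable property, but note that it proves only consistency among the oracles, not safety of the agreed answer.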
Thank you, Kaj. Those references are what I was looking for.
It looks like there might be a somewhat new idea here. Previous suggestions, as you mention, restrict output to a single bit or require review by human experts. Using multiple AGI oracles to check each other is a good one, though I’d worry about acausal coordination between the AGIs, and I don’t see that the safety is provable beyond checking that the answers match.
This new variant gives the benefit of provable restrictions and the relative ease of implementing a narrow-AI proof system to check it. It’s certainly not the full solution to the FAI problem, but it’s a good addition to our lineup of partial or short-term solutions in the area of AI Boxing and Oracle AI.
I’ll get this feedback to the originator of this idea and see what can be made of it.