It’s unclear to me how this is different from other boxing designs, which merely trade some usefulness for safety. Therefore, like the other boxing designs, I don’t think this is a long-term solution. There isn’t an obvious question such that, if we could just ask it of an Oracle AI, the world would be saved. For sure, we should focus on making the first AGIs safe, and boxing methods may be a good way to do this. But creating AIs with epistemic design flaws seems like a risky solution. There are potentially many ways that, if the AI ever got out of the box, we would see malignant instantiations due to its flawed understanding of the world.
Honestly I’m not sure Oracles are the best approach either, but I’ll push the Pareto frontier of safe AI design wherever I can.
Though I’m less worried about the epistemic flaws exacerbating a box-break (it seems an epistemically healthy AI breaking its box would already be maximally bad) than about the epistemic flaws being prone to self-correction. For instance, the AI might construct a subagent of the ‘try random stuff, repeat whatever works’ flavor.
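To make the ‘try random stuff, repeat whatever works’ flavor concrete, here is a minimal toy sketch (the function names, the candidate representation, and the scoring are purely illustrative assumptions of mine, not anything from the designs under discussion). The worry it illustrates: a loop like this optimizes against observed outcomes directly, so it need not inherit whatever epistemic flaws were engineered into its parent.

```python
import random

def try_random_stuff(evaluate, propose, steps=1000):
    """Toy 'try random stuff, repeat whatever works' loop: blind variation
    plus selection on observed outcomes, with no world-model involved."""
    best = propose(None)            # start from an arbitrary candidate
    best_score = evaluate(best)
    for _ in range(steps):
        candidate = propose(best)   # random variation on the current best
        score = evaluate(candidate)
        if score > best_score:      # repeat whatever works
            best, best_score = candidate, score
    return best

# Purely illustrative usage: candidates are numbers, "working" means being near 42.
result = try_random_stuff(
    evaluate=lambda x: -abs(x - 42),
    propose=lambda x: 0.0 if x is None else random.gauss(x, 1.0),
)
```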
On the other hand, it’s plausible that computational complexity limitations mean that any cognitive system will always have some epistemic flaws, and it’s more a question of which ones. (That said, of course there can be differences in how large the flaws are.)
“How do I create a safe AGI?”
Edit: Or, more likely, “This is my design for an AGI; (how) will running this AGI result in situations that I would be horrified by if they occur?”
You won’t be horrified if you’re dead. More seriously though, building an Oracle AI that understood the intended meaning of our questions and did not lie to or deceive us in any way would be an AI-alignment-complete problem; in other words, just as hard as creating friendly AI in the first place.