If we’re making a list of models for non-goal-directed AI (and we should!!), I would propose two more:
Non-consequentialist oracle AI: An oracle whose algorithm does not think through the consequences of its own outputs. You ask it a question, it digs through its world-model for a fixed number of computation steps and spits out its best-guess answer, but crucially it does not try to model the causal effects of that output. (Contrast with Eliezer’s side-comment about an oracle here, which he of course assumes will be goal-directed, with the goal of “increase the correspondence between the user’s belief about relevant consequences and reality”.) A non-consequentialist oracle could never be deceptive or manipulative, because deception and manipulation require modeling the causal effects of outputs. I’ve speculated a bit about how to build something like this, but it’s still definitely an open question.
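To make the “fixed number of computation steps, no modeling of outputs” shape concrete, here’s a minimal toy sketch in Python. The world-model interface (init_inference, refine) and the budget parameter are invented for illustration, not a proposal for an actual architecture; the point is just that the answer gets returned directly, with no step that reasons about its downstream effects:

```python
from dataclasses import dataclass

@dataclass
class InferenceState:
    question: str
    guess: str
    confidence: float

class ToyWorldModel:
    """Stand-in for a learned world model (hypothetical interface)."""
    def __init__(self, facts):
        self.facts = facts  # {keyword: answer}

    def init_inference(self, question):
        return InferenceState(question, guess="unknown", confidence=0.0)

    def refine(self, state):
        # One bounded "digging" step through stored knowledge.
        for keyword, answer in self.facts.items():
            if keyword in state.question and state.confidence < 1.0:
                return InferenceState(state.question, answer,
                                      min(1.0, state.confidence + 0.5))
        return state

class NonConsequentialistOracle:
    def __init__(self, world_model, budget):
        self.world_model = world_model
        self.budget = budget  # fixed number of computation steps

    def answer(self, question):
        state = self.world_model.init_inference(question)
        for _ in range(self.budget):  # bounded compute, then stop
            state = self.world_model.refine(state)
        # Note what is absent: no step anywhere models the causal effect
        # of emitting this answer on the world, so there is nothing for
        # deception or manipulation to latch onto.
        return state.guess

oracle = NonConsequentialistOracle(ToyWorldModel({"sky": "blue"}), budget=10)
print(oracle.answer("what color is the sky?"))  # -> blue
```

The safety-relevant property lives in what’s missing: answer() never contains a “simulate the world after I say this” step, so there’s no channel through which the output could be optimized to manipulate the listener.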
Interpretable-world-model as AI: Kinda related to the first. Imagine you take an AGI that deeply understands the world, you extract its world-model, and you have a way to browse it—like the world-model is somehow 100% super-easily interpretable. What causes Alzheimer’s? Well, you would go to the Alzheimer’s entry of the world-model, and you’ll find a beautiful way of thinking about Alzheimer’s in terms of these other three concepts, which in turn refer to other concepts, and so on. What would happen if we started a political movement against squirrels? Well, through the world-model interface, we can throw that hypothetical scenario at other entities in the world-model (people, journalists, politicians) and see what the predicted effects are. My intuition here is: (1) It’s nice to have a map when you’re traveling, (2) It’s nice to have Wikipedia when you’re learning, (3) It would be nice to have a crystal ball when you’re planning… Maybe there’s some way to build a system that combines all those things and more, but is still fundamentally tool-ish? My impression is that something like this is at the core of the Kurzweil-ish vision of how brain-computer interfaces are going to solve the problem of AGI safety (see also waitbutwhy on Neuralink). (Needless to say, it’s possible to try to implement this vision without brain-computer interfaces, and vice versa.)
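As a toy illustration of the kind of interface I’m picturing, here’s a Python sketch. The Concept entry structure, the lookup method, and the stubbed what_if prediction are all made-up placeholders, not claims about what an actually-extracted world-model would look like:

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    name: str
    explanation: str
    references: list = field(default_factory=list)  # links to other entries

class WorldModelBrowser:
    """Hypothetical read-only interface over an extracted world model."""
    def __init__(self):
        self.entries = {}

    def add(self, concept):
        self.entries[concept.name] = concept

    def lookup(self, name):
        # The "Wikipedia" use case: read an entry, follow its references.
        return self.entries[name]

    def what_if(self, scenario, entities):
        # The "crystal ball" use case: throw a hypothetical scenario at
        # modeled entities. Stubbed here; a real system would roll the
        # world model's dynamics forward to generate the predictions.
        return {e: f"predicted reaction of {e} to: {scenario}"
                for e in entities}

browser = WorldModelBrowser()
browser.add(Concept("Alzheimer's",
                    "explained in terms of these other concepts...",
                    references=["amyloid plaques", "tau tangles",
                                "neuroinflammation"]))
print(browser.lookup("Alzheimer's").references)
print(browser.what_if("a political movement against squirrels",
                      ["journalists", "politicians"]))
```

Obviously the stubbed what_if is doing none of the real work here; the hard part is getting a world-model whose entries and dynamics are actually legible like this in the first place.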