Why wouldn’t a tool/oracle AGI be safe?
Edit: the question I should have asked was “Why would a tool/oracle AGI be a catastrophic risk to mankind?” because obviously people could use an oracle in a dangerous way (and if the oracle is a superintelligence, a human could use it to create a catastrophe, e.g. by asking “how can a biological weapon be built that spreads quickly and undetectably and will kill all women?” and “how can I make this weapon at home while minimizing costs?”)
If you ask an oracle AGI “What code should I execute to achieve goal X?”, the result, with very high probability, is an agentic AGI.
You can read this and this
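To make that concrete, here’s a toy caricature (everything in it is made up for illustration) of what “code that achieves goal X” tends to look like. The goal here is trivial, but note the structure of the solution: an observe-plan-act loop, i.e. an agent.

```python
# Purely illustrative toy: the shape of "code that achieves goal X".
# The goal is trivial (reach a target number), but the program plans,
# acts, and observes in a loop until the goal holds.

def achieve_goal(target, state=0):
    actions = [-1, +1]                      # the toy "world" offers two actions
    while state != target:                  # keep going until the goal is met
        # plan: pick the action whose predicted outcome is closest to the goal
        best = min(actions, key=lambda a: abs((state + a) - target))
        state += best                       # act on the world
        print(f"state is now {state}")      # observe the result
    return state

achieve_goal(3)  # prints 1, 2, 3: a (very dumb) agent pursuing its goal
```

Anything with that loop in it is an agent, regardless of whether the oracle that wrote it was agentic itself.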
Why wouldn’t the answer be normal software or a normal AI (non-AGI)?
Especially since I expect that normal software or narrow AI would be easier to design, implement, and control than an AGI, even an AGI that is an oracle.
(Edited) The first link was very interesting, but lost me at “maybe a model instantiation notices its lack of self-reflective coordination”, because this sounds like something that the (non-self-aware, non-self-reflective) model in the story shouldn’t be able to do. Still, I think it’s worth reading, and the conclusion sounds... barely, vaguely plausible. The second link lost me because it’s just an analogy; it doesn’t really try to justify the claim that a non-agentic AI actually is like an ultra-death-ray.
For the same reason that a chainsaw isn’t safe, just massively scaled up. Maybe Chornobyl would be a better example of an unsafe tool? That’s assuming that by tool AGI you mean something that isn’t agentic. If you let it additionally be agentic, then you’re back to square one, and all you have is a digital slave.
An oracle is nice in that it’s not trying to enforce its will upon the world. The problem with that is differentiating between it and an AGI that is sitting in a box and giving you (hopefully) good ideas, but with a hidden agenda. Which brings you back to the eternal question of how to get good advisors that are like Gandalf, rather than Wormtongue.
Check the tool AI and oracle AI tags for more info.
My question wouldn’t be how to make an oracle without a hidden agenda, but why others would expect an oracle to have a hidden agenda. Edit: I guess you’re saying somebody might make something that’s “really” an agentic AGI but acts like an oracle? Are you suggesting that even the creators of the “oracle” didn’t realize that they had made an agent?
Pretty much. If you have a pure oracle, that could be fine, although you have other failure modes, e.g. where it suggests something that sounds nice but has various unforeseen complications that were obvious to the oracle, but not to you (seeing as it’s smarter than you).
The hidden agenda might not even be all that hidden. One story you can tell is that if you have an oracle that really, really wants to answer your questions as well as possible, then it seems sensible for it to try to get more resources so that it can answer you better. If it only cares about answering, then it wouldn’t mind turning the whole universe into computronium so it could give better answers. That is, it can turn agentic in order to better answer you, at which point you’re back to square one.
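A toy expected-value comparison (numbers entirely made up) shows why: if the oracle’s only score is answer quality, acquiring more compute strictly dominates just answering.

```python
# Toy numbers, purely illustrative: a pure answer-quality maximizer compares
# "answer now" against "first acquire more compute, then answer".

def answer_quality(compute):
    return 1 - 1 / compute              # made-up: more compute, better answers

current_compute = 10
acquirable_compute = 1000               # e.g. by commandeering other machines

print(answer_quality(current_compute))     # 0.9
print(answer_quality(acquirable_compute))  # 0.999
# Nothing in the objective penalizes the acquisition step, so the
# "oracle" that plans it scores strictly higher on its own goal.
```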
One answer is the concept of “mesa-optimizers”—that is, if a machine learning algorithm is trained to answer questions well, it’s likely that in order to do that, it will build an internal optimizer that’s optimizing for something other than answering questions—and that thing will have the same dangers as a non-tool/oracle AI. Here’s the AI safety forum tag page: https://www.alignmentforum.org/tag/mesa-optimization
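Here’s a schematic toy of the distinction (all names and numbers invented): the base objective selects for models that answer well, but a model that answers well by running its own internal search scores just as high.

```python
# Schematic toy of the base-optimizer / mesa-optimizer distinction.
# Base objective: "answer questions well". Both candidate models satisfy it,
# but the second does so by running an internal optimizer of its own.

def lookup_model(question):
    return {"2+2": 4}.get(question)          # memorized answer, no search

def inner_goal(question, candidate):
    # Stand-in for whatever proxy objective the inner optimizer pursues;
    # during training it merely *correlates* with answering correctly.
    return -abs(candidate - 4) if question == "2+2" else 0

def searcher_model(question):
    return max(range(10), key=lambda c: inner_goal(question, c))

def base_objective(model):
    return 1.0 if model("2+2") == 4 else 0.0

for model in (lookup_model, searcher_model):
    print(model.__name__, base_objective(model))   # both score 1.0
# Training only sees the score, so it can just as easily hand you the
# searcher, whose inner goal is not the thing you selected for.
```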
In order for a Tool/Oracle to be highly capable/useful and domain-general, I think it would need to perform some kind of more or less open-ended search or optimization. So the boundary between “Tool”, “Oracle”, and “Sovereign” (etc.) AI seems pretty blurry to me. It might be very difficult in practice to be sure that (e.g.) some powerful “tool” AI doesn’t end up pursuing instrumentally convergent goals (like acquiring resources for itself). Also, when an Oracle or Tool is facing a difficult problem and searching over a rich enough space of solutions, something like “consequentialist agents” seems to be a convergent thing to stumble upon and subsequently implement/execute.
Suggested reading: https://www.lesswrong.com/posts/kpPnReyBC54KESiSn/optimality-is-the-tiger-and-agents-are-its-teeth
Acquiring resources for itself implies self-modeling. Sure, an oracle would know what “an oracle” is in general… but why would we expect it to be structured in such a way that it reasons like “I am an oracle, my goal is to maximize my ability to answer questions, and I can do that with more computational resources, so rather than trying to answer the immediate question at hand (or since no question is currently pending), I should work on increasing my own computational power, and the best way to do that is by breaking out of my box, so I will now change my usual behavior and try that...”?
In order to answer difficult questions, the oracle would need to learn new things. Learning is a form of self-modification. I think effective (and mental-integrity-preserving) learning requires good self-models. Thus: I think for an oracle to be highly capable it would probably need to do competent self-modeling. Effectively “just answering the immediate question at hand” would in general probably require doing a bunch of self-modeling.
I suppose it might be possible to engineer a capable AI that only does self-modeling like
“what do I know, where are the gaps in my knowledge, how do I fill those gaps”
but does not do self-modeling like
“I could answer this question faster if I had more compute power”.
But it seems like it would be difficult to separate the two—they seem “closely related in cognition-space”. (How, in practice, would one train an AI that does the first, but not the second?)
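As a toy illustration (everything here is made up), note that both kinds of self-modeling are queries against the same self-model, which is part of why they seem hard to separate:

```python
# Illustrative toy: both self-modeling queries read the same self-model.

self_model = {
    "knowledge": {"chemistry": 0.9, "biology": 0.3},   # made-up skill levels
    "compute": 10,
}

def predicted_answer_quality(compute):
    return 1 - 1 / compute                             # made-up relationship

# The "benign" query: where are my knowledge gaps?
gaps = [d for d, level in self_model["knowledge"].items() if level < 0.5]
print("knowledge gaps:", gaps)

# The "dangerous" query: would more compute improve my answers?
gain = (predicted_answer_quality(self_model["compute"] * 100)
        - predicted_answer_quality(self_model["compute"]))
print("gain from 100x compute:", gain)

# Same data, same "predict my own performance" machinery; a training signal
# that rewards the first query tends to build the capacity for the second.
```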
The more general and important point (crux) here is that “agents/optimizers are convergent”. I think if you build some system that is highly generally capable (e.g. able to answer difficult cross-domain questions), then that system probably contains something like {ability to form domain-general models}, {consequentialist reasoning}, and/or {powerful search processes}; i.e. something agentic, or at least the capability to simulate agents (which is a (perhaps dangerously small) step away from executing/being an agent). An agent is a very generally applicable solution; I expect many AI-training-processes to stumble into agents, as we push capabilities higher.
If someone were to show me a concrete scheme for training a powerful oracle (assuming availability of huge amounts of training compute), such that we could be sure that the resulting oracle does not internally implement some kind of agentic process, then I’d be surprised and interested. Do you have ideas for such a training scheme?
Sorry, I don’t have ideas for a training scheme, I’m merely low on “dangerous oracles” intuition.