Different goals may bring AI into conflict with us
Context: This is a linkpost for https://aisafety.info/questions/NM3H/7:-Different-goals-may-bring-AI-into-conflict-with-us
This is an article in the new intro to AI safety series from AISafety.info. We’d appreciate any feedback. The most up-to-date version of this article is on our website.
Aligning the goals of AI systems with our intentions could be really hard. So suppose we fail, and build a powerful AI that pursues goals different from ours. You might think it would just go off and do its own thing, and we could try again with a new AI.
But unfortunately, according to the idea of instrumental convergence, almost any powerful system that optimizes the world will find it useful to pursue certain kinds of strategies. And these strategies can include working against anything that might interfere.
For example, as a thought experiment, consider Peelie, a robot that cares only about peeling oranges, always making whatever decision results in maximum peeled oranges. Peelie would see reasons to:
Remove nearby objects that might block its supply of oranges.
Acquire resources that it can use to accomplish its goals, like money to buy oranges and knives.
Convince people to never turn it off, because if it’s turned off, it peels no oranges.
Remove anything that might change its goals, because if it started to peel lemons instead, it would peel fewer oranges.
Seize control of the building it’s in — just in case.
Seize control of the country it’s in — just in case the people there would stop it from seizing the building.
Build a smarter AI that also only cares about peeling oranges.
Hide its intentions to do these things, because if humans knew its intentions, they might try to stop it.
This is meant as an illustration, not a realistic scenario. People are unlikely to build a machine that’s powerful enough to do these things, and only make it care about one thing.
But the same basic idea applies to any strong optimizer with different goals from ours, even if they’re a lot more complex and harder to pin down. The optimizer can become more successful by contesting our control — at least, if it does so successfully.
And while people have proposed ways to address this problem, such as forbidding the AI from doing certain kinds of actions, the AI alignment research community doesn’t think the solutions we’ve considered so far are sufficient to contain superintelligent AI.
In science fiction, the disaster scenario with AI is often that it “wakes up” and becomes conscious, and starts hating humans and wanting us to be harmed. That’s not the main disaster scenario in real life. The real-life concern is simply that it will become very competent at planning, and remove us as a means to an end.
Once it’s much smarter than us, it may not find that hard to accomplish.
What if Peely had a secondary goal to not harm humans? What is stopping it from accomplishing goal number 1 in accordance with goal number 2? Why should we assume that a superintelligent entity would be incapable of holding multiple values?
A key question is if the typical goal-directed superintelligence would assign any significant value to humans. If it does, that greatly reduces the threat from superintelligence. We have a somewhat relevant article earlier in the sequence: AI’s goals may not match ours.
BTW, if you’re up for helping up improve the article, would you mind answering some questions? Like: do you feel like our article was “epistemically co-operative”? That is, do you think it helps readers orient themselves in the discussion on AI safety, makes the assumptions clear, and generally tries to explain rather than persuade? What’s your general level of familiarity with AI Safety?