I’ve been thinking about writing a pitch for AI risk that would sidestep some of the usual objections, most of which come from people latching onto the word “intelligence” and bringing up connotations that are irrelevant to the argument. But then it got a bit out of hand and turned into a small fiction story. On rereading it, I’m aware that it might be preaching to the choir. Here goes:
The Danger of Automatic Planning
Imagine that 30 years from now, your smartphone has an app called Automatic Planner. It can accept data from the phone’s sensors, and from the internet too if you allow it. The main function of the app is to answer natural language questions like “what’s the most effective plan for me to achieve goal X, provided that I follow that plan?” Internally it uses fancy machine learning techniques and predictive models trained on things like physics, biology, psychology, and economics. This all sounds a little far-fetched, but not much more than Google Translate or Wikipedia would’ve sounded 30 years ago.
The same app also existed on the previous generation of smartphones and was a huge success. You could ask it for simple everyday things, like getting a cheeseburger, and it would give you directions to the nearest burger joint. Unfortunately the new generation of phones, which has four times more computing power, seems to trigger some kind of weird bug. Most queries still work as before, but sometimes you unexpectedly get an answer like “the best plan is for you to follow this link”, followed by a long incomprehensible string of characters. What’s worse, the link usually points to some random website that has no relation to the app or to the question being asked.
Once the bug is noticed, the developers push a quick fix to make the app use less computing power on newer phones. That seems to make the problem go away, buying some time for proper investigation. After a bit of cautious poking and prodding, Lawrence, the app’s chief developer, asks for the best plan to get a cheeseburger. The app gives him a link to the website of some furniture company in South Africa. Lawrence clicks the link to see what will happen.
Soon after that, the world ends. For the benefit of those of us living in the counterfactual past and wishing to avoid such a future, here’s a rough timeline of events:
T+1 second: the furniture company’s website is hacked through a vulnerability in the request parser.
T+10 seconds: the next few visitors to the site are owned through an exploit inserted into the main page. Each of their machines starts sending requests to more websites, leading to a chain reaction.
T+1 minute: the combined computing power of the new botnet exceeds that of every company or government in the world. Some of the power is shifted from spreading the infection to running automatic planning, using the app’s original algorithm and Lawrence’s original question.
T+2 minutes: a plan with a high chance of success is devised and put into action. Global communications filter goes live. Manufacturing takeover begins.
T+5 minutes: Lawrence and his whole neighborhood have been put into catatonia by carefully chosen stimuli delivered over the internet, to minimize the chance that anything prevents Lawrence from getting the cheeseburger. The same happens to key decision makers across the world who could interfere with the plan.
T+10 minutes: first-stage defensive perimeter is constructed, using remotely controlled cars and repurposed electronics. Second-stage physical defenses are on the way. All airplanes are destructively grounded to minimize risk.
T+30 minutes: manufacturing takeover yields first results.
T+1 hour: (incomprehensible)
...
T+10 years: the Solar System’s matter and energy have been put to good use, except for narrow beams of fake sunlight aimed at neighboring star systems. The Von Neumann probe program is underway. In a secure location, Lawrence receives yet another cheeseburger, due to a small but nonzero chance that the goal wasn’t actually achieved and all previous reports of success were random flukes.
So, I don’t think this works all that well because of the difference between “what is the most effective plan to get me a cheeseburger?” and “make me a cheeseburger.” It seems relatively easy to get planning software that terminates on a simple, non-extreme plan that makes you a cheeseburger, and relatively hard to get planning software that terminates on the most effective plan to get you a cheeseburger (while remaining non-extreme).
I think the most mainstream-acceptable approach is to start with a discussion of the principal-agent problem, and then discuss how “self-interest” can be generalized to deal with “limited information” or limited communication ability. Even in the case where an agent wants to be perfectly aligned with the principal, any difference between the agent’s understanding of the principal’s goals and the principal’s understanding of those goals is problematic.
The value alignment problem, it seems to me, has two parts that might actually collapse into one: first, how do we communicate values in such a way that the principal can trust that the agent will do what the principal actually wants; and second, how do we build a system of values and meta-values that remains stable under self-modification? (If later versions of a self-modifying system are the agent and the earlier versions are the principal, this might be a subset of the first problem; but it also might have features that require separate treatment.)
It’s not clear to me that it helps to point out that it is Very Bad to get the problem wrong. This might be the sort of desperation that actually diminishes rather than increases interest in the problem.
Yeah, I see now that the story doesn’t work very well. It’s unrealistic that an ad hoc AI designed for answering human questions would manage a coherent takeoff on the first try, without failing miserably due to some flaws in architecture or self-modeling. In all likelihood, making an AI take off without tripping over itself is a hard engineering problem that you can’t solve by accident. That seems like a new argument against this particular kind of doomsday scenario. I need to think about it.
That’s the friendly AI problem. If you have a piece of planning software that seems to work fine, and you give it more and more options and resources, how do you know that it will keep generating non-extreme plans?
If it terminates as soon as it hits a plan that achieves the goal, and the possible actions are ordered in terms of how extreme they are, then increasing the available resources can’t cause trouble, but increasing the available options can (because your ordering might go from correct to incorrect).
In general optimization terms, this is the difference between local optima and global optima. If you have a reasonable starting point and use gradient descent, then to end up at a reasonable ending point you only need the local solution space to be reasonable, because the total distance you’ll travel is likely to be short (relative to the solution space, and dependent on its topology, of course). If you insist on the global optimum, you need the entire solution space to be reasonable.
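Here’s a minimal sketch of that contrast, assuming an invented plan list with made-up effectiveness and “extremeness” scores (none of this is real planning machinery; it’s just to make the two termination conditions concrete):

```python
# Toy contrast between a satisficing planner and an optimizing planner.
# The plans, effectiveness scores, and "extremeness" ordering are all
# invented for illustration.

# (plan, achieves_goal, effectiveness, extremeness)
PLANS = [
    ("walk to the burger joint",     True, 0.7, 1),
    ("order delivery",               True, 0.8, 2),
    ("bribe the cook to stay open",  True, 0.9, 5),
    ("take over the global economy", True, 1.0, 9),
]

def satisficing_plan(plans):
    """Return the first goal-achieving plan in increasing order of
    extremeness, then stop. More computing power changes nothing here;
    only a new, wrongly-ranked option can cause trouble."""
    for name, achieves, _, _ in sorted(plans, key=lambda p: p[3]):
        if achieves:
            return name
    return None

def optimizing_plan(plans):
    """Return the most effective goal-achieving plan, which means the
    entire option space matters, extreme corners included."""
    return max((p for p in plans if p[1]), key=lambda p: p[2])[0]

print(satisficing_plan(PLANS))  # walk to the burger joint
print(optimizing_plan(PLANS))   # take over the global economy
```

The satisficer’s output only degrades if a new, wrongly-ranked option slips in below the safe ones; the optimizer’s output depends on the whole option space being reasonable.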
I’ve since edited the previous comment to agree with you in principle, but I think this particular objection doesn’t really work.
Let’s say Lawrence asks the AI to get him a cheeseburger with probability at least 90%. The AI can’t use its usual plan because the local burger place is closed. It picks the next simplest plan, which involves using a couple more computers for additional planning and doesn’t specify any further details. These computers receive the subgoal “maximize the probability no matter what”, because it’s slightly simpler mathematically than capping it at 90%, and doesn’t have any downside from the POV of the original goal. (A toy sketch below makes this failure concrete.)
If you want the AI to avoid such plans, it needs to have a concept of “non-extreme” that agrees with our intuitions more reliably. As far as I understand, that’s pretty much the friendly AI problem.
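Here’s a toy version of that delegation step, assuming a made-up pair of candidate subgoals and a made-up simplicity measure (shorter formula = mathematically simpler); the point is just that both candidates pass the parent goal’s check, so the simplicity bias picks the uncapped one:

```python
# Toy version of the delegation step: the planner hands a subgoal to
# the extra computers, preferring "mathematically simpler" objectives.
# The candidates and the simplicity measure are invented.

TARGET = 0.90  # Lawrence asked for probability at least 90%

# Candidate subgoals: (formula as text, objective function).
candidates = [
    ("min(p, 0.9)", lambda p: min(p, 0.9)),  # capped at the target
    ("p",           lambda p: p),            # maximize no matter what
]

def satisfies_parent(objective):
    # From the parent goal's POV, a subgoal is fine if maximizing it
    # would push the probability to at least the target.
    return objective(1.0) >= TARGET

ok = [c for c in candidates if satisfies_parent(c[1])]
formula, subgoal = min(ok, key=lambda c: len(c[0]))
print(formula)  # "p" -- the uncapped objective wins on simplicity,
                # and nothing in the original goal penalizes it.
```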
“As far as I understand, that’s pretty much the friendly AI problem.”
I think it’s simpler, but not by much. Instead of knowing both the value and the cost of everything, you just need to know the cost of everything. (The ‘actual’ cost, that is, not the full economic cost, which includes the value problem by way of opportunity cost.) You could probably get away with an approximation of the cost, though a guarantee like “at least as high as the actual cost” is probably helpful.
So if Lawrence says “I’ll pay up to $10 for a hamburger,” either it can find a plan that provides Lawrence a hamburger for less than $10 (gross cost, not net cost), or it says “sorry, can’t find anything in that price range.” (A sketch of this behavior follows below.)
I think there’s a huge amount of work to get there—you have to have an idea of ‘gross cost’ that matches up well enough with our intuitions, which is an intuition-encoding problem and thus hard. (If it tweets at the local burger company to get a coupon for a free burger, what’s the cost?)
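Still, here’s a minimal sketch of the capped-cost behavior, assuming an invented plan list and a cost estimator that over-approximates (the “at least as high as the actual cost” guarantee mentioned above). It deliberately dodges the coupon question:

```python
# Toy cost-capped planner. The plan list and numbers are invented; the
# one real idea is the over-approximating estimate: if even the
# overestimate fits under the cap, the true gross cost does too.

PLANS = [
    ("drive to the burger joint", {"gas": 3.00, "burger": 6.00}),
    ("order gourmet delivery",    {"burger": 9.00, "delivery": 5.00}),
]

def estimated_cost(resources):
    # Guaranteed to be at least as high as the actual cost:
    # pad every line item by 10%.
    return sum(resources.values()) * 1.10

def plan_under_cap(plans, cap):
    """Return the first plan whose estimated gross cost fits under
    the cap, or report failure instead of escalating."""
    for name, resources in plans:
        if estimated_cost(resources) <= cap:
            return name
    return "sorry, can't find anything in that price range"

print(plan_under_cap(PLANS, cap=10.00))  # drive to the burger joint
```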
I’ve since edited my comment to agree with you. That said...
“and the possible actions are ordered in terms of how extreme they are”
That’s the friendly AI problem. Maybe it can be solved by defining a metric on the solution space and making the AI stay close to a safe point, but I don’t know how to define such a metric. Clicking a link seems like a non-extreme action. It might have extreme consequences, but that’s true for all actions. Hitler’s genetic code was affected by the flapping of a butterfly’s wings across the world.
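For concreteness, one naive candidate metric (purely illustrative: the plans, the cutoff, and the choice of edit distance are all invented) is to score a candidate plan by its edit distance from a known-safe plan and reject anything far away. It fails for exactly the reason above: “click this link” is a tiny edit away from a safe plan:

```python
# Toy "stay near a safe point" filter: measure how far a candidate
# plan is from a known-safe reference plan and reject distant ones.
# It runs straight into the problem above: an action can be nearby
# in plan space while having extreme consequences.

def edit_distance(a, b):
    """Plain Levenshtein distance between two action sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

SAFE_PLAN = ["open browser", "visit burgerplace.example", "order"]

def is_near_safe(plan, cutoff=2):
    return edit_distance(plan, SAFE_PLAN) <= cutoff

print(is_near_safe(["open browser", "visit evil.example", "click link"]))
# True -- only two substitutions away from the safe plan, yet clicking
# that link is exactly how the story's world ends.
```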
Yet we’ve managed to create Google Maps such that you can ask it for the shortest route from A to B and it never makes errors of this sort.

Yes, point taken, but Google Maps is optimizing over a pretty narrow domain. It seems to me that an application that optimized across several domains at once (physics, biology, psychology, economics) might be more dangerous, while being not much more complicated internally than Google Maps or Google Translate.
It will also frequently route you across a ferry that won’t run until the ice melts two months from now, because its “planning” isn’t dynamic to any significant degree.
This remains no more convincing to me than any other time this argument has been made.

Yeah. I suppose it ended up being more of an exercise for me to tease out the non-convincing parts. See reply to Vaniver. Still a worthwhile exercise, though.