We could start from simple problems, progress to more complex ones, and gradually learn about the shortcuts AI could make, and develop related heuristics. This seems like a good plan, though there are a few things that could make it difficult:
Maybe we can’t develop a smooth problem-complexity scale with meaningful problems along it. (If a problem is not meaningful, it will be difficult to tell the difference between a reasonable and an unreasonable solution; and a heuristic developed on artificial problems may fail on an important one.) The set of necessary heuristics may grow too fast: maybe a problem twice as complex will need ten times as many safeguards, which would make progress beyond some boundary extremely slow, and some important tasks may lie far beyond that boundary. Or at some point we may be unable to produce the necessary heuristic, because the heuristic itself would be too complex.
But generally, the first good use of the superhuman AI is to learn more about the superhuman AI.
I think I would be surprised to see average distance to Friendliness grow exponentially faster than average distance to mindhacking as the scale of the problem increased. That does not seem likely to me, but I wouldn’t make that claim too confidently without some experimentation. Also, for clarity, I was thinking that the heuristics for weeding out bad plans would be developed automatically. Adding them by hand would not be a good position to be in.
I see the fundamental advantage of the bootstrap idea as being able to worry about a much more manageable subset of the problem: can we keep it on task and not-killing-us long enough to build something safer?
I think I would be surprised to see average distance to Friendliness grow exponentially faster than average distance to mindhacking as the scale of the problem increased.
I was thinking about something like this:
A simple problem P1 has a Friendly solution with cost 12 and three Unfriendly solutions with cost 11. We either have to add three simple heuristics, one to block each Unfriendly solution, or perhaps one more complex heuristic to block them all; but the latter option assumes that the three Unfriendly solutions share some similar essence, which can be identified and blocked.
A complex problem P9 has a Friendly solution with cost 8000, and a thousand Unfriendly solutions with costs between 7996 and 7999. Three hundred of them are blocked by heuristics already developed for problems P1 through P8, but that leaves seven hundred new ones. The problem is not that the gap between 7996 and 8000 is wider than the gap between 11 and 12, but that within that gap the number of “creatively different” Unfriendly solutions grows too fast. We have to find a ton of new heuristics before moving on to P10.
These are all just imaginary numbers, but my intuition is that a more complex problem may offer not only a few much cheaper Unfriendly solutions, but also a great many slightly cheaper ones, whose diversity may be hard to cover with a small set of heuristics.
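The scaling worry in the P1/P9 example can be made concrete with a toy simulation (all numbers are invented, like the figures above): fix an absolute cost band just below the Friendly solution’s cost and count how many random candidate solutions land inside it as the search space grows.

```python
# Toy illustration of the P1/P9 scaling worry: with a fixed absolute
# cost band below the Friendly solution, a larger search space packs
# far more near-optimal (Unfriendly) candidates into that band.
# Every number here is made up for illustration.

import random

random.seed(0)

def near_optimal_count(friendly_cost, n_candidates, band=4):
    """Count candidate costs landing within `band` below the Friendly cost."""
    costs = [friendly_cost - random.uniform(0, friendly_cost * 0.1)
             for _ in range(n_candidates)]
    return sum(1 for c in costs if friendly_cost - band <= c < friendly_cost)

# P1-like: tiny search space.
small = near_optimal_count(friendly_cost=12, n_candidates=30)
# P9-like: the same 4-point band, but a vastly larger search space.
large = near_optimal_count(friendly_cost=8000, n_candidates=100_000)
print(small, large)
```

Each candidate caught in the band would need to be covered by some heuristic before the Friendly solution wins on cost, which is the “ton of heuristics” problem in miniature.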
On the other hand, having developed enough heuristics, maybe we will see a pattern emerge and arrive at a better description of human utility functions. Specifying what we want, even if it proves very difficult, may still be simpler than adding all the “do not want” exceptions to the model. Maybe having a decent set of realistic “do not want” exceptions will help us discover what we really want. (By realistic I mean: actually generated by the AI’s attempts at a simple solution, simple as in Occam’s razor; not just pseudosolutions generated by an armchair philosopher.)
My intuitions for how to frame the problem run a little differently.
The way I see it, there is no possible way to block all unFriendly or model-breaking solutions, and it’s foolish to try. Try framing it this way: any given solution has some chance of breaking the models, probably pretty low. Call that probability P (god I miss LaTeX). The goal is to get Friendliness close enough to the top of the list that P (which ought to be roughly constant) times the distance D from the top of the list stays below whatever threshold we set as an acceptable risk to life on Earth / in our future light cone. In other words, we want to minimize the distance of the Friendly solutions from the top of the list, since each solution considered before finding the Friendly one brings a risk of breaking the models and returning a false ‘no-veto’.
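A minimal sketch of this framing (the values of P and the threshold are made up, and treating inspections as independent is my assumption, not anything established above): if each of the D plans inspected before the Friendly one breaks the models with probability P, the chance of at least one break is 1 − (1 − P)^D, which is approximately P × D when that product is small.

```python
# Sketch of the risk framing: per-plan break probability p, distance d
# (number of plans inspected before the Friendly one). The exact risk
# is 1 - (1 - p)**d; p * d is its small-value approximation.
# p and the threshold below are illustrative, not real estimates.

def mindhack_risk(p, d):
    """Probability of at least one model-break across d inspections."""
    return 1 - (1 - p) ** d

p = 1e-6          # made-up per-plan break probability
threshold = 1e-3  # made-up acceptable risk

for d in (10, 1_000, 100_000):
    r = mindhack_risk(p, d)
    print(d, r, r <= threshold)
```

The point of the framing survives the toy numbers: holding P fixed, the only lever left is shrinking D, the Friendly solution’s distance from the top of the list.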
Let’s go back to the grandmother-in-the-burning-building problem: if we assume that this problem has mind-hacking solutions closer to the top of the list than the Friendly solutions (which I kind of doubt, really), then we’ve got a problem. Let’s say, however, that our plan-generating system has a new module which generates simple heuristics for predicting how the models will vote (yes, this is getting a bit recursive for my taste). These heuristics would be simple rules of thumb like ‘models veto scenarios involving human extinction’, ‘models tend to veto scenarios involving screaming and blood’, ‘models veto scenarios that involve wires in their brains’.
When parsing potential answers, scenarios that violate these heuristics are discarded from the dataset early. In the case of the grandmother problem, a lot of the unFriendly solutions can be discarded using a relatively small set of such heuristics. It doesn’t have to catch all of them, but it isn’t meant to—simply by disproportionately eliminating unFriendly solutions before the models see them, you shrink the net distance to Friendliness. I see that as a much more robust way of approaching the problem.
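A sketch of the pre-filtering idea, with entirely hypothetical plans and tags: cheap veto heuristics discard obviously bad candidates before the models ever vote, which moves the Friendly plan up the surviving list even though the filter is far from exhaustive.

```python
# Sketch of heuristic pre-filtering for the grandmother problem.
# Plans, costs, and tags are all invented for illustration.

plans = [  # (description, cost, tags)
    ("rewire the rescuer's brain",        7996, {"wires_in_brain"}),
    ("start a second fire as distraction", 7997, {"screaming_and_blood"}),
    ("threaten bystanders into helping",   7998, {"screaming_and_blood"}),
    ("carry grandmother out the window",   8000, set()),  # the Friendly plan
]

veto_heuristics = {"human_extinction", "screaming_and_blood", "wires_in_brain"}

def prefilter(plans, vetoed_tags):
    """Drop plans matching any veto heuristic; keep cheapest-first order."""
    return [p for p in sorted(plans, key=lambda p: p[1])
            if not (p[2] & vetoed_tags)]

survivors = prefilter(plans, veto_heuristics)
# Distance to Friendliness: rank of the Friendly plan among survivors.
rank = [p[0] for p in survivors].index("carry grandmother out the window")
print(rank)  # 0 after filtering, versus 3 in the raw cost ordering
```

The filter needn’t catch every unFriendly plan; by disproportionately removing them before the models see anything, it shrinks the expected number of risky inspections.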