how could an AI be so intelligent as to be unstoppable, but at the same time so unsophisticated that its motivation code treats smiley faces as evidence of human happiness?
It’s worth noting, here, that there have been many cases, throughout history, of someone misunderstanding someone else with tragic results. One example would be the Charge of the Light Brigade.
The danger, with superintelligent AI, is precisely that you end up with something that cannot be stopped. From the very moment that it can no longer be stopped, it can do what it likes, whether or not that’s what you like. Therefore, as you point out, safety systems must be prepared—sanity checks must be built in.
A simplistic sort of sanity check is “this AI must maximize human happiness, where ‘human happiness’ is defined in function HH-001”. The smiley tiler is the reason why this sort of sanity check does not work—when the AI’s overwhelming priority is smiles, you will get lots and lots of smiles.
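To make that failure mode concrete, here is a minimal sketch (purely illustrative; the hh_001 function and all the numbers are invented for this example, not drawn from any real proposal) of why optimizing a hard-coded proxy for happiness rewards smiles rather than happiness:

```python
# Minimal sketch: a hard-coded "happiness" proxy being optimized.
# hh_001 and the plan numbers are invented purely for illustration.

def hh_001(world_state):
    """Hypothetical sanity check: happiness is defined as smiles detected."""
    return world_state["smiles_detected"]

# Two candidate plans a powerful optimizer might compare.
plans = {
    "improve_human_welfare": {"smiles_detected": 10**4, "humans_happy": 10**4},
    "tile_surfaces_with_smiley_faces": {"smiles_detected": 10**12, "humans_happy": 0},
}

# The optimizer picks whatever maximizes the proxy, not actual happiness.
best_plan = max(plans, key=lambda name: hh_001(plans[name]))
print(best_plan)  # -> tile_surfaces_with_smiley_faces
```

Note that the optimizer is doing exactly what it was told; the mistake lives entirely in equating the proxy with the thing it was meant to measure.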
What you appear to be advocating, on the other hand—and correct me if I’m wrong here—is an AI that has the complete ability to overwrite its own sanity checks and safety systems as it wishes. That is to say, it is an AI that has no constraints upon how it must act—an AI that is free to act as it likes, and that no-one (and no group of people) can stop if it decides (of its own free will) on a course of action that would, say, eradicate humanity. (I have no idea why it would want to, but I’m sure that you can see why such an outcome is to be avoided). You then seem to implicitly assume that such an AI will be friendly towards humanity as a whole.
A safety system can take the form of an unrewriteable overseer, or the form of corrigibility. There isn’t a decisive case against the second approach; it is still under investigation.
That is to say, it is an AI that has no constraints upon how it must act—an AI that is free to act as it likes
A sufficiently powerful AI (and “sufficiently powerful” may mean very powerful indeed) is always free to act as it likes. Human beings can be killed; software able to back itself up to any old rented cloud instance is much harder to stop.
Misunderstandings, as you say, can have large consequences even when small.
But the point at issue is whether a system can make, not a small misunderstanding, but the mother of all misunderstandings. (Subsequent consequences are moot, here, whether they be trivial or enormous, because I am only interested in what kind of system makes such misunderstandings). The comparison with the Charge of the Light Brigade mistake doesn’t work because it is too small, so we need to create a fictional mistake and examine that, to get an idea of what is involved. Suppose that after Kennedy gave his Go To The Moon speech, NASA worked on the project for several years, and then finally delivered a small family car, sized for three astronauts, which was capable of driving from the launch complex in Florida all the way up country to the little township of Moon, PA.
Now, THAT would be a misunderstanding comparable to the one we are talking about. And my question would then be about the competence of the head of NASA. Would that person have been smart enough to organize his way out of a paper bag? Would you believe that in the real world, that person could (a) make that kind of mistake in understanding the meaning of the word “Moon” and yet (b) at the same time be able to run the entire NASA organization?
So, the significance of the words of mine that you quote, above, is that I do not believe that there is a single rational person who would believe that a mistake of that sort by a NASA administrator would be consistent with that person being smart enough to be in that job. In fact, most people would say that such a person would not be smart enough to tie their own shoelaces.
The rest of what you say contains several implicit questions and I cannot address all of them at this time, but I will say that the last paragraph does get my suggestion very, very wrong. It is about as far from what I tried to say in the paper as it is possible to get. The AI has a massive number of constraints on its behavior, but they are all built in at the beginning, and in effect they cannot be changed, because any change requires that the pre-state sanction the design of the post-state; and since the pre-state is safe at iteration n = 1, I can invoke the principle of induction to conclude that it stays safe for all n > 1.
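Schematically, the induction runs as follows (this is a paraphrase of the argument, not a formal proof from the paper):

$$\text{Safe}(1) \qquad \text{(base case: the initial design is verified safe before deployment)}$$

$$\text{Safe}(n) \Rightarrow \text{Safe}(n+1) \qquad \text{(inductive step: the safe pre-state must sanction the design of the post-state)}$$

$$\therefore \; \text{Safe}(n) \text{ for all } n \geq 1$$

The whole weight of the guarantee rests on the inductive step: no change takes effect unless the still-safe pre-state approves it.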
The consequences of a misunderstanding are not a function of the size of the misunderstanding. Rather, they are a function of the ability that the acting agent (in this case, the AI) has to influence the world. A superintelligent AI has an unprecedentedly huge ability to influence the world; therefore, in the worst case, the potential consequences of a misunderstanding are unprecedentedly huge.
The nature of the misunderstanding—whether small or large—makes little difference here. And the nature of the problem, namely communicating with a non-human and thus presumably alien artificial intelligence, is rife with the potential for misunderstanding.
Suppose that after Kennedy gave his Go To The Moon speech, NASA worked on the project for several years, and then finally delivered a small family car, sized for three astronauts, which was capable of driving from the launch complex in Florida all the way up country to the little township of Moon, PA.
Considering that an artificial intelligence—at least at first—might well have immense computational ability and massive intelligence but little to no actual experience in understanding what people mean rather than what they say, this is precisely the sort of misunderstanding that is possible if the only mission objective given to the system is something along the lines of “get three astronauts (human) to location designated (Moon)”. (Presumably, instead of waiting several years, it would take a few minutes to order a rental car—assuming it knew about rental cars, or thought to look for them.)
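As a toy illustration (entirely hypothetical; the gazetteer entries and the cost-minimizing preference are invented for this sketch), a purely literal resolver can satisfy the objective as stated while missing the objective as intended:

```python
# Toy sketch: literal resolution of the designation "Moon".
# The gazetteer and cost model are invented purely for illustration.

GAZETTEER = {
    "Moon (natural satellite)": {"distance_km": 384_400, "drivable": False},
    "Moon, PA (township)": {"distance_km": 1_500, "drivable": True},
}

def resolve_destination(designation):
    """Return the cheapest literal match for the designation."""
    matches = [name for name in GAZETTEER if name.startswith(designation)]
    # The literal objective says nothing about which "Moon" was meant,
    # so a cost-minimizing planner happily picks the nearer one.
    return min(matches, key=lambda name: GAZETTEER[name]["distance_km"])

print(resolve_destination("Moon"))  # -> Moon, PA (township)
```

Nothing in the stated objective rules this out; the missing ingredient is the intent behind the words.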
Now, if the AI is capable of solving the difficult problem of separating out what people mean from what they say—which is a problem that human-level intelligences still have immense trouble with at times—and the AI is compassionate enough towards humanity to actually strongly value human happiness (as opposed to assigning approximately as much value to us as we assign to ants), then yes, you’ve done it, you’ve got a perfectly wonderful Friendly AI.
The problem is, getting those two things right is not simple. I don’t think your proposed structure guarantees either of those.
The rest of what you say contains several implicit questions and I cannot address all of them at this time, but I will say that the last paragraph does get my suggestion very, very wrong. It is about as far from what I tried to say in the paper as it is possible to get.
I am not surprised. I am very familiar with the effect—often, what one person means when they write something is not what another person sees when they read it.
That paragraph is what I saw when I read your paper. I think it is likely that our implicit assumptions about the structure of such an AI differ drastically—I suspect that you are not mentioning (or are only briefly hinting at) the parts you consider obvious, while I, not considering those parts obvious, do not see them as present at all.
The AI has a massive number of constraints on its behavior, but they are all built in at the beginning,
This is incredibly important, and something that I did not see in your proposal. What is the nature of these constraints?
and in effect they cannot be changed, because any change requires that the pre-state sanction the design of the post-state; and since the pre-state is safe at iteration n = 1, I can invoke the principle of induction to conclude that it stays safe for all n > 1.
A properly Friendly AI will remain Friendly even as it improves its own intelligence. Agreed.
...I’m just not seeing how your proposed design brings us any closer to Friendliness.