And they all share a curious pattern. Even though the computer can destroy itself without complaint, and even salvage itself for spare parts if matter is scarce, it never seems to exhibit any instability of values.
Only by virtue of action being defined as the result of undisturbed computation, which basically means that brain surgery is prohibited by the problem statement; otherwise the agent is mistaken about its own nature (i.e. the agent’s decision won’t make true the statement that the agent thinks it would make true). Stability of values is more relevant when you consider replacing algorithms, evaluating the expected actions of a different agent.
Well, obviously there has to be some amount of non-disturbed calculation at the start—the AI hardly has much chance if you nuke it while its Python interpreter is still loading up. But the first (and only) action that the AI returns may well result in the construction of another AI that’s better shielded from uncertainty about the universe and shares the same utility function. (For example, our initial AI has no concept of “time”—it outputs a result that’s optimized to work with the starting configuration of the universe as far as it knows—but that second-gen AI will presumably understand time, and other things besides.) I think that’s what will actually happen if you run a machine like the one I described in our world.
ETA. So here’s how you get goal stability: you build a piece of software that can find optima of utility functions, feed it your preferred utility function and your prior about the current state of the universe along with a quined description of the machine itself, give it just-barely-powerful-enough actuators, and make it output one single action. Wham, you’re done.
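For concreteness, the recipe above can be sketched in a few lines of toy Python. All names here are illustrative, and the quined self-description and the real optimizer are replaced by stubs; this is a sketch of the shape of the setup, not an implementation:

```python
# Toy sketch of the one-shot machine: evaluate every candidate action
# against a fixed world model and utility function, output the single
# best action, and halt. (The quining step is elided entirely.)

def predict(world, action):
    """Stub world model: the action just shifts the world's state."""
    return world + action

def utility(world):
    """Stub utility function: prefer final states close to 10."""
    return -abs(world - 10)

def one_shot_ai(world, candidate_actions):
    # Pick the action whose predicted outcome maximizes utility...
    best = max(candidate_actions, key=lambda a: utility(predict(world, a)))
    # ...output it exactly once; no observation, no second action.
    return best

print(one_shot_ai(3, range(-5, 8)))  # -> 7
```

The single call to `one_shot_ai` is the machine’s whole lifetime; everything that happens after that one output, successor-building included, has to be packed into the action itself.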
More generally, you can think of the AI as a one-decision construction that never observes anything and just outputs the program that is to be run next. It’s up to the AI to design a good next program; you, as the AI’s designer, only need to make sure that the AI constructs a highly optimized next program while running on protected hardware. This way, knowledge of physics, physical protection of the AI’s hardware, and self-modification are not your problem; they’re the AI’s.
The problem with this plan is that your AI needs to be able not just to construct an optimized next program, but to construct a next program that is optimized enough, and it is you who must make sure that’s possible. If you know that your AI is strong enough, then you’re done, but you generally don’t; and if your AI constructs a slightly suboptimal successor, and that successor does something a little bit stupid as well, then so it goes, and by the trillionth step the world dies (if not just the AI).
Which is why it’s a good idea not just to say that the AI is to do something optimized, but to have a more detailed idea of what exactly it could do, so that you can make sure it’s headed in the right direction without deviating from the goal. This is the problem of stable goal systems.
Your CA setup does nothing of the sort, and so makes no guarantees. The program is vulnerable not just while it’s loading.
All very good points. I completely agree. But I don’t yet know how to approach the harder problem you state. If physics is known perfectly and the initial AI uses a proof checker, we’re done, because math stays true even after a trillion steps. But unknown physics could always turn out to be malicious in exactly the right way to screw up everything.
If physics is known perfectly and the first generation uses a proof checker to create the second, we’re done.
No, since you still run the risk of tiling the future with problem-solving machinery of no terminal value that never actually decides, and kills everyone in the process; it might even come to a good decision afterwards, but it’ll be too late for some of us. This is the Friendly AI of Doom that visibly cares only about Friendliness staying provable, not about people, because it’s not yet ready to make a Friendly decision.
Also, an FAI must already know physics perfectly (with uncertainty parametrized by observations). This is the problem of induction: observations are always interpreted according to a preexisting cognitive algorithm (more generally, a logical theory). If the AI doesn’t have the same theory of the environment as we do, it’ll draw different conclusions about the nature of the world than we would, given the same observations, and that’s probably not for the best if it’s to make optimal decisions according to what we consider real. Just as no moral arguments can persuade an AI to change its values, no observations can persuade an AI to change its idea of reality.
But unknown physics could always turn out to be malicious in exactly the right way to screw up everything.
The presence of uncertainty is rarely a valid argument against the possibility of making an optimal decision. You just make the best decision you can find given the uncertainty you’re dealt. Uncertainty is part of the problem anyway, and can be treated with precision as well.
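The point about treating uncertainty “with precision” can be made concrete with a toy expected-utility calculation. The hypotheses and payoff numbers below are made up purely for illustration: uncertainty over physics just becomes part of the objective, as an expectation over a prior on world states.

```python
# Toy illustration: the best decision under uncertainty is the one that
# maximizes utility in expectation over the agent's prior.

def expected_utility(action, prior, utility):
    # prior: dict mapping world-hypothesis -> probability
    return sum(p * utility(action, w) for w, p in prior.items())

def best_action(actions, prior, utility):
    return max(actions, key=lambda a: expected_utility(a, prior, utility))

# Two hypotheses about physics, with an invented payoff table.
prior = {"physics_A": 0.7, "physics_B": 0.3}
payoff = {("safe", "physics_A"): 1, ("safe", "physics_B"): 1,
          ("bold", "physics_A"): 3, ("bold", "physics_B"): -10}

print(best_action(["safe", "bold"], prior,
                  lambda a, w: payoff[(a, w)]))  # -> safe
```

The “bold” action wins under the favored hypothesis but is ruled out by its expected cost under the other; nothing about the residual uncertainty prevents the choice from being well-defined.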
Also, an interesting thing happens if, by the whim of its creator, the computer is given the goal of tiling the universe with the most common still life in it, and the universe is possibly infinite. We can expect the computer to send out a slower-than-light “investigation front” to count the still lifes it encounters. Meanwhile it will have more and more space to devote to predicting possible threats to its mission. If it is sufficiently advanced, it will notice the possibility of other agents existing, which will naturally lead it to simulate possible interactions with non-still-life, and to the idea that it could be deceived into believing that its “investigation front” has reached the borders of the universe. Etc...
Thank you. It is something I can use for improvement.
Can you point at the flaws? I can see that the structure of my sentences is overcomplicated, but I don’t know how it comes across to native English speakers. Foreigner? Dork? Grammatically illiterate? I appreciate any feedback. Thanks.
Actually, a bit of all three. The one you can control the most is probably “dork”, which unpacks as “someone with complex ideas who is too impatient/show-offy to explain their idiosyncratic jargon”.
I’m a native English speaker, and I know that I still frequently sound “dorky” in that sense when I try to be too succinct.
Also, an interesting thing happens if, by the whim of its creator, the computer is given the goal of tiling the universe with the most common still life in it, and the universe is possibly infinite.
Respectfully, I don’t know what this sentence means. In particular, I don’t know what “most common still life” meant. That made it difficult to decipher the rest of the comment.
ETA: Thanks to the comment below, I understand a little better, but now I’m not sure what motivates invoking the possibility of other agents, given that the discussion was about proving Friendliness.
Too smart to optimize.
One year and one level-up (thanks to ai-class.com) after this comment, I’m still in the dark about why the comment above was downvoted.
I’m sorry for whining, but my curiosity got the better of me. Any comments?
It wasn’t me, but I suspect the poor grammar didn’t help. It makes it hard to understand what you were getting at.
That’s valuable information, thanks. I underestimated the relative weight of communication style in the feedback I got.
In a cellular automaton, a still life is a pattern of cells which stays unchanged after each iteration.
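The definition can be checked directly in Conway’s Game of Life with a few lines of Python (a sketch; the patterns used are the standard 2×2 “block” and the “blinker”):

```python
from collections import Counter

def step(live):
    """One Game of Life step; `live` is a set of (x, y) live cells."""
    neighbours = Counter((x + dx, y + dy)
                         for (x, y) in live
                         for dx in (-1, 0, 1)
                         for dy in (-1, 0, 1)
                         if (dx, dy) != (0, 0))
    # A cell is live next step if it has 3 live neighbours,
    # or 2 live neighbours and was already live.
    return {c for c, n in neighbours.items()
            if n == 3 or (n == 2 and c in live)}

def is_still_life(pattern):
    # A still life is exactly a pattern fixed by one update step.
    return step(pattern) == pattern

block = {(0, 0), (0, 1), (1, 0), (1, 1)}  # the 2x2 "block" still life
blinker = {(0, 0), (1, 0), (2, 0)}        # oscillates with period 2
print(is_still_life(block), is_still_life(blinker))  # -> True False
```

The block survives unchanged, so it qualifies; the blinker changes shape each step, so it does not.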
Since you asked: your downvoted comment reads like word salad to me, and I can’t see what sensible reasoning would motivate it.